
Why does Subversion give some of my UTF-8 text files content type "application/octet-stream"?

Tags: svn, encoding

I have a handful of UTF-8-encoded text files (with text in Japanese) that I added to a Subversion repository.

To my surprise, one of them had the svn:mime-type property automatically set to application/octet-stream, while the others did not get any specific encoding information.

The files are valid UTF-8; the file command reports "UTF-8 Unicode text, with CRLF line terminators" for all of them.

What is going on here? How does Subversion decide if a file should be treated as binary or not?

Asked by Anders Lindahl, Oct 28 '25 13:10



1 Answer

I found the explanation in the Subversion sources, in svn_io_is_binary_data:

/* Right now, this function is going to be really stupid.  It's
  going to examine the block of data, and make sure that 15%
  of the bytes are such that their value is in the ranges 0x07-0x0D
  or 0x20-0x7F, and that none of those bytes is 0x00.  If those
  criteria are not met, we're calling it binary.

  NOTE:  Originally, I intended to target 85% of the bytes being in
  the specified ranges, but I flubbed the condition.  At any rate,
  folks aren't complaining, so I'm not sure that it's worth
  adjusting this retroactively now.  --cmpilato  */

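With Japanese text in UTF-8, most code points are encoded as three bytes, each of which is >= 0x80. Such a file can easily end up with fewer than 15% of its bytes in the listed ranges, so it is classified as binary.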

The reason more of my files did not trigger this behavior is that they contain a small preamble of characters in the ASCII range.
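To make the effect concrete, here is a minimal Python sketch of the heuristic as the quoted comment describes it. It is not Subversion's actual code: the function name looks_binary, the min_text_ratio parameter, the handling of empty input, and the sample strings are all assumptions made for illustration.

```python
def looks_binary(block: bytes, min_text_ratio: float = 0.15) -> bool:
    """Rough re-creation of the check described in the quoted comment.

    Hypothetical helper, not part of Subversion. A block counts as binary
    if it contains a 0x00 byte, or if fewer than min_text_ratio of its
    bytes fall in the ranges 0x07-0x0D or 0x20-0x7F.
    """
    if not block:
        return False  # assumption: empty data is not flagged as binary
    if 0 in block:
        return True
    text_like = sum(
        1 for b in block if 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F
    )
    return text_like / len(block) < min_text_ratio


# 30 Japanese characters, 3 bytes each in UTF-8, every byte >= 0x80.
japanese_only = ("日本語" * 10).encode("utf-8")

# The same text after a short ASCII preamble (22 of 112 bytes are ASCII).
with_preamble = "Subversion test file\r\n".encode("utf-8") + japanese_only

print(looks_binary(japanese_only))   # True
print(looks_binary(with_preamble))   # False
```

In this sketch the ASCII preamble alone lifts the share of "text-like" bytes from 0% to roughly 20%, which is enough to get past the 15% threshold.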

Answered by Anders Lindahl, Oct 30 '25 14:10
