I just want to check my own sanity with this question here. I have a filename which has a + (plus) character in it, which is perfectly valid on some operating systems and filesystems (e.g. macOS and HFS+).
However, I am seeing an issue where I think that java.io.File#toURI() is not operating correctly.
For example:
new File("hello+world.txt").toURI().toString()
On my Mac machine returns:
file:/Users/aretter/code/rocksdb/hello+world.txt
However IMHO, that is not correct, because the + (plus) character from the filename has not been encoded in the URI. The URI does not represent the original filename at all: a + in a URI has a very different meaning to a + character in a filename.
So if we decode the URI, the plus will now be replaced with a ' ' (space) character, and we have lost information, e.g.:
URLDecoder.decode(new File("hello+world.txt").toURI().toURL().toString())
Which results in:
file:/Users/aretter/code/rocksdb/hello world.txt
What I would have expected instead would be something like:
new File("hello+world.txt").toURI().toString()
resulting in:
file:/Users/aretter/code/rocksdb/hello%2Bworld.txt
So that when it is later used and decoded the plus sign is preserved.
I am struggling to believe that such an obvious bug could be present in Java SE. Can someone point out where I am mistaken?
Also, if there is a workaround, I would like to hear about it, please. Keep in mind that I am not actually providing static strings as filenames to File, but rather reading a directory of files from disk, some of which may contain a + (plus) character.
Let me try to clarify,
The '+' (plus) character is treated as a normal character in the path of a URL, so it is not percent-encoded there (e.g. as %2B).
So when you call new File("hello+world.txt").toURI().toString(), no encoding is performed for the '+' character (simply because it is not required).
Now to URLDecoder: this class is a utility class for HTML form decoding. It treats '+' as an encoded character and hence decodes it to a ' ' (space) character. In your example, this class treats the URI's string value as a normal HTML form field value (not as a URI). It should never be used to decode a full URI/URL value, as it is not designed for that purpose.
From java docs of URLDecoder#decode(String),
Decodes a x-www-form-urlencoded string. The platform's default encoding is used to determine what characters are represented by any consecutive sequences of the form "%xy".
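To make the difference concrete, here is a minimal sketch comparing URI-aware decoding with form decoding. The class and method names (PlusInFileUri, roundTripName, formDecode) are made up for illustration; only java.io.File, java.net.URI and java.net.URLDecoder come from the JDK.

```java
import java.io.File;
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical helper class, for illustration only.
class PlusInFileUri {

    // Converts a file name to a URI and back using URI-aware APIs.
    // new File(URI) decodes percent-escapes but leaves '+' untouched.
    static String roundTripName(String fileName) {
        URI uri = new File(fileName).toURI();
        return new File(uri).getName();
    }

    // Applies HTML form decoding to the URI string, as in the question.
    // This is the misuse: form decoding turns '+' into a space.
    static String formDecode(String fileName) {
        String uriString = new File(fileName).toURI().toString();
        try {
            return URLDecoder.decode(uriString, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The plus sign survives the URI round trip.
        System.out.println(roundTripName("hello+world.txt"));
        // The plus sign is lost when the URI is treated as form data.
        System.out.println(formDecode("hello+world.txt"));
    }
}
```

For the directory-listing case in the question, the same idea should apply: turn each URI back into a file with new File(uri) (or Paths.get(uri)), or read uri.getPath(), rather than passing the URI string through URLDecoder.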
Hope it helps.
Update #1 based on comments:
As per section 2.2 of RFC 3986, if data for a URI component would conflict with a reserved character, then the conflicting data must be percent-encoded before the URI is formed.
It is also important that different parts of a URI have different sets of reserved characters, depending on their context. For example, the / sign is reserved only in the path part of a URI, and the + sign is reserved in the query string part. So there is no need to escape / in the query part and, similarly, there is no need to escape + in the path part.
In your example, the URI producer File.toURI does not encode the '+' sign in the path part of the URI (since '+' is not considered a reserved character in the path part), so you see the '+' sign in the URI's string representation.
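As a rough check of this behaviour, the sketch below looks at the raw (still encoded) path that File.toURI produces for a name containing a space, a '+' and a '%', and at what the multi-argument java.net.URI constructor does with '?' and '#' in a path. The class name PathEncodingSketch, its helper methods and the sample file names are hypothetical.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, for illustration only.
class PathEncodingSketch {

    // Raw (still percent-encoded) path of the URI built by File.toURI().
    static String rawPathOf(String fileName) {
        return new File(fileName).toURI().getRawPath();
    }

    // Builds a file URI from components. The multi-argument URI constructor
    // quotes characters that are not legal in the path component.
    static String rawPathFromComponents(String path) {
        try {
            return new URI("file", null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Space and '%' are escaped, '+' is left as-is:
        // the output ends with /hello%20world+100%25.txt
        System.out.println(rawPathOf("hello world+100%.txt"));

        // '?' and '#' would otherwise start the query and fragment,
        // so they are escaped in the path: /tmp/what%3F%23+.txt
        System.out.println(rawPathFromComponents("/tmp/what?#+.txt"));
    }
}
```

So characters that would be ambiguous in a path are escaped, while '+' is passed through because it has no special role there.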
You may refer to the URI recommendation for more details.