I'm using URLDecoder to decode a string:
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
URLDecoder.decode("%u6EDA%u52A8%u8F74%u627F", StandardCharsets.UTF_8.name());
This crashes with:
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u6"
at java.net.URLDecoder.decode(URLDecoder.java:194)
at Playground$.delayedEndpoint$Playground$1(Playground.scala:45)
at Playground$delayedInit$body.apply(Playground.scala:10)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Playground$.main(Playground.scala:10)
at Playground.main(Playground.scala)
It seems that %u6 and %u8 are not allowed in the string. I've tried to read up on what these symbols are, without success. I found the string in a dataset, in a field called "page title field", so I suspect these are encoded characters; I just don't know which encoding. Does anyone know what these symbols are and which encoding I should use to decode them?
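Looks like a non-standard UTF-16-based encoding of "滚动轴承", which is Chinese for "rolling bearings". Each %u is followed by four hex digits that give one UTF-16 code unit; this is the format that JavaScript's legacy escape() function produces.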
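One way to check this reading is to go in the opposite direction and write each UTF-16 code unit of the suspected text as %u plus four hex digits. A minimal sketch; the names PercentUCheck and encode are made up for illustration and are not part of any library:

```scala
object PercentUCheck {
  // Hypothetical helper: writes every UTF-16 code unit of s
  // as a percent sign, the letter u, and four uppercase hex digits.
  def encode(s: String): String =
    s.map(c => "%%u%04X".format(c.toInt)).mkString
}
```

Calling PercentUCheck.encode("滚动轴承") gives back the string from the question, which supports the UTF-16 interpretation.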
I'd suggest replacing every %u with \u, and then using StringEscapeUtils from Apache Commons Lang:
import org.apache.commons.lang3.StringEscapeUtils
// Use replace, not replaceAll: replaceAll treats a backslash in the replacement string as an escape and would drop it
val unescapedJava = StringEscapeUtils.unescapeJava(str.replace("%u", "\\u"))
URLDecoder.decode(unescapedJava, StandardCharsets.UTF_8.name())
This should handle both kinds of escaping:
% followed by two hex digits is unaffected by the replacement and by unescapeJava, and is left for URLDecoder to decode in the second step.
%u escapes are treated specially: they are replaced by \u and eliminated by unescapeJava in the first step.
If (and only if) you are absolutely certain that all code points got encoded in this way, then you can do without StringEscapeUtils:
new String(
"%u6EDA%u52A8%u8F74%u627F"
.replaceAll("%u", "")
.grouped(4)
.map(Integer.parseInt(_, 16).toChar)
.toArray
)
which produces
res: String = 滚动轴承
but I'd advise against it, because this method breaks down for inputs like "%u6EDA%u52A8%u8F74%u627Fcafebabe" that contain unescaped characters: the trailing "cafebabe" would be silently parsed as two more hex code units.
Better to use a reliable library method that handles all corner cases.
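If adding Apache Commons is not an option, the same two-step idea can be sketched with only the standard library. This is a rough sketch rather than a vetted implementation; the names PercentUDecoder and decode are made up for illustration. It assumes that every %u is followed by exactly four hex digits, and that the remaining %XX escapes are UTF-8:

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets
import scala.util.matching.Regex

object PercentUDecoder {
  // A percent sign, the letter u, and exactly four hex digits (one UTF-16 code unit).
  private val PercentU: Regex = "%u([0-9A-Fa-f]{4})".r

  def decode(s: String): String = {
    // Step 1: replace each four-digit escape with the char it denotes.
    // quoteReplacement keeps '$' and backslash chars literal in the replacement.
    val step1 = PercentU.replaceAllIn(s, m =>
      Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))
    // Step 2: let URLDecoder handle the ordinary two-digit escapes (and '+').
    URLDecoder.decode(step1, StandardCharsets.UTF_8.name())
  }
}
```

Unlike the grouped(4) version, unescaped text such as the trailing "cafebabe" passes through unchanged. Some caveats remain: URLDecoder turns '+' into a space, a malformed escape left over after step 1 still throws IllegalArgumentException, and a %u escape that itself decodes to '%' or '+' would be decoded a second time in step 2.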