I have some character strings that I'm getting from an HTML page. It turns out these strings contain some hidden characters or control characters (?).
How can I convert this string so that it only contains the visible characters?
Take for example the term "Besucherüberblick" and its raw representation:
charToRaw("Besucherüberblick")
[1] 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b
However, from my html, I'm getting:
[1] e2 80 8c 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b
So there are these three weird thingies at the beginning.
I could probably trial and error and manually remove these from my raw vector and then convert it back to character, but a) I don't know in advance which strings the html will give me and b) I'm looking for an automated solution.
Maybe there's some stringr/stringi solution to it?
Those first three bytes (e2 80 8c) are the UTF-8 encoding of U+200C, the zero-width non-joiner Unicode character. You can remove it, along with all other invisible formatting characters, using the \p{Format} regular expression class (general category Cf), which contains the invisible formatting indicators. That class holds roughly 160 characters.
x <- rawToChar(as.raw(c(226, 128, 140, 66, 101, 115, 117, 99, 104, 101, 114, 195, 188,
98, 101, 114, 98, 108, 105, 99, 107)))
x
# [1] "Besucherüberblick"
charToRaw(x)
# [1] e2 80 8c 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b
y <- stringr::str_remove_all(x, "[\\p{Format}]")
y
# [1] "Besucherüberblick"
charToRaw(y)
# [1] 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b
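Since the strings aren't known in advance, the same idea can be wrapped in a small vectorized helper. This is only a sketch: the name strip_invisible and the sample vector are made up for illustration, and it assumes the stringi package is installed. It uses \p{Cf}, the short name for the Format category.

```r
library(stringi)

# Hypothetical helper: remove every character in Unicode category Cf
# (invisible formatting characters) from each element of a character vector.
strip_invisible <- function(x) {
  stri_replace_all_regex(x, "\\p{Cf}", "")
}

# Made-up sample input: a leading zero-width non-joiner (U+200C),
# a clean string, and a string containing a soft hyphen (U+00AD).
labels <- c("\u200cBesucher\u00fcberblick", "plain", "soft\u00adhyphen")

# Flag which elements contain at least one format character.
stri_detect_regex(labels, "\\p{Cf}")
# [1]  TRUE FALSE  TRUE

clean <- strip_invisible(labels)
nchar(labels)
# [1] 18  5 11
nchar(clean)
# [1] 17  5 10
```

Checking with stri_detect_regex first is optional, but it can help to see how many of the scraped strings are affected before changing them.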
Another good choice might be \p{Other} if you also want to remove control characters, unassigned code points, etc. It matches all of the following categories: \p{Control} (an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F, which includes tabs and newline characters), \p{Format} (invisible formatting indicators), \p{Private_Use} (any code point reserved for private use), \p{Surrogate} (one half of a surrogate pair in UTF-16 encoding) and \p{Unassigned} (any code point to which no character has been assigned). Keep in mind that, because it covers \p{Control}, removing \p{Other} also strips tabs and newlines from your strings.
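To make the difference between the two classes concrete, here is a minimal sketch on a made-up string that contains a zero-width non-joiner, a tab and a newline. It assumes the stringi package and uses the short category names \p{Cf} (Format) and \p{C} (Other).

```r
library(stringi)

# Made-up input: U+200C, then text with a tab and a trailing newline.
s <- "\u200ccol1\tcol2\n"

# Format only: the zero-width non-joiner goes, tab and newline stay.
format_only <- stri_replace_all_regex(s, "\\p{Cf}", "")
format_only
# [1] "col1\tcol2\n"

# Other: the tab and newline are control characters, so they go as well.
all_other <- stri_replace_all_regex(s, "\\p{C}", "")
all_other
# [1] "col1col2"
```

If whitespace matters for your data, \p{Format} is the more conservative choice.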