I have an R tibble with a UTF-8 character column. When I print the contents of this column for a certain problematic record, everything looks fine: one two three. There are, however, problems when I try to use this string in an RDBMS query which I construct in R and send to the database.
If I copy this string to Notepad++ and convert the encoding to ANSI, I can see that the string actually contains some additional characters that cause the problem: one â€two‬ three.
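The hidden characters can also be listed from within R, without a round trip through an editor. A minimal sketch, assuming a hypothetical stand-in string: the mojibake above is consistent with characters from the U+2000 block, such as the directional marks U+202A and U+202C, but the exact code points in the real data are a guess.

```r
# Hypothetical stand-in for the problematic value; the actual hidden
# code points in the real data are not known and are assumed here.
my_string <- "one \u202Atwo\u202C three"

# utf8ToInt() returns the Unicode code point of every character
# in a single string.
code_points <- utf8ToInt(my_string)

# Keep only the non-ASCII code points and format them as U+XXXX.
non_ascii <- code_points[code_points > 127]
sprintf("U+%04X", non_ascii)
# [1] "U+202A" "U+202C"
```

Knowing the exact code points also makes it possible to post a reproducible example using "\uXXXX" escapes.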
A partial solution that works is conversion to ASCII:
iconv(my_string, "UTF-8", "ASCII", sub = "")
but all non-ASCII characters are lost this way, not only the invisible ones.
Conversion from UTF-8 to UTF-8 doesn't solve my problem:
iconv(my_string, "UTF-8", "UTF-8", sub = "")
Is it possible to remove all invisible characters like the ones above without losing the UTF-8 encoding? That is: how can I convert my string to the form that I see when I print it out in R (without hidden parts)?
I'm not sure I completely understand what you are trying to do, but you can use stringi or stringr to specify explicitly which characters you want to retain. For your example, it could look something like this. You may have to expand the set of characters you want to retain, but this approach is one option:
library(stringr)
my_string <- "one â€two‬ three"
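# Keep only ASCII letters, digits, punctuation, and whitespace.
# Note: this also drops legitimate non-ASCII letters such as "é".
# Inside [...], "|" is a literal character and "A-z" also spans "[\]^_`",
# so the ranges are written out without separators.
pattern <- "[^A-Za-z0-9[:punct:]\\s]"
str_remove_all(my_string, pattern)
[1] "one two three"
# Check that the result is still valid UTF-8
stringi::stri_enc_isutf8(str_remove_all(my_string, pattern))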
[1] TRUE
EDIT: I do want to note that you should check and see how robust this approach is. I have not dealt with invisible characters often so this may not be the best way to go about removing them.
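An allow-list like the one above also removes accented and other non-ASCII letters, which the question wants to keep. One alternative is a deny-list that targets only Unicode "format" characters (general category Cf), which covers zero-width spaces, directional marks, and the byte order mark. This is a minimal sketch, not a tested fix for the original data: the input string is hypothetical, and it assumes the hidden characters really are in category Cf and that R's PCRE was built with Unicode property support, as standard builds are.

```r
# Hypothetical input: directional marks (U+202A, U+202C) around "two",
# plus a legitimate non-ASCII letter (U+00E9) that must survive.
my_string <- "one \u202Atwo\u202C thr\u00E9e"

# \p{Cf} matches Unicode format characters; perl = TRUE selects the
# PCRE engine, which understands Unicode property classes.
clean <- gsub("\\p{Cf}", "", my_string, perl = TRUE)

nchar(my_string)  # 15
nchar(clean)      # 13

# The accented letter is still there.
grepl("\u00E9", clean, fixed = TRUE)  # TRUE
```

With stringr, str_remove_all(my_string, "\\p{Cf}") should behave the same way. Be aware that category Cf also contains the zero-width joiner and non-joiner, which are meaningful in emoji sequences and in some scripts, so this pattern may be too aggressive for such text.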
You haven't given us a way to construct your bad string so I can't test this on your data, but it works on this example.
badString <- "one \u200Btwo\u200B three"
chars <- strsplit(badString, "")[[1]] # Assume badString has one entry; if not, add a loop
chars <- chars[nchar(chars, type = "width") > 0]
goodString <- paste(chars, collapse = "")
Both badString and goodString look the same when printed:
> badString
[1] "one two three"
> goodString
[1] "one two three"
but they have different numbers of characters:
> nchar(badString)
[1] 15
> nchar(goodString)
[1] 13
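For a whole column rather than a single string, the same idea can be wrapped in a function instead of an explicit loop. This is a sketch under the same assumptions as above; strip_zero_width is a hypothetical helper name, and NA values are not handled specially.

```r
# Hypothetical helper: apply the zero-width filter to every element
# of a character vector. NA handling is deliberately left out.
strip_zero_width <- function(x) {
  vapply(
    strsplit(x, ""),
    function(chars) {
      # Drop characters whose display width is zero, then rejoin.
      paste(chars[nchar(chars, type = "width") > 0], collapse = "")
    },
    character(1),
    USE.NAMES = FALSE
  )
}

bad <- c("one \u200Btwo\u200B three", "plain")
good <- strip_zero_width(bad)

nchar(bad)   # 15  5
nchar(good)  # 13  5
```

Note that combining marks also have zero display width, so text that stores accents as separate combining characters could lose them with this filter.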