remove invisible characters from a UTF-8 string

Question

I have an R tibble with a UTF-8 character column. When I print the contents of this column for a certain problematic record, everything looks fine: one ‭two‬ three. However, the string causes problems when I use it in an RDBMS query that I construct in R and send to the database.

If I copy this string to Notepad++ and convert the encoding to ANSI, I can see that the string actually contains some additional characters that cause the problem: one â€twoâ€¬ three.

A partial solution that works is conversion to ASCII: iconv(my_string, "UTF-8", "ASCII", sub = ""), but this drops all non-ASCII characters.

Conversion from UTF-8 to UTF-8 doesn't solve my problem: iconv(my_string, "UTF-8", "UTF-8", sub = "").

Is it possible to remove all invisible characters like the ones above without losing the UTF-8 encoding? That is: how can I convert my string to the form that I see when I print it out in R (without hidden parts)?

Andrew · Accepted Answer

Not sure I completely understand what you are trying to do, but you can use stringi or stringr to explicitly specify what characters you want to retain. For your example, it could look something like this. You may have to expand the characters you want to retain, but this approach is one option:

library(stringr)

my_string <- "one â€twoâ€¬ three"

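# Keep only ASCII letters, digits, punctuation, and whitespace.
# The backslash must be doubled inside an R string literal ("\\s"),
# and | is not a separator inside a character class, so it is dropped.
# A-Za-z replaces A-z, which also spans the symbols between Z and a.
str_remove_all(my_string, "[^A-Za-z0-9[:punct:]\\s]")
[1] "one two three"

# Just checking
stringi::stri_enc_isutf8(str_remove_all(my_string, "[^A-Za-z0-9[:punct:]\\s]"))
[1] TRUE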

EDIT: I do want to note that you should check and see how robust this approach is. I have not dealt with invisible characters often so this may not be the best way to go about removing them.
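
A narrower variant of the same idea, offered as a sketch rather than a tested fix for the asker's data: instead of listing the characters to keep, remove only the Unicode "format" characters (general category Cf), so that non-ASCII letters survive. The sample string is an assumption: the bytes shown in the question appear to correspond to U+202D and U+202C, and "café" is added only to show that accented letters are retained.

```r
library(stringr)

# Assumed reconstruction of the problematic string: U+202D (left-to-right
# override) and U+202C (pop directional formatting) wrapped around "two".
# "caf\u00E9" is added for illustration only.
my_string <- "one \u202Dtwo\u202C three caf\u00E9"

# \p{Cf} matches Unicode format characters, which have no visible glyph.
cleaned <- str_remove_all(my_string, "\\p{Cf}")

nchar(my_string)  # 20
nchar(cleaned)    # 18
cleaned == "one two three caf\u00E9"  # TRUE
```

Category Cf also covers the soft hyphen (U+00AD) and the zero-width space (U+200B), so check whether any of those are meaningful in your data before removing them wholesale.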

user2554330 · Answer

You haven't given us a way to construct your bad string so I can't test this on your data, but it works on this example.

badString <- "one \u200Btwo\u200B three"

chars <- strsplit(badString, "")[[1]]  # Assume badString has one entry; if not, add a loop

chars <- chars[nchar(chars, type = "width") > 0]

goodString <- paste(chars, collapse = "")
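
If the input has more than one entry, the loop mentioned in the comment above could look something like the following sketch. The function name drop_zero_width is hypothetical, and NA values are not handled here.

```r
# Hypothetical helper: apply the zero-width filter to each element
# of a character vector and return a vector of the same length.
drop_zero_width <- function(x) {
  vapply(
    strsplit(x, ""),
    function(chars) {
      # Keep only the characters that occupy at least one display column.
      paste(chars[nchar(chars, type = "width") > 0], collapse = "")
    },
    character(1),
    USE.NAMES = FALSE
  )
}

cleaned_vec <- drop_zero_width(c("one \u200Btwo\u200B three", "plain"))
nchar(cleaned_vec)
```
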

Both badString and goodString look the same when printed:

> badString
[1] "one two three"
> goodString
[1] "one two three"

but they have different numbers of characters:

> nchar(badString)
[1] 15
> nchar(goodString)
[1] 13
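
To see which characters account for the difference in length, one option is to list the code points of the string with base R. This is only a diagnostic sketch. The cutoff of 127 simply separates ASCII from everything else, so on real data it will also list legitimate non-ASCII letters.

```r
badString <- "one \u200Btwo\u200B three"

# Convert the string to its Unicode code points (one integer per character).
code_points <- utf8ToInt(badString)

# Show only the non-ASCII code points in the usual U+XXXX notation.
non_ascii <- code_points[code_points > 127]
labels <- sprintf("U+%04X", non_ascii)
labels
```

For the example string this prints U+200B twice, the zero-width space that the filter above removes.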

Tags:

r

character-encoding

tomaz
