remove invisible characters from a UTF-8 string

I have an R tibble with a UTF-8 character column. When I print the contents of this column for a certain problematic record, everything looks fine: one ‭two‬ three. However, problems arise when I use this string in an RDBMS query that I construct in R and send to the database.

If I copy this string to Notepad++ and convert the encoding to ANSI, I can see that the string actually contains some additional characters that cause the problem: one ‭two‬ three.

A partial solution that works is conversion to ASCII: iconv(my_string, "UTF-8", "ASCII", sub = ""), but this drops all non-ASCII characters.

Conversion from UTF-8 to UTF-8 doesn't solve my problem: iconv(my_string, "UTF-8", "UTF-8", sub = "").

Is it possible to remove all invisible characters like the ones above without losing the UTF-8 encoding? That is: how can I convert my string to the form that I see when I print it out in R (without hidden parts)?

tomaz asked Nov 01 '25 04:11



2 Answers

Not sure I completely understand what you are trying to do, but you can use stringi or stringr to explicitly specify what characters you want to retain. For your example, it could look something like this. You may have to expand the characters you want to retain, but this approach is one option:

library(stringr)

my_string <- "one ‭two‬ three"

# Keep only ASCII letters, digits, punctuation, and whitespace.
# (A-z would also match [ \ ] ^ _ `, and | inside [] is a literal pipe,
# so the class is written as A-Za-z0-9 without separators.)
str_remove_all(my_string, "[^A-Za-z0-9[:punct:]\\s]")
[1] "one two three"

# Just checking that the result is still valid UTF-8
stringi::stri_enc_isutf8(str_remove_all(my_string, "[^A-Za-z0-9[:punct:]\\s]"))
[1] TRUE

EDIT: I do want to note that you should check and see how robust this approach is. I have not dealt with invisible characters often so this may not be the best way to go about removing them.
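
The character class above keeps only ASCII, so accented letters would be removed along with the invisible characters. A narrower variant is sketched below. It assumes the hidden characters are Unicode format characters (general category Cf, which covers U+202D, U+202C and U+200B); that is an assumption about the data, not something confirmed in the question. The example string and the names bad and cleaned are made up for illustration.

```r
library(stringr)

# Hypothetical test string: directional overrides around "two",
# plus a non-ASCII letter that should survive the cleanup.
bad <- "one \u202Dtwo\u202C three caf\u00e9"

# Remove only characters in Unicode general category Cf (format),
# leaving letters, digits, punctuation and whitespace untouched.
cleaned <- str_remove_all(bad, "\\p{Cf}")

nchar(bad)      # includes the two invisible characters
nchar(cleaned)  # two fewer
```

If the data contains other kinds of invisible characters, the pattern would need to be widened accordingly.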

Andrew answered Nov 02 '25 17:11



You haven't given us a way to construct your bad string so I can't test this on your data, but it works on this example.

badString <- "one \u200Btwo\u200B three"

chars <- strsplit(badString, "")[[1]]  # Assume badString has one entry; if not, add a loop

chars <- chars[nchar(chars, type = "width") > 0]

goodString <- paste(chars, collapse = "")

Both badString and goodString look the same when printed:

> badString
[1] "one ​two​ three"
> goodString
[1] "one two three"

but they have different numbers of characters:

> nchar(badString)
[1] 15
> nchar(goodString)
[1] 13
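
The strsplit comment above assumes a single string. For a whole column, the same steps can be wrapped in a function and applied to each element. This is only a sketch: drop_zero_width is a hypothetical helper name, and missing values (NA) are not handled. The approach drops whatever R reports as zero-width, which likely includes combining marks, so it is worth checking against accented text first.

```r
# Hypothetical helper: apply the split / filter / paste steps to each string.
drop_zero_width <- function(x) {
  vapply(strsplit(x, ""), function(chars) {
    # Keep only characters that occupy at least one display column
    paste(chars[nchar(chars, type = "width") > 0], collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

bad <- c("one \u200Btwo\u200B three", "plain")
good <- drop_zero_width(bad)

nchar(bad)
nchar(good)
```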
user2554330 answered Nov 02 '25 19:11
