
How to determine if character string contains non-Roman characters in R

Tags:

string

regex

r

What is the preferred way of determining if a string contains non-Roman/non-English (e.g., ないでさ) characters?

Brandon Loudermilk asked Dec 07 '25 07:12



1 Answer

You could use regex/grep to check for characters whose hex values fall outside the range \x20-\x7F (the printable ASCII characters, plus DEL at \x7F):

x <- 'ないでさ'
grep( "[^\x20-\x7F]",x )
#[1] 1
grep( "[^\x20-\x7F]","Normal text" )
#integer(0)

If you wanted the non-printing ("control") characters to be considered "English", you could extend the range of the character class in the first argument to grep to start with "\x01". See ?regex for more information on using character class arguments. See ?Quotes for more information about how to specify characters as Unicode, hexadecimal, or octal values.
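A minimal sketch of that extension, comparing the strict class with one that starts at "\x01". It uses grepl rather than grep so that each input gets its own TRUE/FALSE; the sample strings and the variable names strict and lenient are illustrative assumptions, not part of the original answer.

```r
# Illustrative inputs: the Japanese string from the question (written with
# \u escapes so the script parses in any locale), plain ASCII, and ASCII
# containing a tab.
x <- c("\u306a\u3044\u3067\u3055", "Normal text", "Tab\there")

# Strict class: anything outside \x20-\x7F is flagged, so the tab (\x09)
# counts as "non-English" too.
strict <- grepl("[^\x20-\x7F]", x)

# Extended class: control characters from \x01 upward are allowed, so only
# characters above \x7F are flagged.
lenient <- grepl("[^\x01-\x7F]", x)

print(strict)
print(lenient)
```

With the extended class, the string containing a tab should no longer be flagged, while the Japanese string still is.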

The R.oo package has conversion functions that may be useful:

library(R.oo)
?intToChar
?charToInt

The fact that Henrik Bengtsson saw fit to include these in his package says to me that there is no handy method to do this in base/default R. He's a long-time useR/guRu.

Seeing the other answer prompted this effort, which seems straightforward:

> is.na( iconv( c(x, "OrdinaryASCII") , "", "ASCII") )
[1]  TRUE FALSE
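The iconv check can be wrapped in a small helper, sketched below. The name has_non_ascii is hypothetical (not a base R function), and the sketch assumes the input can be normalised to UTF-8 with enc2utf8 rather than relying on the native encoding ("") as above. Note that what iconv does with unconvertible input depends on the platform's iconv implementation, and that NA inputs also come back as TRUE.

```r
# Hypothetical helper: TRUE for each string that cannot be converted to ASCII.
has_non_ascii <- function(x) {
  # Normalise to UTF-8 first so the 'from' encoding is known (an assumption
  # of this sketch), then let iconv() return NA for unconvertible elements.
  is.na(iconv(enc2utf8(x), from = "UTF-8", to = "ASCII"))
}

# Illustrative inputs: Japanese text, plain ASCII, and a Latin word with an accent.
x <- c("\u306a\u3044\u3067\u3055", "OrdinaryASCII", "caf\u00e9")
result <- has_non_ascii(x)
print(result)
```

Unlike the regex approach, this flags any non-ASCII character, including accented Latin letters, and treats control characters as ASCII.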
IRTFM answered Dec 09 '25 19:12
