I am processing SPSS data from a questionnaire that must have originated in MS Word. Word automatically changes hyphens into long dashes, and these then get converted into characters that don't display properly, e.g. "-" turns into "ú".
My question: What is the equivalent to utf8ToInt() in the WINDOWS-1252 character set?
utf8ToInt("A")
[1] 65
When I do this with my own data, I get an error:
x <- str_sub(levels(sd$j1)[1], 7, 7)
print(x)
[1] "ú"
utf8ToInt(x)
Error in utf8ToInt(x) : invalid UTF-8 string
However, the contents of x are perfectly usable in grep and gsub expressions.
Sys.getlocale()
[1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252"
Windows-1252 and ASCII: The first part of Windows-1252 (code points 0-127) is the original ASCII character set. It contains digits, upper- and lowercase English letters, and some special characters.
Originally, Windows code page 1252, the code page commonly used for English and other Western European languages, was based on an American National Standards Institute (ANSI) draft.
Just open up the windows-1252 encoded file in Notepad, then choose 'Save as' and set encoding to UTF-8.
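The same re-encoding can probably be done from within R rather than Notepad. The following is only a minimal sketch, assuming a plain text file in Windows-1252; the file names are temporary placeholders and the sample content is made up for illustration.

```r
# Hypothetical input and output files, created only for this illustration
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")

# Write a one-line sample file: "men" followed by the single CP1252 byte 0xFA
writeBin(c(charToRaw("men"), as.raw(0xfa), as.raw(0x0a)), infile)

# Read the lines as they are, then convert them from CP1252 to UTF-8
lines_cp1252 <- readLines(infile, warn = FALSE)
lines_utf8   <- iconv(lines_cp1252, from = "CP1252", to = "UTF-8")

# useBytes = TRUE writes the UTF-8 bytes without re-encoding to the locale
writeLines(lines_utf8, outfile, useBytes = TRUE)

utf8ToInt(lines_utf8)   # 109 101 110 250
```

This does not apply to a binary SPSS .sav file, only to text files.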
If you load the SPSS sav file via read.spss from package foreign, you can import the data frame with the correct encoding by specifying it explicitly, like:
read.spss("foo.sav", reencode="CP1252")
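If the data are already loaded, a similar fix can be applied to the labels after the fact. This is a rough sketch under the assumption that the labels really are Windows-1252 bytes; the vector `lev` is a made-up stand-in for something like `levels(sd$j1)`.

```r
# Hypothetical labels containing the CP1252 en dash byte (0x96)
lev <- c("10 \x96 20", "20 \x96 30")

# Convert from CP1252 to UTF-8 so the dash becomes a proper U+2013
lev_utf8 <- iconv(lev, from = "CP1252", to = "UTF-8")

# Optionally replace the en dash with a plain ASCII hyphen
lev_fixed <- gsub("\u2013", "-", lev_utf8, fixed = TRUE)
lev_fixed   # "10 - 20" "20 - 30"
```

For a factor, the result could then be assigned back with `levels(f) <- lev_fixed`.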
After some head-scratching, lots of reading help files and trial and error, I created two little functions that do what I need. encToInt converts its input to UTF-8 and returns the integer vector of code points for the UTF-8 encoded string; intToEnc does the reverse, building a UTF-8 string from the integers and converting it to the requested encoding.
# Convert character to integer vector
# Optional encoding specifies encoding of x, defaults to current locale
encToInt <- function(x, encoding=localeToCharset()){
    utf8ToInt(iconv(x, encoding, "UTF-8"))
}
# Convert integer vector to character vector
# Optional encoding specifies encoding of the result, defaults to current locale
intToEnc <- function(x, encoding=localeToCharset()){
    iconv(intToUtf8(x), "UTF-8", encoding)
}
Some examples:
x <- "\xfa"
encToInt(x)
[1] 250
intToEnc(250)
[1] "ú"
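To see why the original utf8ToInt() call fails and what the conversion changes, it may help to look at the raw bytes. This is only an illustrative sketch: the encoding is passed explicitly as "CP1252" so that it does not depend on the current locale, and encToInt is repeated from above so the snippet runs on its own.

```r
# Same definition as above, repeated so the snippet is self-contained
encToInt <- function(x, encoding = localeToCharset()) {
    utf8ToInt(iconv(x, encoding, "UTF-8"))
}

x <- "\xfa"                        # one byte, 0xFA
charToRaw(x)                       # the single raw byte behind the string
validUTF8(x)                       # FALSE: 0xFA on its own is not valid UTF-8

y <- iconv(x, "CP1252", "UTF-8")   # re-encode the same character as UTF-8
charToRaw(y)                       # now two bytes
utf8ToInt(y)                       # 250

# The dashes Word inserts sit in the 0x80-0x9F range of Windows-1252,
# which is not part of ASCII
encToInt("\x96", "CP1252")         # 8211, en dash (U+2013)
encToInt("\x97", "CP1252")         # 8212, em dash (U+2014)
```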