
fread: empty string ("") in na.strings is not interpreted as NA

How can I get fread() to read "" as NA for all variables, including character variables?

I am importing a .csv file where missing values are empty strings (""; no space). I want "" to be interpreted as the missing value NA, and I tried `na.strings = ""` without success:

data <- fread("file.csv", na.strings = "")

unique(data$character_variable)
# [1] "abc" "def"      ""            

On the other hand, when I use read.csv with na.strings = "", the "" are turned into NAs, even for character variables. This is the result I want.

data <- read.csv("file.csv", na.strings = "")

unique(data$character_variable)
# [1] "abc" "def"      NA

versions

  • R version 3.6.1 (2019-07-05)
  • data.table_1.12.8
Danielle asked Dec 09 '25 11:12


1 Answer

Well, you can't if your CSV file looks like this:

a,b
x,y
"",1

Note that whatever is inside the quotes is treated as a string literal, because "" are the quote characters. In that sense, ,"", in a CSV file means an empty string, not a missing value (which would be ,,). I would consider this a good feature for consistency. It is also documented under na.strings in the fread documentation:

A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type character is read as NA for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted since that is how the string literal ,"NA", is distinguished from ,NA,, for example, when na.strings="NA".
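To check the distinction on your own installation, here is a minimal sketch that reads two small in-memory files with na.strings = "" (as in the question). The sample data and the variable names unquoted and quoted are made up for illustration; sep and header are passed explicitly only to avoid relying on auto-detection for such tiny inputs.

```r
library(data.table)

# Unquoted empty field (,,) read with na.strings = "", as in the question.
unquoted <- fread(text = "a,b\nx,y\n,1\n",
                  sep = ",", header = TRUE, na.strings = "")
print(is.na(unquoted$a))

# Quoted empty field (,"",): per the question and the documentation quoted
# above, this is expected to stay an empty string rather than become NA.
quoted <- fread(text = 'a,b\nx,y\n"",1\n',
                sep = ",", header = TRUE, na.strings = "")
print(quoted$a)
```

Printing both columns side by side should make it clear which of the two forms your file actually contains.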

On the other hand, you may notice that if the file looks like this

a,b
1,y
"",1

then the empty string is converted into NA. However, I don't think this is a bug: the behaviour is probably a consequence of type coercion by the parser. The Details section of the same documentation says:

The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character.

So column a is first read as a character column and later converted into an integer one. The empty string is still read as is but coerced into an NA_integer_ in the second step.
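If you cannot change the file and still want NA in character columns, one possible workaround is to post-process the result after fread. The sketch below is only an illustration: blank_to_na is a hypothetical helper name (it is not part of data.table), and the sample data is the first file from this answer.

```r
library(data.table)

# Hypothetical helper (not part of data.table): replace "" with NA in every
# character column of a data.table, modifying it by reference.
blank_to_na <- function(dt) {
  chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
  for (col in chr_cols) {
    idx <- which(dt[[col]] == "")  # which() drops NA comparisons
    if (length(idx) > 0L) {
      set(dt, i = idx, j = col, value = NA_character_)
    }
  }
  invisible(dt)
}

# Same layout as the first example file: a quoted empty string in column a.
dt <- fread(text = 'a,b\nx,y\n"",1\n',
            sep = ",", header = TRUE, na.strings = "")
blank_to_na(dt)
print(dt)
```

Because set() updates by reference, no copy of the table is made. The trade-off is that genuine empty strings and missing values can no longer be told apart afterwards.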

ekoam answered Dec 12 '25 02:12
