Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

r - Error: Text after processing all cols in fread (data.table)

Tags:

r

data.table

I tried to import a text file in R (3.4.0) which actually contains 4 columns but the 4th column is mostly empty until 200,000+th row. I use the fread() in package data.table (ver 1.10.4)

fread("test.txt",fill = TRUE, sep = "\t", quote = "", header = FALSE)

I got this error message:

Error in fread("test.txt", fill = TRUE, sep = "\t", quote = "", header = FALSE) : 
Expecting 3 cols, but line 258088 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep='  ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.

I checked the file and there's additional text in 258088th row in the 4th column ("8-4").

Nevertheless, fill = TRUE did not solve this as I expected. I thought it might be fread() determining column numbers inappropriately because the additional column occurs very late in the file. So I tried this:

fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 250000)

The error persisted. On the other hand,

fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 258080)

This gives no error.

I thought I found the reason, but the weird thing happened when I tested with a dummy file generated by:

write.table(matrix(c(1:990000), nrow = 330000), "test2.txt", sep = "\t", row.names = FALSE)

with the addition of a "8-4" in the 4th column of the 250000th row by Excel. When read by fread():

fread("test2.txt", fill = TRUE, header = FALSE, sep = "\t")

It worked fine with no error message, and this should indicate some late additional column not necessarily trigger error.

I also tried changing encoding ("Latin-1" and "UTF-8") or quote, but neither helped.

Now I feel clueless, and hopefully I did my homework enough with a reproducible information. Thank you for helping.

For additional environmental info, my sessionInfo() is:

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] dplyr_0.5.0            purrr_0.2.2.2          readr_1.1.1            tidyr_0.6.3           
  [5] tibble_1.3.3           ggplot2_2.2.1          tidyverse_1.1.1        stringr_1.2.0         
  [9] microbenchmark_1.4-2.1 data.table_1.10.4     

loaded via a namespace (and not attached):
[1] Rcpp_0.12.11     cellranger_1.1.0 compiler_3.4.0   plyr_1.8.4       forcats_0.2.0   
[6] tools_3.4.0      jsonlite_1.5     lubridate_1.6.0  nlme_3.1-131     gtable_0.2.0    
[11] lattice_0.20-35  rlang_0.1.1      psych_1.7.5      DBI_0.6-1        parallel_3.4.0  
[16] haven_1.0.0      xml2_1.1.1       httr_1.2.1       hms_0.3          grid_3.4.0      
[21] R6_2.2.1         readxl_1.0.0     foreign_0.8-68   reshape2_1.4.2   modelr_0.1.0    
[26] magrittr_1.5     scales_0.4.1     rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5    
[31] colorspace_1.3-2 stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2     
like image 475
YC Chen Avatar asked Nov 24 '25 07:11

YC Chen


2 Answers

Actually there is a difference between the two files that you provide, and I think this is the cause of the different outputs of the fread.

The first file has an end of the line after the 3rd column, except line 258088, where there is a tab a 4th column and then the end of the line. (You can use the option 'show all characters to confirm that').

On the other hand the second file has in all rows an extra tab, i.e. a new empty column. So in the first case fread expects 3 columns and then finds out a 4th column. On the contrary in the second file, fread expects 4 columns.

I checked read.table with fill=TRUE and it worked with both files. So I think that something is done differently with the fill option of the fread.

I would expect since fill=TRUE, all the lines to be used so as to infer the number of columns (with cost on computational time).

In the comments there are some nice workarounds you can use.

like image 104
NpT Avatar answered Nov 26 '25 21:11

NpT


The file has issue: if the table has four columns, at the end of each row with the fourth column missing a \t should have been present.

In this case you may have better luck with a low-level approach: read the file line by line, add a \t to each row which doesn't have the fourth column, split each line with \t and collect all together in a data.frame. Most of the above work is done by the data.table::tstrsplit function. Try something like:

f<-readLines("test.txt")
require(stringr)
require(data.table)
a<-data.frame(tstrsplit(f,"\t",type.convert=TRUE,names=TRUE,keep=1:4),stringsAsFactors=FALSE)
str(a)
#'data.frame':  273070 obs. of  4 variables:
# $ V1: num  0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 ...
# $ V2: num  -18.7113 -1.2685 0.0768 0.1507 0.1609 ...
# $ V3: num  0 0 0 0 0 0 0 0 0 0 ...
# $ V4: chr  NA NA NA NA ...
like image 38
nicola Avatar answered Nov 26 '25 23:11

nicola



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!