I have a large CSV file (20 GB, almost 200 million lines) that I cannot load into memory as a whole, so I want to load it piece by piece.
I couldn't find a way to use a file connection with fread (as you can with readLines), so I tried using "skip":
for(i in 1:100){
lines=fread(filename,nrows=rowPerRead,skip=(i-1)*rowPerRead)
}
This works fine at the beginning, but it gets slower as skip gets larger, and the slowdown is nonlinear. It turns out that although those lines are skipped, reading still takes a lot of memory during the process, and that memory is only released when the process is done. Once the memory is used up, the process becomes very slow.
> system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1,quote="") })
   user  system elapsed 
   0.71    0.04    0.73 
> system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1e8,quote="") })
Read 1000000 rows and 12 (of 12) columns from 20.049 GB file in 00:01:47
   user  system elapsed 
  21.89   13.76  106.60 
> system.time({newLines=fread("userinfo4.csv",nrows=1e6,skip=1.4e8,quote="") })
Read 1000000 rows and 12 (of 12) columns from 20.049 GB file in 00:02:48
   user  system elapsed 
  16.95   12.49  169.76 
> 
[Figure: memory usage during the 2nd and 3rd runs]

So my questions are: 1. Is there a more memory-efficient way to run fread with a large skip? 2. Is there a way to run fread on a file connection, so that I can continue from the last read instead of restarting from the beginning?
fread can take a shell command that preprocesses the file and then read that command's output. With this option we can run a gawk script that extracts only the required lines. Note that you may need to install gawk if it is not already on your system (Linux and Unix-like machines usually have it; on Windows you may need to install it).
n = 100   # number of leading lines to skip
# gawk prints only the lines whose record number (NR) is greater than n
cmd = paste0('gawk "NR > ', n, '" ', filename)
# pass the command through cmd= so fread runs it instead of treating it as a file name
lines = fread(cmd = cmd, nrows = rowPerRead)
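To read the file chunk by chunk, the same idea can be extended so that gawk emits the header plus exactly one chunk and then exits. The following is only a sketch: `chunk_cmd` and `read_chunk` are hypothetical helper names, and it assumes that line 1 of the file is a header, that gawk is on the PATH, and that your data.table version supports the `cmd` argument of fread.

```r
# Hypothetical helper (not part of data.table): build a gawk command that
# prints the header line plus the data rows belonging to one chunk.
chunk_cmd <- function(filename, chunk, rows_per_read) {
  first <- (chunk - 1) * rows_per_read + 2  # assumption: line 1 is the header
  last  <- first + rows_per_read - 1
  # Rule 1 prints line 1 and lines first..last.
  # Rule 2 exits once gawk is past the chunk, so the rest of the file is not scanned.
  # %.0f keeps large line numbers out of scientific notation.
  program <- sprintf(
    "NR == 1 || (NR >= %.0f && NR <= %.0f) { print } NR > %.0f { exit }",
    first, last, last
  )
  paste("gawk", shQuote(program), shQuote(filename))
}

# Hypothetical wrapper: read one chunk by handing the command to fread.
# Extra arguments (for example quote = "") are passed on to fread.
read_chunk <- function(filename, chunk, rows_per_read, ...) {
  data.table::fread(cmd = chunk_cmd(filename, chunk, rows_per_read), ...)
}
```

The loop from the question would then become something like `for (i in 1:100) { lines <- read_chunk(filename, i, rowPerRead) }`. gawk still has to walk through the file from the top to reach each chunk, but it streams line by line, so the skipped lines should not accumulate in R's memory.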