I'm writing a program that analyzes CSV files of at least 0.5 GB (and up to more than 20 GB). I read from the CSV with fstream, using while (getline(fin,line)) {}, and do an average of 17 ms of work on each comma-separated record. Simple stuff.
But there are a LOT of records. So obviously the program is I/O bound, and I was wondering whether I could improve the I/O performance. I can't resort to OpenMP, as I would run into CPU constraints, and buffering a file this large in memory won't work either. So I might need some kind of pipeline...
I have VERY little experience in multithreading in C++ and have never used dataflow frameworks. Could anyone point me in the right direction?
Update (12/23/14):
Thanks for all your comments. You are right, 17 ms was a bit much... After doing a LOT of profiling (oh, the pain), I isolated the bottleneck as an iteration over a substring in each record (75 chars). I experimented with #pragmas, but it simply isn't enough work to parallelize. The overhead of the function call was the main gripe. It is now 5.41 μs per record, having shifted a big block. It's ugly, but faster.
Thanks @ChrisWard1000 for your suggestions. Unfortunately I do not have much control over the hardware I'm using at the moment, but I will profile with larger data sets (>20 GB CSV) and see how I could introduce mmap, multithreaded parsing, etc.
17 ms per record is extremely high. It should not be difficult to improve upon that unless you are using some seriously antiquated hardware.
Upgrade the hardware. SSDs, RAID striping and PCI Express drives are designed for this kind of activity.
Read the file in larger chunks at a time, reducing I/O waiting times. Perhaps use fread to dump large chunks to memory first.
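As a rough illustration of the chunked-read idea, here is a minimal sketch that pulls the file in with fread in large blocks and splits records on '\n' itself. The helper name for_each_line_chunked, its parameters and the carry-over buffer are my own assumptions, not anything from the question; it needs C++17 for std::string_view and does not strip '\r' from Windows line endings.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: reads `path` in blocks of `chunk_bytes` and calls
// `on_line` once per record, without the trailing '\n'.
// Returns false if the file cannot be opened.
bool for_each_line_chunked(const std::string& path, std::size_t chunk_bytes,
                           const std::function<void(std::string_view)>& on_line) {
    std::FILE* fp = std::fopen(path.c_str(), "rb");
    if (!fp) return false;

    std::vector<char> buf(chunk_bytes);
    std::string carry;  // partial record left over from the previous chunk

    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), fp)) > 0) {
        const char* p = buf.data();
        const char* end = p + n;
        while (p < end) {
            const char* nl =
                static_cast<const char*>(std::memchr(p, '\n', end - p));
            if (!nl) {  // no newline left: keep the tail for the next chunk
                carry.append(p, end);
                break;
            }
            if (carry.empty()) {
                // Whole record lies inside this chunk: no copy needed.
                on_line(std::string_view(p, nl - p));
            } else {
                // Record started in an earlier chunk: finish it in `carry`.
                carry.append(p, nl);
                on_line(carry);
                carry.clear();
            }
            p = nl + 1;
        }
    }
    if (!carry.empty()) on_line(carry);  // last record without a final '\n'
    std::fclose(fp);
    return true;
}
```

A chunk size of a few megabytes is a reasonable starting point, but whether this actually beats getline on your machine is something only measurement will tell you.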
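Consider using mmap to map the file into your process's address space, so that you can read its contents through a pointer and let the operating system page the data in on demand.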
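To make the mmap suggestion a bit more concrete, here is a minimal sketch that maps a file read-only and walks it with memchr. It assumes a POSIX system and a 64-bit build (a 20 GB file will not fit in a 32-bit address space); the function name count_newlines_mmap is made up for illustration, and counting newlines merely stands in for your real per-record work.

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: counts '\n' bytes in `path` through a read-only
// memory mapping. Returns -1 on any error.
long long count_newlines_mmap(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects length 0

    const std::size_t len = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return -1;

    const char* p = static_cast<const char*>(addr);
    const char* end = p + len;
    long long count = 0;
    while (p < end) {
        const char* nl =
            static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (!nl) break;  // trailing bytes without a newline
        ++count;         // [p, nl) is one record; real parsing would go here
        p = nl + 1;
    }

    munmap(addr, len);
    return count;
}
```

The appeal is that there is no user-space buffer to manage and no copy per line. Whether it is faster than large sequential reads depends on the OS and the storage, so treat it as something to benchmark rather than a guaranteed win.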
Most importantly, profile your code to see where the delays are. This is notoriously difficult with I/O activity because it differs between machines and often varies significantly at runtime.
You could attempt to add multithreaded parsing; however, I strongly suggest you try this only as a last resort, and understand that it will likely be the cause of a lot of pain and suffering.