I have a few files that are ~64 GB each that I would like to convert to HDF5 format. What would be the best approach for doing so? Reading line by line seems to take more than 4 hours, so I was thinking of using multiprocessing in sequence, but was hoping for some direction on what would be the most efficient way without resorting to Hadoop. Any help would be very much appreciated (and thank you in advance).
For this type of problem I typically turn away from Python. You're right that multiprocessing/parallelization is a good solution, but Python is not pleasant to work with in this area. Consider trying something on the JVM. I like Clojure's core.async, but there are also the peach ("parallel each") and celluloid libraries for JRuby, which is much closer to Python.
The approach doesn't have to be as "heavy" as Hadoop, but I'd still use a similar map/reduce pattern over the files. Have a thread that is reading line by line from the source file(s) and dispatching to several threads. (Using core.async I'd have multiple queues which are getting consumed by different threads, then feeding back a "finished" signal into a watchdog thread.) In the end you should be able to squeeze a lot of performance out of your CPU.
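If you do want to stay in Python, the reader/workers/"finished" signal pattern described above can be sketched roughly like this with the standard library. This is only a minimal sketch, not a tuned solution: `run_pipeline`, `parse_batch`, `n_workers` and `batch_size` are names I made up for illustration, and error handling is left out (a worker that raises would leave the collector waiting).

```python
import queue
import threading

_DONE = object()  # sentinel used as the "finished" signal


def run_pipeline(lines, parse_batch, n_workers=4, batch_size=1000):
    """Read `lines` in one thread and parse batches in `n_workers` threads.

    `parse_batch` is a hypothetical user-supplied function that turns a
    list of raw lines into whatever should be stored.
    """
    # Bounded queue, so the reader cannot run far ahead of the workers.
    tasks = queue.Queue(maxsize=n_workers * 2)
    results = queue.Queue()

    def reader():
        batch = []
        for line in lines:
            batch.append(line)
            if len(batch) == batch_size:
                tasks.put(batch)
                batch = []
        if batch:
            tasks.put(batch)  # last, possibly shorter, batch
        for _ in range(n_workers):
            tasks.put(_DONE)  # one stop signal per worker

    def worker():
        while True:
            batch = tasks.get()
            if batch is _DONE:
                results.put(_DONE)  # tell the collector this worker is done
                return
            results.put(parse_batch(batch))

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    # The collector plays the watchdog role: it counts "finished" signals.
    finished = 0
    collected = []
    while finished < n_workers:
        item = results.get()
        if item is _DONE:
            finished += 1
        else:
            collected.append(item)
    for t in threads:
        t.join()
    return collected  # batches arrive in completion order, not file order
```

You would call it with an open file as `lines`, and write each parsed batch to the HDF5 file from the collector loop instead of accumulating everything in memory. Keep in mind that in CPython the GIL limits how much pure-Python parsing can run in parallel across threads, so for CPU-bound parsing the same structure with processes (for example `multiprocessing`) is likely the better fit.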