Converting large files in python [closed]

Question

I have a few files of roughly 64 GB each that I would like to convert to HDF5 format. What would be the best approach? Reading line by line takes more than 4 hours, so I was thinking of using multiprocessing, working through the files in sequence, but I was hoping for some direction on the most efficient way to do this without resorting to Hadoop. Any help would be very much appreciated, and thank you in advance.

asthasr · Accepted Answer

For this type of problem I typically turn away from Python. You're right that multiprocessing/parallelization is a good solution, but Python is not pleasant to work with in this area. Consider trying something on the JVM. I like Clojure's core.async, but there are also the peach ("parallel each") and celluloid libraries for JRuby, which is much closer to Python.

The approach doesn't have to be as "heavy" as Hadoop, but I'd still use a similar map/reduce pattern over the files. Have one thread read line by line from the source file(s) and dispatch the lines to several worker threads. (Using core.async, I'd have multiple queues consumed by different threads, each feeding a "finished" signal back to a watchdog thread.) In the end you should be able to squeeze a lot of performance out of your CPU.
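Since the question is about Python, here is a minimal sketch of that reader/worker/"finished"-signal pattern using only the standard library's `queue` and `threading` modules. It is an illustration under assumptions, not a tuned converter: `parse_line`, `convert`, and `chunk_size` are hypothetical names, the input is assumed to be whitespace-separated numbers, and the HDF5 write is left as a comment. It uses threads to mirror the description above; in standard CPython the GIL limits CPU-bound threads, so for heavy parsing the workers would likely need to be processes (for example via `multiprocessing`) instead.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream on both queues


def parse_line(line):
    # Hypothetical parser: assumes whitespace-separated numeric fields.
    return [float(x) for x in line.split()]


def reader(lines, work_q, n_workers, chunk_size):
    """Read lines, group them into chunks, and dispatch the chunks to workers."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            work_q.put(chunk)
            chunk = []
    if chunk:
        work_q.put(chunk)
    for _ in range(n_workers):
        work_q.put(SENTINEL)  # one stop signal per worker


def worker(work_q, result_q):
    """Parse chunks until the stop signal arrives, then report 'finished'."""
    while True:
        chunk = work_q.get()
        if chunk is SENTINEL:
            result_q.put(SENTINEL)
            return
        result_q.put([parse_line(line) for line in chunk])


def convert(lines, n_workers=4, chunk_size=1000):
    # Bounded work queue, so the reader cannot run far ahead of the workers.
    work_q = queue.Queue(maxsize=2 * n_workers)
    result_q = queue.Queue()
    threads = [threading.Thread(target=reader,
                                args=(lines, work_q, n_workers, chunk_size))]
    threads += [threading.Thread(target=worker, args=(work_q, result_q))
                for _ in range(n_workers)]
    for t in threads:
        t.start()

    rows = []
    finished = 0
    # The main thread plays the watchdog: it counts "finished" signals.
    while finished < n_workers:
        item = result_q.get()
        if item is SENTINEL:
            finished += 1
        else:
            rows.extend(item)  # a real converter would append to the HDF5 file here
    for t in threads:
        t.join()
    return rows
```

An open file object can be passed directly as `lines`, since iterating over it yields one line at a time without loading the whole file. Note that chunks may finish out of order, so the sketch does not preserve the original row order, and it collects rows in memory only for demonstration.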

Tags: python, database, large-data, large-files

Cenoc
