How to rapidly load data into memory with Python?

I have a large CSV file (5 GB) and I can read it with pandas.read_csv(), but this operation takes a lot of time: 10-20 minutes.

How can I speed it up?

Would it be useful to transform the data into SQLite format? If so, what should I do?

EDIT: More information: the data contains 1852 columns and 350000 rows. Most of the columns are float64 and contain numbers. Some others contain strings or dates (which I suppose are treated as strings).

I am using a laptop with 16 GB of RAM and an SSD. The data should fit fine in memory (but I know that Python tends to inflate the data size).

EDIT 2 :

During loading I receive this message:

/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py:1164: DtypeWarning: Columns (1841,1842,1844) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)

EDIT: SOLUTION

Read the CSV file once and save it with data.to_hdf('data.h5', 'table'). This format is incredibly efficient.
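
For later runs the file can then be loaded back with pd.read_hdf (this needs the PyTables package installed):

    import pandas as pd

    # After the one-off CSV -> HDF5 conversion above, later runs skip the
    # CSV parser entirely and just read the binary HDF5 file.
    data = pd.read_hdf('data.h5', 'table')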


1 Answer

This actually depends on which part of reading it is taking 10 minutes.

  • If it's actually reading from disk, then obviously any more compact form of the data will be better.
  • If it's processing the CSV format (you can tell because your CPU is near 100% on one core while reading, and very low on the other cores), then you want a form that's already preprocessed.
  • If it's swapping memory, e.g., because you only have 2GB of physical RAM, then nothing is going to help except splitting the data.

It's important to know which one you have. For example, stream-compressing the data (e.g., with gzip) will make the first problem a lot better, but the second one even worse.
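
One rough way to tell the first two cases apart is to time the raw disk read separately from the full parse. The filename below is a placeholder, and note that the OS file cache can make the second read look faster than a cold one:

    import time
    import pandas as pd

    path = 'data.csv'  # placeholder filename

    # Time just pulling the bytes off disk, in chunks.
    t0 = time.perf_counter()
    with open(path, 'rb') as f:
        while f.read(64 * 1024 * 1024):
            pass
    print('raw disk read:', time.perf_counter() - t0, 'seconds')

    # Time the full CSV parse; the difference is roughly the parsing cost.
    t0 = time.perf_counter()
    df = pd.read_csv(path)
    print('pd.read_csv  :', time.perf_counter() - t0, 'seconds')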

It sounds like you probably have the second problem, which is good to know. (However, there are things you can do that will probably be better no matter what the problem is.)


Your idea of storing it in a SQLite database is nice because it can at least potentially solve all three at once; you only read the data in from disk as needed, and it's stored in a reasonably compact and easy-to-process form. But it's not the best possible solution for the first two problems, just a "pretty good" one.

In particular, if you actually do need to do array-wide work across all 350000 rows, and can't translate that work into SQL queries, you're not going to get much benefit out of sqlite. Ultimately, you're going to be doing a giant SELECT to pull in all the data and then process it all into one big frame.
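
If you do go the SQLite route anyway, the pandas side is short. The paths, the table name, and the column names below are all made up:

    import sqlite3
    import pandas as pd

    # One-off conversion: CSV -> SQLite ('data' is an arbitrary table name).
    df = pd.read_csv('data.csv')
    with sqlite3.connect('data.db') as conn:
        df.to_sql('data', conn, if_exists='replace', index=False)

    # Later: pull just what you need with SQL...
    with sqlite3.connect('data.db') as conn:
        subset = pd.read_sql('SELECT col_a, col_b FROM data WHERE col_a > 0', conn)
        # ...or the giant SELECT described above, which loses most of the benefit.
        everything = pd.read_sql('SELECT * FROM data', conn)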


Another option is to write out the shape and structure information yourself, then write the underlying arrays in NumPy binary form. For reading, you reverse that. NumPy's binary form just stores the raw data as compactly as possible, and it's a format that can be written blindingly quickly (it's basically just dumping the raw in-memory storage to disk). That will improve both the first and second problems.
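
A minimal sketch of that idea, covering only the numeric columns (the string/date columns would need their own storage, and the filenames are placeholders):

    import numpy as np
    import pandas as pd

    df = pd.read_csv('data.csv')

    # Dump the float columns as one raw binary .npy file; the .npy header
    # already records dtype and shape, so only the column names need saving.
    floats = df.select_dtypes(include=[np.number])
    np.save('data_values.npy', floats.values)
    np.save('data_columns.npy', np.array(floats.columns, dtype=str))

    # Later: rebuild the numeric frame straight from the raw binary files.
    values = np.load('data_values.npy')
    columns = np.load('data_columns.npy')
    df2 = pd.DataFrame(values, columns=columns)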


Similarly, storing the data in HDF5 (either using Pandas IO or an external library like PyTables or h5py) will improve both the first and second problems. HDF5 is designed to be a reasonably compact and simple format for storing the same kind of data you usually store in Pandas. (And it includes optional compression as a built-in feature, so if you know which of the two you have, you can tune it.) It won't solve the second problem quite as well as the last option, but probably well enough, and it's much simpler (once you get past setting up your HDF5 libraries).
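
Through the Pandas IO layer that is a one-liner each way. The complevel/complib values below are just an example of the optional compression mentioned above, and the filenames are placeholders (PyTables must be installed):

    import pandas as pd

    df = pd.read_csv('data.csv')  # placeholder filename

    # Uncompressed: bigger file, cheaper CPU (helps if parsing was the bottleneck).
    df.to_hdf('data_raw.h5', 'table', mode='w')

    # Compressed: smaller file, a bit more CPU (helps if disk was the bottleneck).
    df.to_hdf('data_blosc.h5', 'table', mode='w', complevel=5, complib='blosc')

    # Either file reads back the same way.
    df = pd.read_hdf('data_blosc.h5', 'table')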


Finally, pickling the data may sometimes be faster. pickle is Python's native serialization format, and it's hookable by third-party modules—and NumPy and Pandas have both hooked it to do a reasonably good job of pickling their data.

(Although this doesn't apply to the question, it may help someone searching later: If you're using Python 2.x, make sure to explicitly use pickle format 2; IIRC, NumPy is very bad at the default pickle format 0. In Python 3.0+, this isn't relevant, because the default format is at least 3.)
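
With Pandas that is also a one-liner each way; the by-hand version below is only there to show where the protocol argument goes, and the filenames are placeholders:

    import pickle
    import pandas as pd

    df = pd.read_csv('data.csv')

    # The convenient route: Pandas pickles the frame for you.
    df.to_pickle('data.pkl')
    df = pd.read_pickle('data.pkl')

    # By hand; on Python 2.x this is where you would insist on protocol=2.
    with open('data.pkl', 'wb') as f:
        pickle.dump(df, f, protocol=2)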
