 

Pandas `read_csv` Method Is Using Too Much RAM

I am currently playing with the Rotten Tomatoes dataset on Kaggle, using a pandas DataFrame to manipulate the data.

I have used CountVectorizer() from sklearn to extract the features (the vocabulary size is 5000). I then saved 100k rows of features and labels to a .csv file. To be precise, the .csv has 100k rows and 5001 columns in total, and its size is about 1 GB.
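For context, the .csv was produced roughly as follows (a sketch of the setup, not my exact code; the train.tsv file name and the Phrase/Sentiment column names are assumptions about the Kaggle data layout):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# load the raw Kaggle data (file and column names are assumptions)
raw = pd.read_csv('train.tsv', sep='\t')

# extract a 5000-term bag-of-words count matrix
vectorizer = CountVectorizer(max_features=5000)
features = vectorizer.fit_transform(raw['Phrase'])  # scipy sparse matrix

# densify, attach the label, and write the 100k x 5001 CSV
df = pd.DataFrame(features.toarray())
df['Sentiment'] = raw['Sentiment'].values
df.to_csv('train.csv', index=False)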

The problem arose when I tried to read the .csv back in:

import numpy as np
import pandas as pd

train = pd.read_csv('train.csv', header=0,
                    delimiter=",", engine='c', na_filter=False, dtype=np.int64)

The CSV parser used too much RAM. I have 8 GB of RAM on my system, which apparently is not enough.

Is there any way to reduce the RAM usage? I am not constrained to the pandas library.

asked Dec 11 '25 by Grigoriy Mikhalkin

1 Answer

You can try using the chunksize option of pandas.read_csv. It lets you process the data in batches instead of loading it all into memory at once. While processing each batch you can strip out any unnecessary columns and keep only a new, slimmer object that fits into memory. An example is below:

import numpy as np
import pandas as pd

# read the CSV in chunks of 50,000 rows instead of all at once
chunks = pd.read_csv('train.csv', header=0, delimiter=",", engine='c',
                     na_filter=False, dtype=np.int64, chunksize=50000)

slim_data = []
for chunk in chunks:
    # do your processing here, then keep only the slimmed-down result
    slim_data.append(chunk)

final_data = pd.concat(slim_data)

In the example each chunk is a pandas DataFrame of 50,000 records. You iterate over the chunks, do your processing on each one, and append the processed DataFrame to a new object (slim_data above), then concatenate all the chunks into a final DataFrame you can use in your modeling.

To reiterate, this only helps if, while processing each batch, you remove data elements or represent them more efficiently; otherwise you will run into memory issues again. It does, however, get you around having to load all the data into memory at once.
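For example, since the values here are small CountVectorizer counts (plus a small integer label), one simple way to slim each chunk is to downcast from int64 to int8, which cuts the memory per value by a factor of eight. A minimal sketch, assuming every value really fits in an 8-bit integer:

import numpy as np
import pandas as pd

slim_data = []
for chunk in pd.read_csv('train.csv', header=0, delimiter=",", engine='c',
                         na_filter=False, dtype=np.int64, chunksize=50000):
    # downcast every column from int64 to int8 (assumes all values fit in int8)
    slim_data.append(chunk.astype(np.int8))

final_data = pd.concat(slim_data)  # roughly 1/8 the memory of the int64 version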

answered Dec 13 '25 by vielkind


