How to reduce memory usage and speed up the code

I am using a huge dataset with 5 columns and more than 90 million rows. The code works fine on part of the data, but when I run it on the whole dataset I get a MemoryError. I have read about generators, but they seem very complex to me. Can I get an explanation based on this code?

import pandas as pd

df = pd.read_csv('D:.../test.csv', names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"])

df = df[:-1]  # drop the last (incomplete) row
df.loc[:, 'timestamp'] = pd.to_datetime(df.loc[:, 'timestamp'])
pd.set_option('float_format', '{:f}'.format)
df['epoch'] = df.loc[:, 'timestamp'].astype('int64') // 1e9
df['day_of_week'] = pd.to_datetime(df['epoch'], unit="s").dt.weekday_name
del df['timestamp']

# Write one CSV per weekday
for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
    day_df = df.loc[df['day_of_week'] == day]
    day_df.to_csv(f'{day}.csv', index=False)

The error appears in the last for loop.

Sample data:

d4ace40905729245a5a0bc3fb748d2b3    1   2016-06-01T08:18:46.000Z    22.9484 56.7728
d4ace40905729245a5a0bc3fb748d2b3    2   2016-06-01T08:28:05.000Z    22.9503 56.7748

UPDATED

I did this:

chunk_list = []

# df_chunk is the reader returned by pd.read_csv(..., chunksize=...)
for chunk in df_chunk:
    chunk_list.append(chunk)
df_concat = pd.concat(chunk_list)

I have no idea how to proceed from here. How do I apply the rest of the code?

Asked by Mamed

1 Answer

My advice is to switch to Dask or Spark.

If you want to keep using pandas, try the following tips when reading the CSV file with pandas.read_csv:

  1. chunksize parameter: this lets you read the file one piece at a time. In your case a chunksize of one million would give you about 90 chunks, and you could operate on each chunk individually.
  2. dtype parameter: with this parameter you can specify the data type of each column by passing a dictionary such as {'a': np.float32, 'b': np.int32, 'c': 'Int32'} (a short sketch after this list illustrates the savings).
     Pandas defaults to 64-bit data types, while 32-bit may be enough for you. With this trick you can save roughly 50% of the space.
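
As a rough, hypothetical illustration of point 2 (assuming the column names from the question and measuring on a one-million-row sample rather than the full file), you can compare the footprint of the default 64-bit dtypes against explicit 32-bit ones:

import numpy as np
import pandas as pd

cols = ["id_easy", "ordinal", "timestamp", "latitude", "longitude"]

# Default dtypes: ordinal -> int64, latitude/longitude -> float64
df64 = pd.read_csv('test.csv', names=cols, nrows=1_000_000)

# Explicit 32-bit dtypes: half the bytes per numeric value
df32 = pd.read_csv('test.csv', names=cols, nrows=1_000_000,
                   dtype={"ordinal": np.int32,
                          "latitude": np.float32,
                          "longitude": np.float32})

print(df64.memory_usage(deep=True).sum())
print(df32.memory_usage(deep=True).sum())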

Your case study

Try this code:

df_chunks = pd.read_csv('test.csv', chunksize=1000000, iterator=True,
                        names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"],
                        parse_dates=['timestamp'], error_bad_lines=False,
                        dtype={"ordinal": 'int32', "latitude": 'float32', "longitude": 'float32'})
for chunk in df_chunks:
    # process the single chunk, e.g. derive the weekday used for splitting
    chunk['day_of_week'] = chunk['timestamp'].dt.day_name()
    for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
        day_df = chunk.loc[chunk['day_of_week'] == day]
        day_df.to_csv(f'{day}.csv', mode='a', index=False, header=False)

This way you work on one chunk of data at a time and never hold the whole dataset in memory. mode='a' tells pandas to append to the file instead of overwriting it.
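
One side effect to be aware of (my addition, not part of the original answer): with header=False and append mode, the day files never get a header row. If you want a header exactly once per file, a minimal sketch is to write it only when the file does not exist yet:

import os

# Inside the "for chunk in df_chunks:" loop, replacing the inner loop body:
for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
    day_df = chunk.loc[chunk['day_of_week'] == day]
    out = f'{day}.csv'
    # Write the header only on the first append to each day's file
    day_df.to_csv(out, mode='a', index=False, header=not os.path.exists(out))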

Note 1: You do not need pandas.concat here. The only thing iterator=True and chunksize=1000000 do is give you a reader object that yields 1,000,000-row DataFrames instead of reading the whole file at once. Using concat you lose all the advantages of the iterator and load the entire file into memory, exactly like calling read_csv without chunksize.
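
To make Note 1 concrete, here is a hedged contrast between the anti-pattern from the update and the streaming pattern (process_and_write is a hypothetical stand-in for the per-chunk loop above):

# Anti-pattern: concatenating the chunks rebuilds the full 90M-row DataFrame,
# so peak memory is the same as a plain read_csv without chunksize.
df_all = pd.concat(pd.read_csv('test.csv', chunksize=1000000))

# Streaming pattern: each chunk is processed, written out, and then discarded,
# so only about one million rows are in memory at any time.
for chunk in pd.read_csv('test.csv', chunksize=1000000):
    process_and_write(chunk)  # hypothetical helper: the per-chunk loop above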

Note 2: If the MemoryError persists, try a smaller chunksize.

Answered by Massifox

