
Python: Unstacked DataFrame is too big, causing int32 overflow

I have a big dataset, and when I try to run this code I get an error.

user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

Here is the error:

ValueError: Unstacked DataFrame is too big, causing int32 overflow

I ran it on another machine and it worked fine. How can I fix this error?

Hamid asked Jan 17 '26 21:01


2 Answers

As @Ehsan pointed out, we can pivot the tables in chunks.

Suppose you have a DataFrame df with 3,355,205 rows.
Let's split it into chunks of 5000 rows:

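chunk_size = 5000
n_rows = df.shape[0]
# start position of every chunk, including the final partial one
chunks = list(range(0, n_rows, chunk_size))

for start in chunks:
    # first and last row position (inclusive) of each chunk
    print(start, min(start + chunk_size, n_rows) - 1)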

0 4999
5000 9999
10000 14999
15000 19999
20000 24999
25000 29999
30000 34999
35000 39999
40000 44999
45000 49999
50000 54999
55000 59999
60000 64999
65000 69999
70000 74999
75000 79999
..continue..

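All that is left is a list comprehension inside pd.concat(). Note that iloc already excludes the end position, so each slice runs from start to start + chunk_size, and the last slice simply stops at the end of the DataFrame:

df_new = pd.concat([df.iloc[start:start + chunk_size].pivot(index='user_id', columns='item', values='views') for start in chunks])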
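One caveat with concatenating per-chunk pivots: a user whose rows fall into more than one chunk ends up with several rows in the result. Below is a minimal sketch of one way to merge them, assuming the same column names as above (user_id, item, views). The helper name pivot_in_chunks and the toy data are made up for illustration.

```python
import pandas as pd


def pivot_in_chunks(df, chunk_size, index="user_id", columns="item", values="views"):
    """Pivot df piece by piece, then merge rows that share an index value."""
    pieces = [
        df.iloc[start:start + chunk_size].pivot(index=index, columns=columns, values=values)
        for start in range(0, len(df), chunk_size)
    ]
    wide = pd.concat(pieces)
    # A user split across chunks appears once per chunk; collapse those rows.
    # max() skips NaN, so each cell keeps the value that was actually observed.
    return wide.groupby(level=0).max()


# Toy data (hypothetical): user 1 straddles the chunk boundary.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 1, 3],
    "item": ["a", "b", "a", "c", "b"],
    "views": [5, 3, 4, 2, 1],
})
result = pivot_in_chunks(toy, chunk_size=3)
print(result)
```

pivot() still raises a ValueError if the same (user_id, item) pair occurs twice inside one chunk, so this assumes the pairs are unique or were aggregated beforehand. The merged result is still a dense table, so it only helps if that table fits in memory.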

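This approach is useful when you need to build a sparse matrix for a recommendation system.
After the pivot you can do:

from scipy import sparse
# pivot leaves NaN for missing pairs; replace them with 0 so they are not stored as explicit entries
spr = sparse.coo_matrix(df_new.fillna(0).to_numpy())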
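If the sparse matrix is the real goal, the dense pivot can be skipped altogether. This is a sketch under the assumption that the data is in long format with the same user_id, item and views columns; the toy values are invented. pd.factorize maps each user and item to a consecutive integer position, and coo_matrix takes the values together with their (row, column) positions.

```python
import pandas as pd
from scipy import sparse

# Toy stand-in for the real data (hypothetical values).
df = pd.DataFrame({
    "user_id": [10, 10, 20, 30],
    "item": ["a", "b", "a", "c"],
    "views": [5, 3, 4, 1],
})

# Map each user and item to a consecutive integer position.
row_codes, users = pd.factorize(df["user_id"])
col_codes, items = pd.factorize(df["item"])

# Only the observed (user, item) pairs are stored; nothing dense is built.
spr = sparse.coo_matrix(
    (df["views"].to_numpy(), (row_codes, col_codes)),
    shape=(len(users), len(items)),
)
print(spr.shape, spr.nnz)
```

users[i] and items[j] give back the label for row i and column j. Keep in mind that repeated (user_id, item) pairs are summed when a COO matrix is converted to CSR or to a dense array, so aggregate first (for example with groupby(...).max()) if the sum is not what you want.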
igorkf answered Jan 20 '26 10:01


According to what I found on Google, you can downgrade pandas to version 0.21, which does not have this problem when pivoting very large data.

Guillermo Mosse answered Jan 20 '26 09:01