Python: Unstacked DataFrame is too big, causing int32 overflow

Question

I have a large dataset, and when I run the following code it fails with an error.

user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

Here is the error:

ValueError: Unstacked DataFrame is too big, causing int32 overflow

I ran the same code on another machine and it worked fine. How can I fix this error?

igorkf · Accepted Answer

As @Ehsan pointed out, we can pivot the tables in chunks.

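Suppose you have a DataFrame with 3,355,205 rows.
Let's build chunks of size 5000:

chunk_size = 5000
# Start position of each chunk; the final chunk may hold fewer than chunk_size rows.
chunks = list(range(0, df.shape[0], chunk_size))

for start in chunks:
    # First and last row position (inclusive) of each chunk.
    print(start, min(start + chunk_size, df.shape[0]) - 1)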

0 4999
5000 9999
10000 14999
15000 19999
20000 24999
25000 29999
30000 34999
35000 39999
40000 44999
45000 49999
50000 54999
55000 59999
60000 64999
65000 69999
70000 74999
75000 79999
(and so on)

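All you have to do now is a list comprehension inside pd.concat(). Note that iloc slices exclude their end position, so each slice runs from start up to (but not including) start + chunk_size:

df_new = pd.concat([
    df.iloc[start:start + chunk_size].pivot(index='user_id', columns='item', values='views')
    for start in range(0, df.shape[0], chunk_size)
])

Because the chunks are cut by row position, a user whose rows straddle a chunk boundary shows up on more than one row of df_new. Such rows can be merged afterwards, for example with df_new.groupby(level=0).max().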
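To see the chunked pivot on data small enough to inspect, here is a minimal sketch. The toy values and the tiny chunk_size are made up for illustration; the column names follow the example above. It also shows one possible way to merge the rows of a user that ended up in two chunks.

```python
import pandas as pd

# Toy long-format table (illustrative values only).
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3],
    "item":    ["a", "b", "a", "c", "a", "b", "c"],
    "views":   [5, 3, 2, 4, 1, 7, 6],
})

chunk_size = 3  # tiny on purpose, so users 2 and 3 straddle a chunk boundary

# Pivot each slice of rows separately, then stack the partial tables.
pieces = [
    df.iloc[start:start + chunk_size].pivot(index="user_id", columns="item", values="views")
    for start in range(0, len(df), chunk_size)
]
df_new = pd.concat(pieces)

# Users split across chunks appear on several rows; collapse them into one.
combined = df_new.groupby(level=0).max()

# Single-shot pivot of the same data, for comparison.
expected = df.pivot(index="user_id", columns="item", values="views")
```

On this toy input, combined holds the same values as expected, while df_new still has repeated index entries for users 2 and 3.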

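This approach is useful when you need to build a sparse matrix for a recommender system.
After this you could do the following (filling the missing cells with 0 first, because NaN would otherwise be stored as an explicit entry):

from scipy import sparse
spr = sparse.coo_matrix(df_new.fillna(0).to_numpy())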
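If the sparse matrix is the actual goal, a different route is to skip the wide DataFrame altogether and build the matrix straight from the long table. This is not the chunking method described above, just a sketch of an alternative: it assumes the question's column names (user_id, movie_id, rating), uses made-up toy values, and relies on pd.factorize to turn labels into row and column positions.

```python
import pandas as pd
from scipy import sparse

# Toy data with the question's column names (values are illustrative).
user_items = pd.DataFrame({
    "user_id":  [10, 10, 20, 30, 30],
    "movie_id": [7, 9, 7, 8, 9],
    "rating":   [4.0, 5.0, 3.0, 2.0, 1.0],
})

# Same aggregation as in the question, but kept in long form (no unstack).
ratings = user_items.groupby(["user_id", "movie_id"])["rating"].max().reset_index()

# factorize maps every distinct label to a consecutive integer code
# and returns the labels so positions can be mapped back later.
row_codes, users = pd.factorize(ratings["user_id"])
col_codes, movies = pd.factorize(ratings["movie_id"])

# Only the observed (user, movie) pairs are stored.
spr = sparse.coo_matrix(
    (ratings["rating"].to_numpy(), (row_codes, col_codes)),
    shape=(len(users), len(movies)),
)
```

Here users[i] and movies[j] give the labels behind row i and column j, and memory use grows with the number of ratings rather than with users times movies.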

Guillermo Mosse · Answer

According to what I found on Google, you can downgrade pandas to version 0.21, which reportedly does not raise this error when pivoting very large data.
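Before changing versions, it may help to estimate how large the unstacked table would be. The sketch below assumes, as the error message suggests, that the limit concerns the number of cells (rows times columns) compared with the largest int32 value; the exact check may differ between pandas versions. The helper name unstack_cells is hypothetical and the data is a toy stand-in.

```python
import numpy as np
import pandas as pd

INT32_MAX = np.iinfo(np.int32).max  # largest value a signed 32-bit integer can hold


def unstack_cells(df, row_key, col_key):
    """Cells a dense row_key x col_key table would hold (hypothetical helper)."""
    return int(df[row_key].nunique()) * int(df[col_key].nunique())


# Toy stand-in for the question's user_items table.
user_items = pd.DataFrame({
    "user_id":  [1, 1, 2],
    "movie_id": [10, 11, 10],
    "rating":   [3, 4, 5],
})

cells = unstack_cells(user_items, "user_id", "movie_id")
too_big = cells > INT32_MAX  # True would suggest unstack() is likely to hit the limit
```

For scale, 50,000 users by 50,000 movies already gives 2.5 billion cells, which is above the int32 maximum.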

Tags: python, pandas, data-analysis, data-science

Hamid
