Pandas GroupBy Mean of Large DataSet in CSV

Question

A common SQL idiom is "SELECT A, AVG(X) FROM table GROUP BY A", and I would like to replicate it in pandas. Suppose the data is stored in something like a CSV file and is too big to load into memory.

If the CSV fit in memory, a simple two-liner (plus the import) would suffice:

import pandas

data=pandas.read_csv("report.csv")
mean=data.groupby(data.A).mean()

When the CSV cannot be read into memory, one might try:

chunks=pandas.read_csv("report.csv",chunksize=whatever)
cmeans=pandas.concat([chunk.groupby(chunk.A).mean() for chunk in chunks])
# A is the index of cmeans at this point, so group on the index level
badMeans=cmeans.groupby(level=0).mean()

Except that the resulting cmeans table contains repeated entries for each distinct value of A, one for each chunk in which that value appears (read_csv's chunksize knows nothing about the grouping fields). As a result, the final badMeans table has the wrong answer: averaging the per-chunk means weights every chunk equally, whereas the correct result is a weighted mean, with each chunk's mean weighted by its row count for that value of A.
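To make the problem concrete, here is a minimal sketch with made-up numbers (the two chunks and the column names A and X are purely illustrative, not taken from report.csv). Group "a" has one row in the first chunk and three in the second, so the plain mean of the chunk means comes out as 6.0, while the true mean over all four rows, and the version that carries sums and counts, is 4.0.

```python
import pandas as pd

# Two hypothetical chunks; group "a" has 1 row in the first and 3 in the second.
chunk1 = pd.DataFrame({"A": ["a"], "X": [10.0]})
chunk2 = pd.DataFrame({"A": ["a", "a", "a"], "X": [2.0, 2.0, 2.0]})
chunks = (chunk1, chunk2)

# Mean of per-chunk means: every chunk counts equally, whatever its size.
chunk_means = pd.concat([c.groupby("A")["X"].mean() for c in chunks])
mean_of_means = chunk_means.groupby(level=0).mean()

# True mean over all rows at once.
true_mean = pd.concat(chunks).groupby("A")["X"].mean()

# Weighted version: carry per-chunk sums and counts, divide once at the end.
stats = pd.concat([c.groupby("A")["X"].agg(["sum", "count"]) for c in chunks])
totals = stats.groupby(level=0).sum()
weighted_mean = totals["sum"] / totals["count"]

print(mean_of_means["a"], true_mean["a"], weighted_mean["a"])
```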

So a working approach seems to be something like:

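chunks=pandas.read_csv("report.csv",chunksize=whatever)  # re-create: the iterator above is exhausted
final=pandas.DataFrame({"A":[],"tot":[],"cnt":[]})
for chunk in chunks:
    t=chunk.groupby(chunk.A).X.sum()    # per-chunk sum of X for each A
    c=chunk.groupby(chunk.A).X.count()  # per-chunk count of X for each A
    cmean=pandas.DataFrame({"tot":t,"cnt":c}).reset_index()
    joined=pandas.concat([final,cmean])
    final=joined.groupby("A").sum().reset_index()

mean=final.tot/final.cnt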

Am I missing something? This seems insanely complicated... I would rather write a for loop that processes a CSV line by line than deal with this. There has to be a better way.
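For comparison, the line-by-line loop mentioned above might look roughly like the sketch below. It uses only the standard library and assumes columns named A and X (taken from the SQL statement), that X is always numeric, and that no values are missing; the function name grouped_mean and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def grouped_mean(lines, key="A", value="X"):
    """Stream rows once, keeping only a running sum and count per group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        totals[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}

# Tiny in-memory stand-in for a real file; with a file on disk you would
# pass open("report.csv", newline="") instead.
sample = io.StringIO("A,X\na,10\na,2\nb,5\na,3\n")
means = grouped_mean(sample)
print(means)
```

Memory use grows with the number of distinct groups rather than the number of rows, which is the same property the sum-and-count approach has.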

Karl D. · Accepted Answer

I think you could do something like the following, which seems a bit simpler to me. I made the following data:

id,val
A,2
A,5
B,4
A,2
C,9
A,7
B,6
B,1
B,2
C,4
C,4
A,6
A,9
A,10
A,11
C,12
A,4
A,4
B,6
B,5
C,7
C,8
B,9
B,10
B,11
A,20

I'll do chunks of 5:

import pandas as pd

chunks = pd.read_csv("foo.csv",chunksize=5)
pieces = [x.groupby('id')['val'].agg(['sum','count']) for x in chunks]

agg = pd.concat(pieces).groupby(level=0).sum()
print(agg['sum']/agg['count'])

id
A     7.272727
B     6.000000
C     7.333333

Compared to the non-chunk version:

df = pd.read_csv('foo.csv')
print(df.groupby('id')['val'].mean())

id
A     7.272727
B     6.000000
C     7.333333
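If even the list of per-chunk pieces is a concern for a very large file, one possible variation is to fold each chunk into a single running table instead of collecting them all first. This is only a sketch: the helper name chunked_group_mean is hypothetical, and the inline CSV reuses the first seven rows of the sample data so the block runs without a file on disk.

```python
import io
import pandas as pd

def chunked_group_mean(source, key, value, chunksize):
    """Group mean computed chunk by chunk, keeping one running sum/count table."""
    running = None
    for chunk in pd.read_csv(source, chunksize=chunksize):
        part = chunk.groupby(key)[value].agg(["sum", "count"])
        # add() aligns on the group index; fill_value=0 covers groups
        # that are present on only one side.
        running = part if running is None else running.add(part, fill_value=0)
    return running["sum"] / running["count"]

# First seven rows of the sample data, read in chunks of 3.
csv_text = "id,val\nA,2\nA,5\nB,4\nA,2\nC,9\nA,7\nB,6\n"
result = chunked_group_mean(io.StringIO(csv_text), "id", "val", chunksize=3)
print(result)
```

The running table has one row per distinct group, so its size does not depend on how many chunks the file is split into.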

Tags: python, pandas

user116948
