A common SQL idiom is "SELECT A, AVG(X) FROM table GROUP BY A", and I would like to replicate it in pandas. Suppose the data is stored in something like a CSV file and is too big to be loaded into memory.
If the CSV could fit in memory a simple two-liner would suffice:
import pandas
data = pandas.read_csv("report.csv")
mean = data.groupby(data.A).mean()
When the CSV cannot be read into memory one might try:
chunks = pandas.read_csv("report.csv", chunksize=whatever)
cmeans = pandas.concat([chunk.groupby(chunk.A).mean() for chunk in chunks])
badMeans = cmeans.groupby(level=0).mean()
Except that the resulting cmeans table contains a separate entry for each distinct value of A per chunk in which that value appears (read_csv's chunksize knows nothing about the grouping column). As a result the final badMeans table has the wrong answer: it averages the per-chunk means without weighting them by the number of rows each chunk contributed.
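A tiny made-up example (values invented for illustration) makes the failure concrete: if a group's rows are split three-and-one across two chunks, the mean of the per-chunk means differs from the true mean.
import pandas

# Hypothetical group "a" split across two chunks: 3 rows, then 1 row.
chunk1 = pandas.DataFrame({"A": ["a", "a", "a"], "X": [2, 4, 6]})  # chunk mean 4
chunk2 = pandas.DataFrame({"A": ["a"], "X": [8]})                  # chunk mean 8

naive = pandas.concat(
    [c.groupby("A").X.mean() for c in (chunk1, chunk2)]
).groupby(level=0).mean()
# naive["a"] == 6.0, but the true mean is (2 + 4 + 6 + 8) / 4 == 5.0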
So a working approach seems to be something like:
final = pandas.DataFrame({"A": [], "tot": [], "cnt": []})
for chunk in chunks:
    t = chunk.groupby(chunk.A).X.sum()    # per-chunk total of X for each A
    c = chunk.groupby(chunk.A).X.count()  # per-chunk row count for each A
    cmean = pandas.DataFrame({"tot": t, "cnt": c}).reset_index()
    joined = pandas.concat([final, cmean])
    final = joined.groupby(joined.A).sum().reset_index()
mean = final.tot / final.cnt
Am I missing something? This seems insanely complicated... I would rather write a for loop that processes a CSV line by line than deal with this. There has to be a better way.
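For reference, the line-by-line loop I have in mind would be something like this (a rough sketch, assuming the file has columns named A and X as in the SQL above):
import csv
from collections import defaultdict

tot = defaultdict(float)  # running sum of X for each value of A
cnt = defaultdict(int)    # running row count for each value of A
with open("report.csv") as f:
    for row in csv.DictReader(f):
        tot[row["A"]] += float(row["X"])
        cnt[row["A"]] += 1
means = {a: tot[a] / cnt[a] for a in tot}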
I think you could do something like the following, which seems a bit simpler to me. I made the following data:
id,val
A,2
A,5
B,4
A,2
C,9
A,7
B,6
B,1
B,2
C,4
C,4
A,6
A,9
A,10
A,11
C,12
A,4
A,4
B,6
B,5
C,7
C,8
B,9
B,10
B,11
A,20
I'll do chunks of 5:
import pandas as pd
chunks = pd.read_csv("foo.csv", chunksize=5)
pieces = [x.groupby('id')['val'].agg(['sum', 'count']) for x in chunks]
agg = pd.concat(pieces).groupby(level=0).sum()
print(agg['sum'] / agg['count'])
id
A    7.272727
B    6.000000
C    7.333333
Compared to the non-chunk version:
df = pd.read_csv('foo.csv')
print(df.groupby('id')['val'].mean())
id
A    7.272727
B    6.000000
C    7.333333
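The reason this works is that sums and counts, unlike means, combine correctly across chunks. If you do this often, the pattern could be wrapped in a small helper; the function below is just an illustrative sketch (the name and signature are mine, not part of pandas):
import pandas as pd

def chunked_group_mean(path, key, val, chunksize=10000):
    # Per-chunk sums and counts are cheap to hold in memory and can be
    # summed across chunks; the division happens once at the end.
    pieces = [
        chunk.groupby(key)[val].agg(['sum', 'count'])
        for chunk in pd.read_csv(path, chunksize=chunksize)
    ]
    agg = pd.concat(pieces).groupby(level=0).sum()
    return agg['sum'] / agg['count']

# e.g. chunked_group_mean("foo.csv", "id", "val", chunksize=5)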