I am using the pandas module. My DataFrame has three fields: account, month, and salary.
    account month              salary
    1       201501             10000
    2       201506             20000
    2       201506             20000
    3       201508             30000
    3       201508             30000
    3       201506             10000
    3       201506             10000
    3       201506             10000
    3       201506             10000
I am doing a groupby on account and month and converting each salary to its proportion of the total salary of the group it belongs to:
MyDataFrame['salary'] = MyDataFrame.groupby(['account', 'month'])['salary'].transform(lambda x: x/x.sum())
Now MyDataFrame becomes the table below:
    account month              salary
    1       201501             1
    2       201506             .5
    2       201506             .5
    3       201508             .5
    3       201508             .5
    3       201506             .25
    3       201506             .25
    3       201506             .25
    3       201506             .25
The problem is that this operation on 50 million such rows takes 3 hours. I executed the groupby separately and it is fast, taking only about 5 seconds, so I think it is the transform that is taking so long here. Is there any way to improve the performance?
Update: To provide more clarity, here is an example. Some account holder received a salary of 2000 in June and 8000 in July, so his proportion becomes .2 for June and .8 for July. My purpose is to calculate this proportion.
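To make that concrete, here is a minimal sketch of that two-row example (the DataFrame below is made up just for illustration; grouping by account alone reproduces the .2/.8 split described in this update):

import pandas as pd

# Made-up two-row example: one account paid 2000 in June and 8000 in July
ex = pd.DataFrame({'account': [1, 1],
                   'month':   [201506, 201507],
                   'salary':  [2000.0, 8000.0]})

# proportion of the account's total salary that falls in each month
ex['salary'] = ex['salary'] / ex.groupby('account')['salary'].transform('sum')
# expected result: 0.2 for June (201506) and 0.8 for July (201507)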
Well, you need to be more explicit and show exactly what you are doing. This is something pandas excels at.
Note for @Uri Goren: this is a constant-memory process that only holds one group in memory at a time, so it scales linearly with the number of groups. Sorting is also unnecessary.
In [20]: np.random.seed(1234)
In [21]: ngroups = 1000
In [22]: nrows = 50000000
In [23]: dates = pd.date_range('20000101',freq='MS',periods=ngroups)
In [24]: df = pd.DataFrame({'account' : np.random.randint(0,ngroups,size=nrows),
                 'date' : dates.take(np.random.randint(0,ngroups,size=nrows)),
                 'values' : np.random.randn(nrows) })
In [25]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000000 entries, 0 to 49999999
Data columns (total 3 columns):
account    int64
date       datetime64[ns]
values     float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 1.5 GB
In [26]: df.head()
Out[26]: 
   account       date    values
0      815 2048-02-01 -0.412587
1      723 2023-01-01 -0.098131
2      294 2020-11-01 -2.899752
3       53 2058-02-01 -0.469925
4      204 2080-11-01  1.389950
In [27]: %timeit df.groupby(['account','date']).sum()
1 loops, best of 3: 8.08 s per loop
If you want to transform the output, then do it like this:
In [37]: g = df.groupby(['account','date'])['values']
In [38]: result = 100*df['values']/g.transform('sum')
In [41]: result.head()
Out[41]: 
0     4.688957
1    -2.340621
2   -80.042089
3   -13.813078
4   -70.857014
dtype: float64
In [43]: len(result)
Out[43]: 50000000
In [42]: %timeit 100*df['values']/g.transform('sum')
1 loops, best of 3: 30.9 s per loop
Takes a bit longer, but again this should be a relatively fast operation.
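Applied to the question's frame, the slow per-group Python lambda can be replaced by the built-in 'sum' transform and a single vectorized division. A minimal sketch, rebuilding the small example table from the question:

import pandas as pd

# Rebuild the question's example data
MyDataFrame = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508,
                201506, 201506, 201506, 201506],
    'salary':  [10000, 20000, 20000, 30000, 30000,
                10000, 10000, 10000, 10000]})

# transform('sum') is a cythonized aggregation and the division is vectorized,
# so no Python-level lambda runs once per group
g = MyDataFrame.groupby(['account', 'month'])['salary']
MyDataFrame['salary'] = MyDataFrame['salary'] / g.transform('sum')
# yields 1, .5, .5, .5, .5, .25, .25, .25, .25 as in the expected table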
I would use a different approach. First, sort:
MyDataFrame.sort_values(['account', 'month'], inplace=True)  # put rows of the same (account, month) group next to each other
Then iterate and sum:
(account, month) = (None, None)  # sentinel values that match no real group
salary = 0.0
res = []
for index, row in MyDataFrame.iterrows():
  if (row['account'], row['month']) == (account, month):
    salary += row['salary']
  else:
    if account is not None:
      res.append([account, month, salary])  # close out the previous group
    (account, month) = (row['account'], row['month'])
    salary = row['salary']  # start the new group with this row's salary
if account is not None:
  res.append([account, month, salary])  # append the final group
df = pd.DataFrame(res, columns=['account', 'month', 'salary'])
This way, pandas doesn't need to hold the grouped data in memory.
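The loop above produces one total per (account, month) group. To turn those totals back into the per-row proportions the question asks for, one option (a sketch with a made-up group_total column name, not part of the original approach) is to merge the totals back and divide:

# merge the per-group totals back onto the original frame and divide
totals = df.rename(columns={'salary': 'group_total'})
MyDataFrame = MyDataFrame.merge(totals, on=['account', 'month'], how='left')
MyDataFrame['salary'] = MyDataFrame['salary'] / MyDataFrame['group_total']
MyDataFrame = MyDataFrame.drop(columns=['group_total'])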