Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas groupby transpose

I have a file from SAP which wasn't the prettiest one when dealing with data. So using series.str.contains() and boolean masks I have managed to narrow down to a dataframe looking like below:

       0        1
0    SUB      123
1    CAT      SKU
2   CODE  1000123
3   CODE  1000234
4    SUB      456
5    CAT      LIQ
6  CODE1  1000345
7  CODE1  1000534
8  CODE1  1000433

I am looking for a way where I can separate each SUB into a new entry like below:

print(expected_df)

   SUB  CAT       CODE      CODE1
0  123  SKU  1000123.0        NaN
1  123  SKU  1000234.0        NaN
2  456  LIQ        NaN  1000345.0
3  456  LIQ        NaN  1000534.0
4  456  LIQ        NaN  1000433.0

I just cant seem to get pass this step. However, this line:

df[0].eq('SUB').cumsum()

helps to segregate the groups and can be used as a helper series if needed.

Any help in transposing the data as shown would be really appreciated.

Thanks.

like image 646
anky Avatar asked Nov 21 '25 10:11

anky


2 Answers

You can try of using df.pivot followed by .ffill(),bfill() for the specific 'SUB' column group rows.

df1 = df.pivot(columns='0')
df1.columns = df1.columns.map(lambda x: x[1])
df1.SUB = df1.SUB.ffill()
df1.groupby('SUB').ffill().groupby('SUB').bfill().drop_duplicates()
#5.89 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# as time constraints, without use of lambda operation
#df1.groupby(df1.SUB.ffill()).apply(lambda x: x.ffill().bfill()).drop_duplicates()
#16 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Out:

    SUB CAT CODE    CODE1   SUB
2   123 SKU 1000123 NaN     123
3   123 SKU 1000234 NaN      123
6   456 LIQ NaN     1000345 456
7   456 LIQ NaN     1000534 456
8   456 LIQ NaN     1000433 456
like image 62
Naga kiran Avatar answered Nov 23 '25 02:11

Naga kiran


IIUC,

df.set_index('col1').groupby(df.col1.eq('SUB').cumsum().values).apply(lambda s: pd.DataFrame({
    'SUB': s.loc['SUB'].item(),
    'CAT': s.loc['CAT'].item(),
     s.index[2]: s.loc[s.index[2]].col2.tolist()
})).reset_index(drop=True)

Outputs

    SUB CAT CODE    CODE1
0   123 SKU 1000123 NaN
1   123 SKU 1000234 NaN
2   456 LIQ NaN     1000345
3   456 LIQ NaN     1000534
4   456 LIQ NaN     1000433

However, this looks like an XY problem. Maybe it's worth taking a look into how you ended up with this df in the first place

like image 31
rafaelc Avatar answered Nov 23 '25 00:11

rafaelc