Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pivoting multiple rows based on sorted values in dataframe

I am working with a dataframe that looks something like this:

df = pd.DataFrame({'ID':['A','A','A','A','B','B','B','B'],'X':[1.2,2.1,3.8,4.5,5.8,6.2,7,8.2],'Y':[10,20,30,40,50,60,70,80],'IsTrue':[1,1,0,0,1,0,0,1],'IdxVar':[1,0,0,0,0,0,0,1]})
df["DistanceToIdx"] = np.sqrt((df.X - df.X[df.groupby('ID')['IdxVar'].transform('idxmax')].reset_index(drop=True))**2 \
                        +(df.Y - df.Y[df.groupby('ID')['IdxVar'].transform('idxmax')].reset_index(drop=True))**2)

I am trying to create a new df that has a single row per ID value, based on IdxVar. Simple enough

newdf = df.loc[df.IdxVar==1,:]

I want my new columns to be some X0_1, X0_2, X1_1, X0_2, where the Xi_j can be seen as the X values for i = IsTrue (either 0 or 1) and the j indicates the sorted index of DistanceToIdx (X0_1 indicates the X value in a given ID that has IsTrue = 0 and the lowest DistanceToIdx).

I can do this through groupby():

groupdf = df.groupby('ID')

for name,group in groupdf:
    for i in range(2):

        newdf.loc[newdf.ID==name, 'X0_{}'.format(i+1)] = \
            group.sort_values(by=['IsTrue','DistanceToIdx'],ascending=True)['X'].values[i]

        newdf.loc[newdf.ID==name, 'X1_{}'.format(i+1)] = \
            group.sort_values(by=['IsTrue','DistanceToIdx'],ascending=True)['X'].values[i+2]

This gives the desired output, but if I want to apply it to more variables then shown here and then loop through 100,000 groups, my loop is taking way too long.

I was wondering if there would be a way to speed this up with just the groupby function. My initial thought was just to figure out the pivot function, but since I want to order my new columns based on an existing row, I am not too confident based on the documentation that it would work.

like image 870
yankeefan11 Avatar asked Feb 27 '26 02:02

yankeefan11


1 Answers

As you already created newdf. I come up with solution using nsmallest to get 2 smallest values of each group, unstack and flatten multiindex columns. Finally, merge back to newdf

df1 = (df.set_index('X').groupby(['ID', 'IsTrue']).DistanceToIdx.nsmallest(2).
          reset_index(level=-1).drop('DistanceToIdx', 1))
s = df1.groupby(level=[0,1]).cumcount().add(1)
df2 = df1.set_index(s, append=True).unstack([1,2]).sort_index(level=2, axis=1)
df2.columns = df2.columns.map('{0[0]}{0[1]}_{0[2]}'.format)

df_final = newdf.merge(df2.reset_index(), on='ID')


Out[239]:

  ID    X   Y  IsTrue  IdxVar  DistanceToIdx  X0_1  X1_1  X0_2  X1_2
0  A  1.2  10       1       1            0.0   3.8   1.2   4.5   2.1
1  B  8.2  80       1       1            0.0   7.0   8.2   6.2   5.8
like image 96
Andy L. Avatar answered Mar 01 '26 16:03

Andy L.



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!