
Pandas comparing dataframe with student results against historic quantiles

Tags: python, pandas

I have two dataframes. One shows student test results by class on two tests:

import pandas as pd

results = pd.DataFrame({
    'id':[1,2,3],
    'class':[1,1,2],
    'test_1':[0.67,0.88,0.33],
    'test_2':[0.76,0.63,0.78]})
results
   id  class  test_1  test_2
0   1      1    0.67    0.76
1   2      1    0.88    0.63
2   3      2    0.33    0.78

The other shows quantiles by class and test, based on previous semesters:

quantiles = pd.DataFrame({'class':[1,2],
'test_1_0.25':[0.23,0.31],
'test_1_0.5':[0.54,0.67],
'test_1_0.75':[0.8,0.9],
'test_2_0.25':[0.23,0.31],
'test_2_0.5':[0.54,0.67],
'test_2_0.75':[0.8,0.9]})
  class  test_1_0.25  test_1_0.5  test_1_0.75  test_2_0.25  test_2_0.5  \
0      1         0.23        0.54          0.8         0.23        0.54   
1      2         0.31        0.67          0.9         0.31        0.67

   test_2_0.75  
0          0.8  
1          0.9

I would like to return a dataframe that tells me which quantile each student places in on each test: 0 if the score is below the 0.25 quantile, 1 if below the 0.5 quantile, 2 if below the 0.75 quantile, and 3 if above the 0.75 quantile. So the output would look like this:

   id  test_1_quantile  test_2_quantile  
0   1                2                2   
1   2                3                2   
2   3                1                2  

Any help is much appreciated. Thanks

L Xandor asked Jan 20 '26 02:01


1 Answer

First, combine the two DataFrames with DataFrame.merge, then loop over the test columns. For each test, select its score and quantile columns with DataFrame.filter, insert a column of zeros to act as the lower bound below the 0.25 quantile, rename the threshold columns to the output labels 0 to 3, and compare each threshold with the score using DataFrame.lt. Finally, reverse the column order with iloc and use idxmax to get the label of the first True value, which replaces the test column:

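# attach each student's class quantiles to their results
df = pd.merge(results, quantiles, on='class')

for t in results.columns.difference(['id','class']):
    # score column plus its quantile columns, e.g. test_1, test_1_0.25, ...
    df1 = df.filter(like=t)
    # zero column = lower bound for scores below the 0.25 quantile
    df1.insert(1, t + '_0', 0)
    # relabel the threshold columns as the output values 0..3
    df1.columns = [t] + list(range(4))
    # True where threshold < score; reversed so idxmax finds the highest such threshold
    a = df1.iloc[:, 1:].lt(df1[t], axis=0).iloc[:, ::-1].idxmax(axis=1)
    # overwrite the score with its quantile label
    df[t] = a

print(df[results.columns])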
   id  class  test_1  test_2
0   1      1       2       2
1   2      1       3       2
2   3      2       1       2
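
If you want the exact shape shown in the question (one `_quantile` column per test), an alternative is to count how many cutoffs lie strictly below each score. This is only a minimal sketch and not part of the code above. It assumes the quantile columns follow the `<test>_<q>` naming from the question, and the names `tests`, `cutoffs` and `out` are just illustrative:

```python
import pandas as pd

results = pd.DataFrame({
    'id': [1, 2, 3],
    'class': [1, 1, 2],
    'test_1': [0.67, 0.88, 0.33],
    'test_2': [0.76, 0.63, 0.78]})

quantiles = pd.DataFrame({
    'class': [1, 2],
    'test_1_0.25': [0.23, 0.31],
    'test_1_0.5': [0.54, 0.67],
    'test_1_0.75': [0.8, 0.9],
    'test_2_0.25': [0.23, 0.31],
    'test_2_0.5': [0.54, 0.67],
    'test_2_0.75': [0.8, 0.9]})

# one row per student, with that student's class cutoffs alongside the scores
df = results.merge(quantiles, on='class')

tests = ['test_1', 'test_2']   # assumed list of test columns
out = df[['id']].copy()

for t in tests:
    # the three cutoff columns for this test (naming is an assumption)
    cutoffs = df[[f'{t}_0.25', f'{t}_0.5', f'{t}_0.75']]
    # number of cutoffs strictly below the score: 0, 1, 2 or 3
    # (a score equal to a cutoff lands in the lower bucket)
    out[f'{t}_quantile'] = cutoffs.lt(df[t], axis=0).sum(axis=1)

print(out)
```

For the sample data this gives the same labels as the loop above (test_1: 2, 3, 1; test_2: 2, 2, 2). If a score exactly on a cutoff should go into the upper bucket instead, swap `lt` for `le`.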
jezrael answered Jan 22 '26 14:01