 

Fastest way to extract relevant rows - pandas

Tags: python, pandas

I've got a very large pandas df with the fields group, id_a, id_b and score, ordered by score so the highest is at the top. There is a row for every possible combination of id_a and id_b. I want to extract rows so that each id_a and each id_b appears in only one row, reflecting the highest scores possible without repeating IDs.

The example below shows what this looks like: the resulting df has 3 rows, with every ID in id_a and id_b appearing exactly once. For A2/B2 and A1/B1, the row with the best score for both IDs has been used. For A3, its best score was in a row with B1, which had already been used, so its next-best score, paired with B3, is used.

Input table (the sample df built in the code below):

   group id_a id_b  score
0      1   A2   B2   0.99
1      1   A1   B1   0.98
2      1   A3   B1   0.97
3      1   A3   B3   0.96
4      1   A2   B1   0.93
5      1   A1   B2   0.50
6      1   A1   B3   0.41
7      1   A2   B3   0.40
8      1   A3   B2   0.20

Desired result:

   group id_a id_b  score
0      1   A2   B2   0.99
1      1   A1   B1   0.98
2      1   A3   B3   0.96

To achieve this, I've got a loop iterating through the original df. This is incredibly slow with a large dataset. I've tried coming up with alternatives but I am struggling, e.g.:

  • Once a row is identified for a pair of IDs, I could remove those IDs from the original df, but I'm not sure how to do this without restarting the loop (a set-based sketch of this idea appears after the code below)
  • I could split things by group (the example only has 1 group but there would be lots more, with IDs unique across groups) - however this doesn't seem to save time

Can anybody offer any other approaches? Thank you!

import pandas as pd
# Create sample df
group = [1, 1, 1, 1, 1, 1, 1, 1, 1]
id_a = ['A2', 'A1', 'A3', 'A3', 'A2', 'A1', 'A1', 'A2', 'A3']
id_b = ['B2', 'B1', 'B1', 'B3', 'B1', 'B2', 'B3', 'B3', 'B2']
score = [0.99, 0.98, 0.97, 0.96, 0.93, 0.5, 0.41, 0.4, 0.2]
df = pd.DataFrame({'group': group, 'id_a': id_a, 'id_b': id_b, 'score': score})

result = pd.DataFrame(columns=df.columns)

# Extract required rows
for i, row in df.iterrows():
    if len(result) == 0:
        result = row.to_frame().T
    else:
        if ((row['id_a'] in result['id_a'].tolist())
                or (row['id_b'] in result['id_b'].tolist())):
            continue
        else:
            result = pd.concat([result, row.to_frame().T[result.columns]])
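
For reference, the same greedy pass runs much faster if the already-used IDs are tracked in sets (O(1) membership tests instead of scanning the growing result on every iteration). This is a sketch of the question's own greedy loop, not of the accepted answer, and it assumes df is already sorted by score descending, as stated:

# Greedy selection with set lookups; assumes df is sorted by score descending
# and (per the question) IDs are unique across groups
used_a, used_b, rows = set(), set(), []
for row in df.itertuples(index=False):
    if row.id_a not in used_a and row.id_b not in used_b:
        used_a.add(row.id_a)
        used_b.add(row.id_b)
        rows.append(row)
result = pd.DataFrame(rows, columns=df.columns)

Note this reproduces the loop's greedy behaviour; the answer below instead maximizes the total score, which can pick a different set of rows in general (the two coincide on this example).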
asked Oct 24 '25 by Jaccar

1 Answer

This looks to me like a linear_sum_assignment problem (i.e. maximizing the sum of score over unique pairs of id_a/id_b); you could use a pivot per group to do this:

from scipy.optimize import linear_sum_assignment

out = {}
# Pivot once to a (group, id_a) x id_b score matrix, then solve each group
for group, g in (df.pivot(index=['group', 'id_a'],
                          columns='id_b', values='score')
                   .groupby(level=0)):
    # linear_sum_assignment returns the row/column indices of the
    # one-to-one matching that maximizes the total score
    idx, col = linear_sum_assignment(g, maximize=True)
    out[group] = pd.DataFrame({'id_a': g.index.get_level_values(1)[idx],
                               'id_b': g.columns[col],
                               'score': g.values[idx, col]})

out = pd.concat(out, names=['group']).reset_index(0)
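
As a quick check (my addition, not part of the original answer), printing out for the sample df gives the same three pairs as the output shown further below:

print(out)
#    group id_a id_b  score
# 0      1   A1   B1   0.98
# 1      1   A2   B2   0.99
# 2      1   A3   B3   0.96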

Alternative code (same logic, with the steps in a different order). This way is preferred if the id_a/id_b values differ across groups, since it pivots per group and avoids building one large pivoted intermediate:

from scipy.optimize import linear_sum_assignment
import numpy as np

out = []
# Sorting by (id_a, id_b) makes row order within each g match the flattened
# pivot, so ravel_multi_index can map matrix positions back to rows of g
for group, g in df.sort_values(by=['id_a', 'id_b']).groupby('group'):
    tmp = g.pivot(index='id_a', columns='id_b', values='score')
    idx, col = linear_sum_assignment(tmp, maximize=True)
    out.append(g.iloc[np.ravel_multi_index((idx, col), tmp.shape)])
out = pd.concat(out)

Output:

   group id_a id_b  score
0      1   A1   B1   0.98
1      1   A2   B2   0.99
2      1   A3   B3   0.96
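
One caveat (my addition, not covered in the original answer): if some id_a/id_b combinations are missing within a group, the pivot contains NaN, which linear_sum_assignment rejects. A possible workaround, sketched under that assumption, is to fill the gaps with a strongly negative score and drop any such pairs afterwards:

# Hypothetical variant for incomplete pairings: fill missing scores with a
# large negative value so they are never preferred, then mask them out
tmp = g.pivot(index='id_a', columns='id_b', values='score')
idx, col = linear_sum_assignment(tmp.fillna(-1e9), maximize=True)
keep = ~np.isnan(tmp.values[idx, col])  # drop assignments to missing pairs
idx, col = idx[keep], col[keep]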
answered Oct 26 '25 by mozway


