Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Natural sorting in pandas

I have this data in pandas

data = [
        ['ID', 'Time', 'oneMissing', 'singleValue', 'empty', 'oneEmpty'],
        ['CS1-1', 1,  10000, None, None, 0],
        ['CS1-2', 2, 20000, 0.0,  None, 0],
        ['CS1-1', 2, 30000, None, None, 0],
        ['CS1-2', 1,  10000, None, None, None],
        ['CS1-11', 1, None,  0.0,  None, None],
        ['CS1-2', 3, 30000, None, None, None]
    ]

that I try to sort by ID and Time columns so the result should be like

        'CS1-1', 1,  10000, None, None, 0
        'CS1-1', 2, 30000, None, None, 0
        'CS1-2', 1,  10000, None, None, None
        'CS1-2', 2, 20000, 0.0,  None, 0
        'CS1-2', 3, 30000, None, None, None
        'CS1-11', 1, None,  0.0,  None, None
    ]

I'm using pandas dataframe for the sorting, also tried together with natsort, but I can't get it to work. Either I get errors that the index includes duplicates (I use ID as the index) or it sorts by string values.

The ID here is just an example. I don't know what format it will be, it might be NUMBER-LETTER or NUMBER LETTER NUMBER. I just need to compare all numbers as a number. I've looked at "natsort" and that seems to do correct for an array. So I think it should be possible to use that to sort the ID and then re-index the data.

I've looked at multiple sources like these, but without any luck: Alphanumeric sorting Sort dataframes

like image 596
Mackan Avatar asked Oct 31 '25 19:10

Mackan


2 Answers

Use str.extract, sort_values, then use the index to reindex df.

idx = (df.assign(ID2=df.ID.str.extract(r'(\d+)$').astype(int))
         .sort_values(['ID2', 'Time'])
         .index)

df.iloc[idx]

       ID  Time  oneMissing  singleValue empty  oneEmpty
0   CS1-1     1     10000.0          NaN  None       0.0
2   CS1-1     2     30000.0          NaN  None       0.0
3   CS1-2     1     10000.0          NaN  None       NaN
1   CS1-2     2     20000.0          0.0  None       0.0
5   CS1-2     3     30000.0          NaN  None       NaN
4  CS1-11     1         NaN          0.0  None       NaN

This is under the assumption that your ID column follows the pattern "XXX-NUMBER".


A fool-proof solution will involve the use of the natsort module, which excels at fast natural sorting. With a little elbow-grease, we can argsort your data.

from natsort import natsorted
idx, *_ = zip(*natsorted(
    zip(df.index, df.ID, df.Time), key=lambda x: (x[1], x[2])))

df.iloc[list(idx)]

       ID  Time  oneMissing  singleValue empty  oneEmpty
0   CS1-1     1     10000.0          NaN  None       0.0
2   CS1-1     2     30000.0          NaN  None       0.0
3   CS1-2     1     10000.0          NaN  None       NaN
1   CS1-2     2     20000.0          0.0  None       0.0
5   CS1-2     3     30000.0          NaN  None       NaN
4  CS1-11     1         NaN          0.0  None       NaN

Use PyPi to install: pip install natsort.

like image 168
cs95 Avatar answered Nov 03 '25 11:11

cs95


Note: This method assumes you wish to sort numerically by X for IDs of the form ABC-X.

np.lexsort supports sorting by multiple series, and avoids having to add an extra series to your dataframe. This example sorts by the suffix of ID numerically and then by Time:

df = pd.DataFrame(data[1:], columns=data[0])

id_num = df['ID'].str.split('-').str[-1].astype(int)

df = df.iloc[np.lexsort((df['Time'], id_num))]

print(df)

       ID  Time  oneMissing  singleValue empty  oneEmpty
0   CS1-1     1     10000.0          NaN  None       0.0
2   CS1-1     2     30000.0          NaN  None       0.0
3   CS1-2     1     10000.0          NaN  None       NaN
1   CS1-2     2     20000.0          0.0  None       0.0
5   CS1-2     3     30000.0          NaN  None       NaN
4  CS1-11     1         NaN          0.0  None       NaN
like image 38
jpp Avatar answered Nov 03 '25 11:11

jpp



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!