Sorting pandas dataframe by multiple columns ignoring case

I have a dataframe in a Python script (using pandas) that needs to be sorted by multiple columns, but letter case currently distorts the order: a and A are not treated as equal. All upper-case values are sorted first, followed by the lower-case ones. Is there an easy way to sort while ignoring case? Currently I have something like this:

df = df.sort_values(['column1', 'column2', 'column3', 'column4', 'column5', 'column6', 'column7'], ascending=[True, True, True, True, True, True, True])

Case must be ignored for all of the columns, and the values must keep their original case in the final sorted dataframe.

For example column 1 could be sorted like this (ignoring case):

Aaa
aaB
aaC
Bbb
bBc
bbD
CCc
ccd

It would also be great if the solution worked with any number of columns (no hard-coding).

E. Muuli asked Oct 15 '25 18:10


1 Answer

If you just want to sort by the lower-cased values, you could use something like this:

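import pandas as pd

def sort_naive_lowercase(df, columns, ascending=True):
    # Temporary frame that holds lower-cased copies of the sort columns
    df_temp = pd.DataFrame(index = df.index, columns=columns)

    for kol in columns:
        df_temp[kol] = df[kol].str.lower()
    # Sort the copies, then reorder the original rows (assumes a unique index)
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)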
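
As a side note that is not part of the approach above: pandas 1.1.0 and later accept a key argument in sort_values, which is applied to each sort column separately and may achieve the same result without a temporary frame. A minimal sketch, assuming string columns only; the sample frame is made up from the values in the question:

```python
import pandas as pd

# Hypothetical sample data: column1 holds the question's values, shuffled
df = pd.DataFrame({
    "column1": ["ccd", "Bbb", "aaC", "CCc", "Aaa", "bbD", "aaB", "bBc"],
    "column2": ["h", "G", "f", "E", "d", "C", "b", "A"],
})

# Any list of column names works here, so nothing is hard-coded
columns = ["column1", "column2"]

# key receives each sort column as a Series and must return a Series of the same length
result = df.sort_values(columns, key=lambda s: s.str.lower())

print(result["column1"].tolist())
# ['Aaa', 'aaB', 'aaC', 'Bbb', 'bBc', 'bbD', 'CCc', 'ccd']
```

The key function only affects the comparison, so the values in the returned frame keep their original case.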

If you expect Unicode problems, you might do something like this (borrowing from @nick-hale's comment):

def sort_by_caseless_columns(df, columns, ascending=True):
    # https://stackoverflow.com/a/29247821/1562285
    import unicodedata

    def normalize_caseless(text):
        # casefold() is a more aggressive lower(); NFKD unifies equivalent character forms
        return unicodedata.normalize("NFKD", text.casefold())
    df_temp = pd.DataFrame(index = df.index, columns=columns)

    for kol in columns:
        # apply() calls the helper on every value, so the column must contain only strings
        df_temp[kol] = df[kol].apply(normalize_caseless)
    new_index = df_temp.sort_values(columns, ascending=ascending).index
    return df.reindex(new_index)
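
To illustrate the kind of Unicode problem meant here, the short sketch below compares lower() with the casefold-plus-NFKD helper on a few made-up German words. The word list is only an example, not data from the question:

```python
import unicodedata

def normalize_caseless(text):
    # Same helper as in the answer: casefold first, then NFKD-normalize
    return unicodedata.normalize("NFKD", text.casefold())

words = ["Straße", "STRASSE", "strasse"]

# lower() keeps the "ß", so the first word still differs from the other two
print([w.lower() for w in words])
# ['straße', 'strasse', 'strasse']

# casefold() expands "ß" to "ss", so all three compare equal
print([normalize_caseless(w) for w in words])
# ['strasse', 'strasse', 'strasse']
```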

If you need to pass more arguments to sort_values, you can use **kwargs.
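
A minimal sketch of that idea, assuming the extra arguments are simply forwarded to sort_values. The function name sort_lowercase_kwargs and the sample frame are hypothetical:

```python
import pandas as pd

def sort_lowercase_kwargs(df, columns, **kwargs):
    # Hypothetical variant of sort_naive_lowercase: every extra keyword
    # argument (ascending, na_position, ...) is passed on to sort_values
    df_temp = pd.DataFrame(index=df.index, columns=columns)
    for kol in columns:
        df_temp[kol] = df[kol].str.lower()
    new_index = df_temp.sort_values(columns, **kwargs).index
    return df.reindex(new_index)

# Made-up data with one missing value
df = pd.DataFrame({"name": ["bob", "Alice", None, "Carol"]})

# Descending order, with the missing value placed first
result = sort_lowercase_kwargs(df, ["name"], ascending=False, na_position="first")
print(list(result.index))
```

Because the keyword arguments are forwarded unchanged, the helper does not need to know in advance which sort_values options the caller wants.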

If some of the columns are numerical rather than strings, you might have to include an additional mask or set so that only the non-numerical columns are lower-cased.
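
One possible way to do that is to check each column's dtype and lower-case only the non-numerical ones. This is a sketch under the assumption that every non-numerical sort column holds strings; the function name sort_caseless_mixed and the sample frame are hypothetical:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def sort_caseless_mixed(df, columns, **kwargs):
    # Hypothetical helper: numerical columns are compared as-is,
    # all other columns are assumed to be strings and are lower-cased
    df_temp = pd.DataFrame(index=df.index)
    for kol in columns:
        if is_numeric_dtype(df[kol]):
            df_temp[kol] = df[kol]
        else:
            df_temp[kol] = df[kol].str.lower()
    new_index = df_temp.sort_values(columns, **kwargs).index
    return df.reindex(new_index)

# Made-up data: one string column and one integer column
df = pd.DataFrame({"name": ["b", "A", "a", "B"], "rank": [2, 2, 1, 1]})

result = sort_caseless_mixed(df, ["name", "rank"])
print(result["name"].tolist())
# ['a', 'A', 'B', 'b']
print(result["rank"].tolist())
# [1, 2, 1, 2]
```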

Maarten Fabré answered Oct 17 '25 06:10


