Diacritic-insensitive sorting with polars in python

Question

My background is closer to linguistics than programming. What I am looking for is a way to make .sort() insensitive to diacritics (accent marks such as á or ô). Currently, when I sort, I get "aeiou" before "áéíóú", when I want "[aá][eé][ií][oó][uú]" or "aáeéiíoóuú". This is my code, though I doubt it's relevant to the question (the locale call was an attempt to fix this, suggested by Google's Gemini):

import os
import locale
import polars as pl

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')  # set the locale to en_US.UTF-8

# print the current working directory
print("Current working directory:", os.getcwd())

# Make paths for the lexicon and the sorted lexicon
LexiconPath = r'C:\Users\mawan\Documents\Code\WorldBuildingCode\KalagyonManNyal_Lexicon.csv'
SortedLexiconPath = r'C:\Users\mawan\Documents\Code\WorldBuildingCode\KalagyonManNyal_Lexicon_Sorted.csv'

# read the csv file named KalagyonManNyal_Lexicon and make it a DataFrame
df = pl.read_csv(LexiconPath)

# fill null values with a dash (fill_null returns a new DataFrame, so reassign it)
df = df.fill_null("-")
# print the DataFrame
print(df)

# sort the DataFrame by the column "Lexeme" in alphabetical order
df_sorted = df.sort("Lexeme")

print(df_sorted)

# write the sorted DataFrame to a new csv file (write_csv returns None when given a path)
df_sorted.write_csv(SortedLexiconPath, include_bom=True)

# print(df.filter(df.is_duplicated())) # print duplicated rows
# print(df.columns)

# make a dataframe that filters by the part of speech input by the user

PoS = input("Enter the part of speech you want to filter by: ")

df_PoS = df.filter(df['PoS'] == PoS)
print(df_PoS)

So what I am looking for is either diacritic-insensitive sorting or custom sorting instructions, so that I can make some diacritics count as separate letters while others don't.

I tried using locale.setlocale(locale.LC_ALL, 'en_US.UTF-8') after Gemini, the generative AI, suggested it, but it didn't seem to change anything.

furas · Accepted Answer

At first I thought that en_US can't work because it is American English, which doesn't use the characters áéíóú, so it may not sort them. But I tested it with pl_PL.UTF-8 and the Polish characters ąęść, and that doesn't work either.

Finally, in the documentation for sorting (section "Comparison Functions"), I found a method which works for me (even with en_US):

import locale
import functools

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

words = sorted(words, key=functools.cmp_to_key(locale.strcoll))

It also sorts lower-case and upper-case characters differently: a A b B instead of A B a b.

EDIT:

And for pandas I found

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

df.sort_values('words', key=lambda col: col.map(locale.strxfrm))  # 'words' is the column to sort by

because it needs a function which (as you can read in the doc for pandas.DataFrame.sort_values()):

it should be vectorized. It should expect a Series and return a Series with the same shape as the input.

Maybe it will also work with polars.

It also works with the standard sorted():

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

words = sorted(words, key=locale.strxfrm)

Code used for the tests. The last sort gives the expected result.

import locale
import functools

words = 'ą a b ć c d A B C'.split()

print('before setlocal (no key)  :', sorted(words))
print('before setlocal (with key):', sorted(words, key=functools.cmp_to_key(locale.strcoll)))

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

print(' after setlocal (no key)  :', sorted(words))
print(' after setlocal (with key):', sorted(words, key=functools.cmp_to_key(locale.strcoll)))

Result:

before setlocal (no key)  : ['A', 'B', 'C', 'a', 'b', 'c', 'd', 'ą', 'ć']
before setlocal (with key): ['A', 'B', 'C', 'a', 'b', 'c', 'd', 'ą', 'ć']
 after setlocal (no key)  : ['A', 'B', 'C', 'a', 'b', 'c', 'd', 'ą', 'ć']
 after setlocal (with key): ['a', 'A', 'ą', 'b', 'B', 'c', 'C', 'ć', 'd']

EDIT:

Code for pandas

import locale
import pandas as pd

words = 'ą a b ć c d A B C'.split()
df = pd.DataFrame(words, columns=['words'])

print(df.sort_values('words'))
print(df.sort_values('words', key=lambda col:col.map(locale.strxfrm)))

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

print(df.sort_values('words'))
print(df.sort_values('words', key=lambda col:col.map(locale.strxfrm)))

EDIT:

This code doesn't use key= in sort(). Instead it creates a column holding the locale.strxfrm() values, sorts the rows by that column, and then drops it. Maybe this will work with polars, which doesn't have key in sort().

BTW: locale.strxfrm() creates strange strings - I see Chinese characters :) - but it works.

import locale
import pandas as pd

words = 'ą a b ć c d A B C'.split()
df = pd.DataFrame(words, columns=['words'])

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

df['xfrm'] = df['words'].apply(locale.strxfrm)
df = df.sort_values('xfrm')
#print(df)
df = df.drop('xfrm', axis=1)
print(df)

EDIT:

Useful information from @Andj's comment:

Fortunately en_US.UTF-8 is a safe locale. For other locales the order may differ across operating systems. For instance, fr_FR.UTF-8 will sort very differently on Alpine Linux and macOS compared to Debian and Windows. For pandas there are a few ways of handling things, depending on what you want to achieve. A while ago I threw together some rough notes on the different approaches I was using at the time: https://github.com/enabling-languages/python-i18n/blob/main/notebooks%2Fsorting_pandas.ipynb

Tags: python, sorting, diacritics, accent-insensitive, python-polars

mavmav0
