Can I get PostgreSQL to sort rows by a string column respecting the accents?
I found out that it's possible to define a custom collation having "ks" (colStrength) set to "level2", which would mean that it's accent-sensitive.
However, when I actually sort using that collation, the order seems to be accent-insensitive.
There is an extensive blog post about this by a PostgreSQL developer, so let's use the same ICU locale, like so:
CREATE TABLE test (string text);
INSERT INTO test VALUES ('bar'), ('bat'), ('bär');
CREATE COLLATION "und1" (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
CREATE COLLATION "und2" (provider = icu, deterministic = false, locale = 'und-u-ks-level2');
CREATE COLLATION "und3" (provider = icu, deterministic = false, locale = 'und-u-ks-level3');
SELECT * FROM test ORDER BY string collate "und1";
SELECT * FROM test ORDER BY string collate "und2";
SELECT * FROM test ORDER BY string collate "und3";
All three collations give me the same order: bar < bär < bat, although an accent-sensitive order would be bar < bat < bär.
Do I misunderstand the collation capabilities? Is there a way to get an accent-sensitive order?
Also, is there a way to see which options the default built-in collations use? I don't see, for example, the "ks" level in the pg_collation table data.
Yes, PostgreSQL can sort strings accent-sensitively using ICU collations, but there are a few important nuances to get it working correctly.
You're correctly using ICU collations with ks=level2, which enables accent-sensitive comparisons. However, the und locale (undetermined language) selects ICU's root collation, which may not provide the sorting behavior you're expecting. Instead of und, you can try a real language locale, such as en-u-ks-level2 for English or fr-u-ks-level2 for French, depending on the language context of your data.
CREATE COLLATION "en_level2" (provider = icu, deterministic = false, locale = 'en-u-ks-level2');
SELECT * FROM test ORDER BY string COLLATE "en_level2";
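This is meant to produce the order bar < bat < bär. Verify the result on your own data, though: ICU compares base letters first and uses accents only to break ties between strings whose base letters are identical, so bär can still sort before bat (r precedes t) even with an accent-sensitive collation.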
und doesn't work
The und locale maps to ICU's root collation rules, which carry no language-specific tailoring. Using a specific language gives ICU more context for handling locale-specific rules.
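A quick way to see whether the strength setting is taking effect, separately from the sort order, is to compare two strings for equality under each collation. This is a minimal sketch that assumes the und1 and und2 collations from the question already exist. Under the nondeterministic level1 collation the two strings would be expected to compare equal, and under level2 they would not.

```sql
-- Assumes the "und1" and "und2" collations created in the question exist.
-- COLLATE binds more tightly than =, so it applies to the right-hand literal
-- and determines the collation used for the whole comparison.
SELECT 'bar' = 'bär' COLLATE "und1" AS equal_at_level1,
       'bar' = 'bär' COLLATE "und2" AS equal_at_level2;
```

If the two columns differ, the collation is distinguishing accents, and any surprise in ORDER BY comes from ICU ranking base letters ahead of accents, not from the ks setting being ignored.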
You can list all available ICU collations with:
SELECT * FROM pg_collation WHERE collprovider = 'i';
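Unfortunately, the pg_collation catalog does not expose ICU options like ks as separate columns, but you can infer them from the locale string (stored in colllocale in PostgreSQL 17, in colliculocale in versions 15 and 16, and in collcollate in earlier versions).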
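To look at the locale strings of the collations from the question specifically, a narrower query can help. This is a sketch assuming PostgreSQL 17 or later, where the column is named colllocale; the choice of columns is only illustrative.

```sql
-- Assumes PostgreSQL 17 or later (locale string stored in colllocale).
-- On versions 15 and 16 substitute colliculocale; on older versions, collcollate.
SELECT collname,
       colllocale,            -- full ICU locale string, including any -u-ks-... options
       collisdeterministic    -- false for the nondeterministic collations above
FROM pg_collation
WHERE collprovider = 'i'      -- 'i' marks ICU-provided collations
  AND collname IN ('und1', 'und2', 'und3');
```

Any ks setting should appear as part of the locale string itself, since the catalog has no dedicated column for it.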