Are there any recommendations for using analyzers/filters to index and search human names?
Examples of names that might pose difficulties:
thx Marc
Here's an analyzer and filter to get you started. It's hard to cover all the cases, but an asciifolding filter will solve the François versus Francois case.
In the example below, preserve_original is set to true, so the filter emits both the original token and its ASCII-folded form. A query for either François or Francois will then resolve to the same result set.
"analyzer": {
  "name_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "trim",
      "my_ascii_folding"
    ]
  }
},
"filter": {
  "my_ascii_folding": {
    "type": "asciifolding",
    "preserve_original": true
  }
}
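To see why preserve_original makes both spellings match, here is a minimal Python sketch of the idea. It is not Elasticsearch's implementation: fold_ascii and analyze_name are hypothetical helpers, whitespace splitting stands in for the standard tokenizer, and stripping Unicode combining marks only approximates the asciifolding filter's mapping table.

```python
import unicodedata
from typing import List


def fold_ascii(token: str) -> str:
    """Approximate ASCII folding by dropping combining marks (assumption:
    this covers simple accents only, not the full asciifolding table)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze_name(text: str, preserve_original: bool = True) -> List[str]:
    """Lowercase, trim, and fold each whitespace-separated token.

    When preserve_original is true and folding changed the token,
    both the folded and the original form are emitted.
    """
    tokens = []
    for raw in text.split():
        token = raw.strip().lower()
        folded = fold_ascii(token)
        tokens.append(folded)
        if preserve_original and folded != token:
            tokens.append(token)
    return tokens


if __name__ == "__main__":
    print(analyze_name("François"))
    print(analyze_name("Francois"))
```

Both inputs produce the token francois, which is what lets a query for either spelling hit documents indexed under the other.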
You can also define a synonym filter with a list of names that are commonly treated as equivalent in your language (for example, a line like François => Francois in your synonyms file). That will do the trick in the short run.
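As a rough sketch of how such a synonym filter might be wired in, the Python snippet below builds the analysis section of the index settings as a plain dict. The names name_synonyms and name_synonym_analyzer are made up for illustration, and the rule is given inline through the synonyms parameter instead of a file (synonyms_path is the file-based alternative). The rule is written in lowercase on the assumption that the filter runs after lowercase.

```python
import json


def build_synonym_settings() -> dict:
    """Return an analysis block with a hypothetical synonym filter for names."""
    return {
        "analysis": {
            "filter": {
                # Hypothetical filter name; the rule comes from the example above.
                "name_synonyms": {
                    "type": "synonym",
                    "synonyms": ["françois => francois"],
                }
            },
            "analyzer": {
                # Hypothetical analyzer name; synonyms are applied after lowercasing.
                "name_synonym_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "name_synonyms"],
                }
            },
        }
    }


if __name__ == "__main__":
    # ensure_ascii=False keeps the accented rule readable in the output.
    print(json.dumps(build_synonym_settings(), indent=2, ensure_ascii=False))
```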
A pattern_replace char filter with a "([A-Za-z]+)ae([A-Za-z]+)" => "$1a$2" pattern can turn every Verhaeven into Verhaven. Char filters run before the tokenizer, and the filter only takes effect once it is listed in the analyzer's "char_filter" array.
Something like...
"char_filter": {
  "ae_char_filter": {
    "type": "pattern_replace",
    "pattern": "([A-Za-z]+)ae([A-Za-z]+)",
    "replacement": "$1a$2"
  }
}
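The effect of that pattern can be checked outside Elasticsearch. The sketch below applies the same regular expression with Python's re module. collapse_ae is a hypothetical helper name, and Python writes the group references as \1 and \2 where the filter's replacement uses $1 and $2.

```python
import re

# Same pattern as the ae_char_filter above.
AE_PATTERN = re.compile(r"([A-Za-z]+)ae([A-Za-z]+)")


def collapse_ae(text: str) -> str:
    """Replace an 'ae' inside a word with 'a', keeping the surrounding letters."""
    return AE_PATTERN.sub(r"\1a\2", text)


if __name__ == "__main__":
    for name in ("Verhaeven", "Verhaven", "Michael"):
        print(name, "->", collapse_ae(name))
```

Note that the pattern is not specific to Dutch surnames: it also rewrites a name like Michael to Michal. Because the same analyzer is normally applied at index and search time, matching stays consistent, but it is worth knowing about.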
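A pattern_replace char filter can also help with Peter de Langhe versus Peter delange. The pattern below joins the particle to the surname, so "Peter de Langhe" becomes "Peter deLanghe" (delanghe after lowercasing). Note that this only handles the spacing; the remaining difference between delanghe and delange still needs separate handling.
"char_filter": {
  "de_char_filter": {
    "type": "pattern_replace",
    "pattern": "([A-Za-z]+) de ([A-Za-z]+)",
    "replacement": "$1 de$2"
  }
}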
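Here is a small Python sketch of what this char filter does to the input before tokenization. join_particle is a hypothetical helper, whitespace splitting plus lower() stands in for the standard tokenizer and lowercase filter, and \1 and \2 are Python's equivalents of $1 and $2.

```python
import re

# Same pattern as the de_char_filter above.
DE_PATTERN = re.compile(r"([A-Za-z]+) de ([A-Za-z]+)")


def join_particle(text: str) -> str:
    """Attach a standalone 'de' to the word that follows it."""
    return DE_PATTERN.sub(r"\1 de\2", text)


def tokens(text: str) -> list:
    """Apply the char filter, then lowercase and split on whitespace."""
    return join_particle(text).lower().split()


if __name__ == "__main__":
    print(tokens("Peter de Langhe"))
    print(tokens("Peter delange"))
```

The first input yields the tokens peter and delanghe, the second peter and delange, so the particle is joined but the surnames still differ by one letter.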