I am using Solr spellcheck for the Russian language. When you type in Cyrillic characters, everything is OK, but it doesn't work when you type in Latin characters.
I want spellcheck to correct the input both when it is typed in Cyrillic characters and when it is typed in Latin characters, and in both cases the correction should be Cyrillic text.
For example, when you type:
телевидениеее or televidenieee
It should correct to:
телевидение
schema.xml:
<fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="3" max="256" />
    </analyzer>
</fieldType>
solrconfig.xml:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <str name="buildOnCommit">true</str>
        <str name="buildOnOptimize">true</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="accuracy">0.75</str>
    </lst>
    <lst name="spellchecker">
        <str name="name">wordbreak</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.WordBreakSolrSpellChecker</str>
        <str name="combineWords">false</str>
        <str name="breakWords">true</str>
        <int name="maxChanges">1</int>
    </lst>
</searchComponent>
Thanks for the help.
It can be achieved with ICUTransformFilterFactory, which transliterates the input query into Cyrillic each time a query is analyzed.
Here is an example of how to enable this functionality:
Enable the ICU analyzers (lucene-analyzers-icu-*.jar, icu4j-*.jar):
These libraries can be found in the contrib/analysis-extras folder of the Solr distribution from the official site (they are also available via Maven).
In solrconfig.xml, add something like the following to load them. You can also use a single lib dir that holds all the jars you need; this example just uses the default location relative to the example/solr/collection1/conf folder of the official distribution:
<lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
Split the analyzer of the spell_text field type into two separate analyzers, one for index and one for query.
Add solr.ICUTransformFilterFactory to the query analyzer with the following id: Any-Cyrillic; NFD; [^\p{Alnum}] Remove
<fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.LengthFilterFactory" min="3" max="256" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.LengthFilterFactory" min="3" max="256" />
    <filter class="solr.ICUTransformFilterFactory" id="Any-Cyrillic; NFD; [^\p{Alnum}] Remove" />
  </analyzer>
</fieldType>
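If Latin input is still not corrected after this change, one thing worth checking is which analyzer the spellcheck component applies to the query. The component has a queryAnalyzerFieldType setting, and in some Solr versions the query analyzer of spell_text is only used when that setting names it. A minimal sketch, assuming the component from the question; whether the extra line is required in your Solr version is an assumption to verify:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <!-- Assumption: analyze the spellcheck query with the query analyzer of spell_text -->
    <str name="queryAnalyzerFieldType">spell_text</str>
    <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spellcheck</str>
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <str name="buildOnCommit">true</str>
        <str name="buildOnOptimize">true</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="accuracy">0.75</str>
    </lst>
    <!-- The wordbreak spellchecker from the question stays here unchanged -->
</searchComponent>
```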
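Regarding the ICUTransformFilterFactory id, Any-Cyrillic; NFD; [^\p{Alnum}] Remove is a compound ICU transform whose three steps are applied in order: Any-Cyrillic transliterates the input into Cyrillic script, NFD applies Unicode canonical decomposition, and [^\p{Alnum}] Remove deletes every character that is not alphanumeric.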
The configuration described above works on my local machine the same way for Russian transliterations and for Russian words.
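To try this out, the spellcheck component has to be attached to a request handler. The following is only a sketch: the handler name /spell and the spellcheck.count value are assumptions that do not come from the question, and the dictionary names are the ones defined in the question's solrconfig.xml.

```xml
<requestHandler name="/spell" class="solr.SearchHandler">
    <lst name="defaults">
        <!-- Turn spellchecking on for every request to this handler -->
        <str name="spellcheck">true</str>
        <!-- Use both spellcheckers defined in the question -->
        <str name="spellcheck.dictionary">default</str>
        <str name="spellcheck.dictionary">wordbreak</str>
        <!-- Hypothetical value: maximum number of suggestions per term -->
        <str name="spellcheck.count">5</str>
        <str name="spellcheck.collate">true</str>
    </lst>
    <arr name="last-components">
        <str>spellcheck</str>
    </arr>
</requestHandler>
```

With such a handler in place, querying it once with q=телевидениеее and once with q=televidenieee is a quick way to check whether both inputs lead to the same Cyrillic suggestion.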