My objective is to leverage some of Lucene's many tokenizers and filters to transform input text, but without the creation of any indexes.
For example, given this (contrived) input string...
" Someone’s - [texté] goes here, foo . "
...and a Lucene analyzer like this...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("icuFolding")
        .build();
I want to get the following output:
someone's texte goes here foo
The below Java method does what I want.
But is there a better (i.e. more typical and/or concise) way that I should be doing this?
I am specifically thinking about the way I have used TokenStream and CharTermAttribute, since I have never used them like this before. Feels clunky.
Here is the code:
Lucene 8.3.0 imports:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
My method:
private String transform(String input) throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();
    TokenStream ts = analyzer.tokenStream("myField", new StringReader(input));
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
    StringBuilder sb = new StringBuilder();
    try {
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    } finally {
        ts.close();
    }
    return sb.toString().trim();
}
I have been using this set-up for a few weeks without issue. I have not found a more concise approach. I think the code in the question is OK.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With