
Using Lucene Analyzer Without Indexing - Is My Approach Reasonable?

Tags: java, lucene

My objective is to use some of Lucene's many tokenizers and filters to transform input text, without creating any indexes.

For example, given this (contrived) input string...

" Someone’s - [texté] goes here, foo . "

...and a Lucene analyzer like this...

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("icuFolding")
        .build();

I want to get the following output:

someone's texte goes here foo

The below Java method does what I want.

But is there a better (i.e. more typical and/or concise) way that I should be doing this?

I am specifically thinking about the way I have used TokenStream and CharTermAttribute, since I have never used them like this before. Feels clunky.

Here is the code:

Imports (Lucene 8.3.0, plus the two java.io classes the method uses):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

My method:

private String transform(String input) throws IOException {

    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();

    TokenStream ts = analyzer.tokenStream("myField", new StringReader(input));
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);

    StringBuilder sb = new StringBuilder();
    try {
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    } finally {
        ts.close();
    }
    return sb.toString().trim();
}
andrewJames asked Sep 12 '25 10:09



1 Answer

I have been using this set-up for a few weeks without issue. I have not found a more concise approach. I think the code in the question is OK.
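For anyone who wants to tidy it up a little, here is a minimal sketch of the same reset / incrementToken / end / close workflow. It is not a different API, just a rearrangement. The assumptions: the class name TextTransformer is made up for illustration, the analyzer is built once in the constructor instead of on every call, try-with-resources replaces the explicit finally block, and the tokenStream(String, String) overload replaces the StringReader. It also assumes the Lucene ICU analysis module is on the classpath, since the "icu" tokenizer and "icuFolding" filter come from there.

```java
import java.io.IOException;
import java.util.StringJoiner;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical wrapper class; the name is an assumption, not part of Lucene.
public final class TextTransformer implements AutoCloseable {

    private final Analyzer analyzer;

    public TextTransformer() throws IOException {
        // Same analysis chain as in the question, built once and reused.
        analyzer = CustomAnalyzer.builder()
                .withTokenizer("icu")
                .addTokenFilter("lowercase")
                .addTokenFilter("icuFolding")
                .build();
    }

    public String transform(String input) throws IOException {
        // StringJoiner inserts the separator only between tokens,
        // so no trailing space needs to be trimmed afterwards.
        StringJoiner joined = new StringJoiner(" ");
        // The field name is arbitrary here because nothing is indexed.
        try (TokenStream ts = analyzer.tokenStream("myField", input)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // required before the first incrementToken()
            while (ts.incrementToken()) {
                joined.add(term.toString());
            }
            ts.end();                        // finish end-of-stream processing
        }                                    // close() is called automatically
        return joined.toString();
    }

    @Override
    public void close() {
        analyzer.close();
    }
}
```

Usage would be along the lines of `try (TextTransformer t = new TextTransformer()) { String out = t.transform(input); }`. The token loop itself is unchanged, so this is mainly a matter of taste.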

andrewJames answered Sep 14 '25 00:09
