
How to implement a basic Analyzer in Lucene 4.2.1?

Lucene 4.2.1 does not have StandardAnalyzer, and I am not sure how to implement a basic analyzer that does not alter the source text. Any pointers?

final SimpleFSDirectory DIRECTORY = new SimpleFSDirectory(new File(ELEMENTS_INDEX_DIR));
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_42, new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {
        return null;
    }
});
IndexWriter indexWriter = new IndexWriter(DIRECTORY, indexWriterConfig);
List<ModelObject> elements = dao.getAll();
for (ModelObject element : elements) {
    Document document = new Document();
    document.add(new StringField("id", String.valueOf(element.getId()), Field.Store.YES));
    document.add(new TextField("name", element.getName(), Field.Store.YES));
    indexWriter.addDocument(document);
}
indexWriter.close();
Isaac asked Dec 04 '25 16:12


1 Answer

You have to return a TokenStreamComponents from createComponents. null is not adequate.

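However, Lucene 4.2.1 certainly does have StandardAnalyzer. In the 4.x releases it ships in the separate lucene-analyzers-common artifact (package org.apache.lucene.analysis.standard) rather than in lucene-core, so it will appear to be missing if only the core jar is on the classpath.

If you are, perhaps, referring to the changes in StandardAnalyzer in Lucene 4.x, and are looking for the old StandardAnalyzer, then you'll want ClassicAnalyzer.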

If you really want a trimmed down Analyzer that doesn't modify anything, but just tokenizes in a very simple fashion, perhaps WhitespaceAnalyzer will serve your purposes.

If you don't want it modified or tokenized at all, then KeywordAnalyzer.
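To make the difference between the last two options concrete, here is a minimal sketch that prints the terms each analyzer emits for the same input. It assumes lucene-core and lucene-analyzers-common 4.2.1 are on the classpath. The class name BuiltInAnalyzersSketch, the tokenize helper, and the sample text are made up for illustration and are not part of Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class, for illustration only.
public class BuiltInAnalyzersSketch {

    // Collects the terms an analyzer emits for the given text.
    // The field name "name" is arbitrary here; these analyzers ignore it.
    static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<String>();
        TokenStream stream = analyzer.tokenStream("name", new StringReader(text));
        try {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // required before the first incrementToken()
            while (stream.incrementToken()) {
                terms.add(term.toString());
            }
            stream.end();
        } finally {
            stream.close();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        String text = "The Quick-Brown FOX";
        // Splits on whitespace only; case and punctuation are left alone.
        System.out.println(tokenize(new WhitespaceAnalyzer(Version.LUCENE_42), text));
        // Emits the whole input as one single term.
        System.out.println(tokenize(new KeywordAnalyzer(), text));
    }
}
```

With this input, WhitespaceAnalyzer should yield three terms (The, Quick-Brown, FOX) and KeywordAnalyzer a single term containing the entire string. Either analyzer can be passed to IndexWriterConfig in place of the anonymous Analyzer from the question.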

And if you must create your very own Analyzer, as you say, then override the createComponents method and actually build and return an instance of TokenStreamComponents. If none of the four above serves your needs, I have no idea what your needs are, so I won't attempt a specific example here, but here is the example from the Analyzer docs:

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new FooTokenizer(reader);
        TokenStream filter = new FooFilter(source);
        filter = new BarFilter(filter);
        return new TokenStreamComponents(source, filter);
    }
};

By the way, TokenStreamComponents also has a single-argument constructor that takes only the Tokenizer, so the filter is optional.
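As a rough sketch of that single-argument form, the following swaps the placeholder FooTokenizer for Lucene's WhitespaceTokenizer and applies no filters, so terms are split on whitespace and otherwise left as they are. It assumes lucene-core and lucene-analyzers-common 4.2.1 on the classpath. The class name CustomAnalyzerSketch, the method names, the sample field values, and the use of an in-memory RAMDirectory instead of the question's SimpleFSDirectory are choices made for this illustration.

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Hypothetical class, for illustration only.
public class CustomAnalyzerSketch {

    // Tokenizer only, no filters: terms are split on whitespace and not modified.
    static Analyzer tokenizerOnlyAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_42, reader);
                // Single-argument constructor: the tokenizer is also the end of the chain.
                return new TokenStreamComponents(source);
            }
        };
    }

    // Indexes one sample document and returns the writer's document count.
    static int indexSample(Directory directory) throws IOException {
        IndexWriterConfig config =
                new IndexWriterConfig(Version.LUCENE_42, tokenizerOnlyAnalyzer());
        IndexWriter writer = new IndexWriter(directory, config);
        try {
            Document document = new Document();
            document.add(new StringField("id", "1", Field.Store.YES));
            document.add(new TextField("name", "Sample Element", Field.Store.YES));
            writer.addDocument(document);
            return writer.numDocs(); // includes documents not yet flushed
        } finally {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(indexSample(new RAMDirectory()));
    }
}
```

This is functionally close to what WhitespaceAnalyzer already does, so in practice the built-in class is the simpler choice. The sketch is mainly useful as a starting point if filters need to be added later through the two-argument constructor.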

femtoRgon answered Dec 06 '25 07:12