I have an index from a large corpus with several fields. Only one these fields contain text. I need to extract the unique words from the whole index based on this field. Does anyone know how I can do that with Lucene in java?
As of Lucene 7+ the above and some related links are obsolete.
Here's what's current:
// IndexReader has leaves, you'll iterate through those
int leavesCount = reader.leaves().size();
final String fieldName = "content";
for(int l = 0; l < leavesCount; l++) {
  System.out.println("l: " + l);
  // specify the field here ----------------------------->
  TermsEnum terms = reader.leaves().get(l).reader().terms(fieldName).iterator();
  // this stops at 20 just to sample the head
  for(int i = 0; i < 20; i++) {
    // and to get it out, here -->
    final Term content = new Term(fieldName, BytesRef.deepCopyOf(terms.next()));
    System.out.println("i: " + i + ", term: " + content);
  }
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With