I've got a mid-size elasticsearch index (1.46T or ~1e8 docs). It's running on 4 servers which each have 64GB Ram split evenly between elastic and the OS (for caching).
I want to try out the new "Significant terms" aggregation so I fired off the following query...
{
  "query": {
    "ids": {
      "type": "document",
      "values": [
        "xCN4T1ABZRSj6lsB3p2IMTffv9-4ztzn1R11P_NwTTc"
      ]
    }
  },
  "aggregations": {
    "Keywords": {
      "significant_terms": {
        "field": "Body"
      }
    }
  },
  "size": 0
}
Which should compare the body of the document specified with the rest of the index and find terms significant to the document that are not common in the index.
Unfortunately, this invariably results in a
ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: CircuitBreakingException[Data too large, data would be larger than limit of [25741911654] bytes];
after a minute or two and seems to imply I haven't got enough memory.
The elastic servers in question are actually VMs, so I shut down other VMs and gave each elastic instance 96GB and each OS another 96GB.
The same problem occurred (different numbers, took longer). I haven't got hardware to hand with more than 192GB of memory available so can't go higher.
Are aggregations not meant for use against the index as a whole? Am I making a mistake with regards to the query format?
There is a warning on the documentation for this aggregation about RAM use on free-text fields for very large indices [1]. On large indices it works OK for lower-cardinality fields with a smaller vocabulary (e.g. hashtags) but the combination of many free-text terms and many docs is a memory-hog. You could look at specifying a filter on the loading of FieldData cache [2] for the Body field to trim the long-tail of low-frequency terms (e.g. doc frequency <2) which would reduce RAM overheads.
I have used a variation of this algorithm before where only a sample of the top-matching docs were analysed for significant terms and this approach requires less RAM as only the top N docs are read from disk and tokenised (using TermVectors or an Analyzer). However, for now the implementation in Elasticsearch relies on a FieldData cache and looks up terms for ALL matching docs.
One more thing - when you say you want to "compare the body of the document specified" note that the usual mode of operation is to compare a set of documents against the background, not just one. All analysis is based on doc frequency counts so with a sample set of just one doc all terms will have the foreground frequency of 1 meaning you have less evidence to reinforce any analysis.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With