Problem:
If I search for "iphone" I get 400 product results, and the product category aggregation I have returns the top 3 categories in the result set.
Those categories would include smartphones, phone cases and mobile phone accessories.
If I search for "iphone 6" I get 1400 results, because the extra "6" matches more products. The product category aggregation now returns the top 3 categories across all those results.
The top 3 product categories will now be everything from cables to computer monitors.
What I need to do is get the top 3 categories for the top 100 results.
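To make the goal concrete, here is a minimal client-side sketch in Python of the computation being asked for. It assumes the hits have already been fetched as a list of dicts; the `score` and `product_category` keys and the `top_categories` helper are hypothetical names for illustration, not part of any Elasticsearch API.

```python
from collections import Counter


def top_categories(hits, top_n_docs=100, top_n_categories=3):
    """Count categories among the best-scoring hits only (hypothetical helper)."""
    # Sort by relevance score, highest first, and keep only the top slice.
    best = sorted(hits, key=lambda h: h["score"], reverse=True)[:top_n_docs]
    counts = Counter(h["product_category"] for h in best)
    # most_common returns (category, count) pairs, most frequent first.
    return counts.most_common(top_n_categories)


# Made-up sample data: a few relevant hits and a long tail of weak matches.
hits = [
    {"score": 9.1, "product_category": "smartphones"},
    {"score": 8.7, "product_category": "phone cases"},
    {"score": 8.5, "product_category": "smartphones"},
    {"score": 1.2, "product_category": "cables"},
    {"score": 1.1, "product_category": "cables"},
    {"score": 1.0, "product_category": "cables"},
]

result = top_categories(hits, top_n_docs=3, top_n_categories=2)
```

Counting over all six hits instead of the top three would put "cables" first, which is exactly the effect described above.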
What I've tried:
I've tried using the top_hits aggregation within the top category aggregation but that only returns the top products in each category.
Something like this:
{
    "aggs": {
        "product_categories": {
            "terms": {
                "field": "product_category",
                "size": 10
            },
            "aggs": {
                "top-categories": {
                    "top_hits": {
                        "size": 3
                    }
                }
            }
        }
    }
}
I've also tried creating a top_hits aggregation with a sub-aggregation within to get the top categories but that doesn't work either.
{
    "aggs": {
        "top-categories": {
            "top_hits": {
                "size": 100
            },
            "aggs": {
                "product_categories": {
                    "terms": {
                        "field": "product_category",
                        "size": 3
                    }
                }
            }
        }
    }
}
Can anyone help me with this problem?
You could try using a filter aggregation based on a limit filter, and nest your terms aggregation in it.
Be aware that the limit is applied at shard level (see the documentation), so the aggregation may see up to 100 documents per shard rather than 100 in total.
However, this should do the job for your case, with a query like:
{
  "aggs": {
    "limit_results": {
      "filter": {
        "limit": {
          "value": 100
        }
      },
      "aggs": {
        "product_categories": {
          "terms": {
            "field": "product_category",
            "size": 10
          }
        }
      }
    }
  }
}
Before I begin, please note that this is not a perfect solution to the question. However, it could definitely ease the situation, and in one special case it actually is a perfect solution.
The solution I propose sorts the terms aggregation buckets by the score of the documents they were found in. That is, the terms are no longer ordered by frequency alone but also by document score.
Here is an example request:
{
   "query": {
       "query_string": {
           "default_field": "product_title",
           "query": "iphone 6"
       }
   },
   "aggs": {
       "product_categories": {
           "terms": {
               "field": "product_category",
               "order": [
                   { "max_score": "desc" },
                   { "_count": "desc" }
               ],
               "size": 3
           },
           "aggs": {
               "max_score": {
                   "max": {
                       "script": "_score"
                   }
               }
           }
       }
   }
}
Please note the "order" property of the terms aggregation. It specifies a path to the max_score sub-aggregation, which in turn just returns the maximum of the special _score value, which exposes the score of each document hit by the query. It ALSO uses the frequency of each term, via the "_count" property in second position.
This request will give you the three terms in the product_category field that are the best in the sense of "very frequent and from highly ranked documents". I cannot say more precisely how the ranking is done. I noticed in preliminary experiments that the result does not follow document scores monotonically but may "jump over" a quite highly ranked document when it only includes terms of low frequency, which actually might be what you want for your use case. The documentation for this kind of ordering is found here: http://www.elastic.co/guide/en/elasticsearch/reference/1.x/search-aggregations-bucket-terms-aggregation.html
There is also an example in the documentation linked above for ordering by multiple criteria; it just says "The above will sort the countries buckets based on the average height among the female population and then by their doc_count in descending order". My impression was that it could be some kind of harmonic mean or similar. It is probably best to check for yourself whether the results of this approach are useful.
The special case I spoke of at the beginning is when each document has exactly one value in the requested field. In this case, you actually get the top N terms for the top N documents (N being the same in both) when you leave out the "_count" ordering.
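The special case can be illustrated with a small Python sketch that mimics ordering buckets by maximum score alone. This is a simulation on made-up data, not Elasticsearch code; it assumes every document has exactly one category, and the `categories_by_max_score` helper and the field names are hypothetical.

```python
def categories_by_max_score(docs, size=3):
    """Order categories by the best score of any document in them."""
    max_score = {}
    for doc in docs:
        cat = doc["product_category"]
        # Track the highest score seen so far for this category.
        max_score[cat] = max(max_score.get(cat, float("-inf")), doc["score"])
    # Highest max score first; frequency plays no role here.
    ranked = sorted(max_score, key=max_score.get, reverse=True)
    return ranked[:size]


# Made-up data: "cables" is the most frequent category but scores poorly.
docs = [
    {"score": 9.0, "product_category": "smartphones"},
    {"score": 8.0, "product_category": "smartphones"},
    {"score": 7.0, "product_category": "phone cases"},
    {"score": 2.0, "product_category": "cables"},
    {"score": 1.5, "product_category": "cables"},
    {"score": 1.0, "product_category": "cables"},
    {"score": 0.5, "product_category": "monitors"},
]

ranked = categories_by_max_score(docs, size=2)
```

With this ordering the categories come out in the order in which they first appear when walking down the hits by score, so a frequent but poorly scored category such as "cables" no longer pushes out the relevant ones.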
You are looking for the Sampler Aggregation. I have a similar answer at "Aggregation on top n results".
{
  "aggs": {
    "bestDocs": {
      "sampler": {
        "shard_size": 100
      },
      "aggs": {
        "product_categories": {
          "terms": {
            "field": "product_category",
            "size": 3
          }
        }
      }
    }
  }
}
It will take the top 100 docs on each shard, sorted by their scores, and then run the terms aggregation on them.
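Because shard_size is applied on each shard separately, the number of sampled documents grows with the number of shards. The following Python sketch simulates that behaviour on made-up data; it is only an approximation of what Elasticsearch does internally, and the `sampled_terms` helper and field names are hypothetical.

```python
from collections import Counter


def sampled_terms(shards, shard_size=100, size=3):
    """Mimic a sampler + terms aggregation over several shards."""
    sampled = []
    for shard_docs in shards:
        # Each shard keeps only its own best-scoring documents.
        best = sorted(shard_docs, key=lambda d: d["score"], reverse=True)
        sampled.extend(best[:shard_size])
    counts = Counter(d["product_category"] for d in sampled)
    return len(sampled), counts.most_common(size)


# Two made-up shards; the second one holds only weak matches.
shards = [
    [{"score": 9.0, "product_category": "smartphones"},
     {"score": 8.0, "product_category": "phone cases"},
     {"score": 1.0, "product_category": "cables"}],
    [{"score": 3.0, "product_category": "cables"},
     {"score": 2.0, "product_category": "cables"},
     {"score": 0.5, "product_category": "monitors"}],
]

n_docs, top = sampled_terms(shards, shard_size=2, size=2)
```

Here four documents are sampled rather than two, and the weak matches from the second shard make "cables" the top category. On an index with a single shard the sample is exactly the top shard_size documents.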