Find distinct inner objects in Elasticsearch

We're trying to find distinct inner objects in Elasticsearch. Here is a minimal example of our case. We're stuck with something like the following mapping (changing types or indices or adding new fields wouldn't be a problem, but the structure should remain as it is):

{
  "building": {
    "properties": {
      "street": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "house number": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "city": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "people": {
        "type": "object",
        "store": "yes",
        "index": "not_analyzed",
        "properties": {
          "firstName": {
            "type": "string",
            "store": "yes",
            "index": "not_analyzed"
          },
          "lastName": {
            "type": "string",
            "store": "yes",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}

Assuming we have this example data:

{
  "buildings": [
    {
      "street": "Baker Street",
      "house number": "221 B",
      "city": "London",
      "people": [
        {
          "firstName": "John",
          "lastName": "Doe"
        },
        {
          "firstName": "Jane",
          "lastName": "Doe"
        }
      ]
    },
    {
      "street": "Baker Street",
      "house number": "5",
      "city": "London",
      "people": [
        {
          "firstName": "John",
          "lastName": "Doe"
        }
      ]
    },
    {
      "street": "Garden Street",
      "house number": "1",
      "city": "London",
      "people": [
        {
          "firstName": "Jane",
          "lastName": "Smith"
        }
      ]
    }
  ]
}

When we query for the street "Baker Street" (plus whatever additional options are needed), we expect to get the following list:

[
    {
      "firstName": "John",
      "lastName": "Doe"
    },
    {
      "firstName": "Jane",
      "lastName": "Doe"
    }
]

The format does not matter too much, but we should be able to parse the first and last name. However, since our actual data set is much larger, we need the entries to be distinct.

We are using Elasticsearch 1.7.

soniro asked Nov 05 '25 16:11


1 Answer

We finally solved our problem.

Our solution is (as we expected) a precalculated people_all field. Instead of using copy_to or transform, we simply write it at import time, together with the other fields. The mapping looks as follows (people is mapped as nested, which the nested aggregation further down relies on):

"people": {
  "type": "nested",
  ..
  "properties": {
    "firstName": {
      "type": "string",
      "store": "yes",
      "index": "not_analyzed"
    },
    "lastName": {
      "type": "string",
      "store": "yes",
      "index": "not_analyzed"
    },
    "people_all": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
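
The import step could fill in the new field along the lines of the following sketch. It assumes that people_all is just the first and last name joined by a single space, which matches the "John Doe" style keys in the response further down. The helper name add_people_all is hypothetical and not part of any Elasticsearch client.

```python
import copy


def add_people_all(building):
    """Return a copy of a building document with people_all set on each person.

    Assumption: people_all is "<firstName> <lastName>", joined by one space.
    """
    doc = copy.deepcopy(building)  # leave the caller's document untouched
    for person in doc.get("people", []):
        person["people_all"] = "{} {}".format(
            person["firstName"], person["lastName"]
        )
    return doc


building = {
    "street": "Baker Street",
    "house number": "221 B",
    "city": "London",
    "people": [
        {"firstName": "John", "lastName": "Doe"},
        {"firstName": "Jane", "lastName": "Doe"},
    ],
}

enriched = add_people_all(building)
print([p["people_all"] for p in enriched["people"]])
```

The enriched document is what would then be indexed in place of the original one.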

Please pay attention to the "index": "not_analyzed" setting on the people_all field. It is needed to get complete buckets. Without it, the value is analyzed into separate terms and our example returns 3 buckets: "john", "jane" and "doe".
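
To see where the three buckets come from, here is a rough simulation in plain Python. It does not call Elasticsearch: the analyzed case is approximated by lowercasing and splitting on whitespace, which is an assumption that holds for simple names like these but is not the real analyzer. All function names are made up for the illustration.

```python
from collections import Counter

# people_all values of the three nested people living on Baker Street
values = ["John Doe", "Jane Doe", "John Doe"]


def analyzed_terms(value):
    # Rough stand-in for an analyzed string field: lowercase, split on whitespace.
    return value.lower().split()


def not_analyzed_terms(value):
    # A not_analyzed field keeps the whole value as a single term.
    return [value]


def bucket_counts(values, to_terms):
    # Count how many values contain each term; a value counts once per term.
    counts = Counter()
    for value in values:
        counts.update(set(to_terms(value)))
    return dict(counts)


print(bucket_counts(values, analyzed_terms))
print(bucket_counts(values, not_analyzed_terms))
```

In the analyzed case the first and last names end up in separate buckets, so the pairing between them is lost; in the not_analyzed case each bucket is one complete person.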

After writing this new field we can run an aggregation as follows:

{
  "size": 0,
  "query": {
    "term": {
      "street": "Baker Street"
    }
  },
  "aggs": {
    "people_distinct": {
      "nested": {
        "path": "people"
      },
      "aggs": {
        "people_all_distinct": {
          "terms": {
            "field": "people.people_all",
            "size": 0
          }
        }
      }
    }
  }
}

And we get the following response:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "people_distinct": {
      "doc_count": 3,
      "people_all_distinct": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "John Doe",
            "doc_count": 2
          },
          {
            "key": "Jane Doe",
            "doc_count": 1
          }
        ]
      }
    }
  }
}

From the buckets in the response we can now build the distinct people objects.
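
Turning the buckets back into people objects might look like the sketch below. It assumes each bucket key is "<firstName> <lastName>" and that the last space separates the two, so it would misplace parts of a last name that itself contains a space. The function name people_from_buckets is hypothetical.

```python
def people_from_buckets(buckets):
    """Convert terms-aggregation buckets into a list of people objects."""
    people = []
    for bucket in buckets:
        # Assumption: the last space in the key separates first and last name.
        first_name, _, last_name = bucket["key"].rpartition(" ")
        people.append({"firstName": first_name, "lastName": last_name})
    return people


# The "buckets" list taken from the response above
buckets = [
    {"key": "John Doe", "doc_count": 2},
    {"key": "Jane Doe", "doc_count": 1},
]

people = people_from_buckets(buckets)
print(people)
```

If names with spaces matter, one option would be to write people_all with a separator that cannot occur in a name and split on that instead.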

Please let us know if there is a better way to reach our goal. Parsing the buckets is not an optimal solution; it would be nicer to have the fields firstName and lastName in each bucket.

soniro answered Nov 07 '25 09:11


