
Avoid duplicate documents in Elasticsearch

I parse documents from a JSON file, which are added as children of a parent document. I just post the items to the index without setting an id myself.

Sometimes the JSON is updated and items are added to it. For example, I parse 2 documents from the JSON, and a week or two later I parse the same JSON again; this time it contains 3 documents.

I found answers like "remove all children and insert all items again", but I doubt this is the solution I'm looking for.

I could compare each item to the children of my target parent and add only those documents that have no equal child.

I wondered if there is a way to let Elasticsearch handle duplicates.

Goot asked Dec 29 '25 at 01:12


1 Answer

Duplication needs to be handled in the ID handling itself. Choose a key that is unique for a document and use that as the _id. If the key is too large, or it is made up of multiple fields, create a SHA checksum out of it and use that as the _id.
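A minimal sketch of that approach, assuming an item is identified by its parent id plus a couple of its own fields (the field names `name` and `value` are made up; substitute whatever makes your documents unique):

```python
import hashlib


def doc_id(parent_id: str, item: dict, key_fields=("name", "value")) -> str:
    """Build a deterministic _id from the fields that identify an item.

    The same input always produces the same id, so re-posting an item
    with PUT /<index>/_doc/<id> overwrites instead of duplicating.
    """
    raw = "|".join([parent_id] + [str(item.get(f, "")) for f in key_fields])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


item = {"name": "widget", "value": 42}
print(doc_id("parent-1", item))  # 64-char hex digest, stable across runs
```

Indexing with an explicit `_id` (PUT instead of a bare POST) turns a re-parse of the same JSON into an idempotent upsert, which is exactly what the question asks for.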

If you already have duplicates in the database, you can use a terms aggregation with a nested top_hits aggregation to detect them.

You can read more about this approach here.
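A sketch of such a query as a Python dict. The aggregation names `dup_key`/`dup_docs` and the field name `checksum` are assumptions; point the terms aggregation at whatever field uniquely identifies your items (the checksum field works well if you store it):

```python
# Buckets documents by a unique key and returns only buckets that
# contain more than one document, i.e. the duplicates.
duplicate_query = {
    "size": 0,  # we only want the aggregation buckets, not search hits
    "aggs": {
        "dup_key": {
            "terms": {
                "field": "checksum",
                "min_doc_count": 2,  # keep only keys that occur twice or more
                "size": 100,
            },
            "aggs": {
                # top_hits exposes the actual duplicate documents per bucket
                "dup_docs": {"top_hits": {"size": 5, "_source": False}}
            },
        }
    },
}
```

Sent to the `_search` endpoint, each `dup_key` bucket then lists a duplicated key together with the ids of the documents that share it.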

Vineeth Mohan answered Dec 31 '25 at 13:12


