
Avoid duplicate documents in Elasticsearch

I parse documents from a JSON file, which are added as children of a parent document. I just post the items to the index without setting an id myself.

Sometimes the JSON is updated and items are added to it. For example, I parse 2 documents from the JSON, and a week or two later I parse the same JSON again; this time it contains 3 documents.

I found answers like "remove all children and insert all items again", but I doubt this is the solution I'm looking for.

I could compare each item to the children of my target parent and add only those documents that have no equal child.

I wondered if there is a way to let Elasticsearch handle duplicates.

Goot asked Dec 29 '25 at 01:12


1 Answer

Duplication needs to be handled in the ID handling itself. Choose a key that is unique for a document and use that as the _id. If the key is too large, or it is made up of multiple fields, create a SHA checksum out of it and use that as the _id.
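A minimal sketch of that approach, assuming an item is identified by its parent id plus a couple of its own fields (the field names `name` and `value` are made up; substitute whatever makes your documents unique):

```python
import hashlib


def doc_id(parent_id: str, item: dict, key_fields=("name", "value")) -> str:
    """Build a deterministic _id from the fields that identify an item.

    The same input always produces the same id, so re-posting an item
    with PUT /<index>/_doc/<id> overwrites instead of duplicating.
    """
    raw = "|".join([parent_id] + [str(item.get(f, "")) for f in key_fields])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


item = {"name": "widget", "value": 42}
print(doc_id("parent-1", item))  # 64-char hex digest, stable across runs
```

Indexing with an explicit `_id` (PUT instead of a bare POST) turns a re-parse of the same JSON into an idempotent upsert, which is exactly what the question asks for.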

If you already have duplicates in the database, you can use a terms aggregation with a nested top_hits aggregation to detect them.

You can read more about this approach here.
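A sketch of such a query as a Python dict. The aggregation names `dup_key`/`dup_docs` and the field name `checksum` are assumptions; point the terms aggregation at whatever field uniquely identifies your items (the checksum field works well if you store it):

```python
# Buckets documents by a unique key and returns only buckets that
# contain more than one document, i.e. the duplicates.
duplicate_query = {
    "size": 0,  # we only want the aggregation buckets, not search hits
    "aggs": {
        "dup_key": {
            "terms": {
                "field": "checksum",
                "min_doc_count": 2,  # keep only keys that occur twice or more
                "size": 100,
            },
            "aggs": {
                # top_hits exposes the actual duplicate documents per bucket
                "dup_docs": {"top_hits": {"size": 5, "_source": False}}
            },
        }
    },
}
```

Sent to the `_search` endpoint, each `dup_key` bucket then lists a duplicated key together with the ids of the documents that share it.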

Vineeth Mohan answered Dec 31 '25 at 13:12


