There is a huge collection with 600,000 documents. Unfortunately, it contains duplicates that I want to find. These duplicates differ only in the case of the first letter (upper vs. lower).
{ key: "Find me" },
{ key: "find me" },
{ key: "Don't find me" }, // just one document for this string
{ key: "don't find me either" } // just one document for this string
Now I want to get all duplicates, meaning every string for which both an uppercase AND a lowercase variant exist.
In MongoDB, there is a $toLower transformation available that you can use.
Here is a way to output every key appearing more than once (replace db.collection with the name of your collection):
db.collection.aggregate([
  { $group:
    {
      _id: { $toLower: "$key" },
      cnt: { "$sum": 1 }
    }
  },
  { $match:
    { cnt: { $gt: 1 } }
  }
])
First, the $group stage groups the documents by the lowercased key, so the comparison is case-insensitive (for the whole string, not only the first letter). The number of documents for each key is accumulated in cnt. After the $group, you end up with something like:
{"_id": "find me", "cnt": 2}
{"_id": "other key", "cnt": 1}
...
Then, the $match stage filters those results, retaining only the ones with a cnt greater than 1.
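To make the two stages concrete, here is a minimal in-memory sketch in plain JavaScript that imitates the pipeline on the sample documents from the question. The function name findCaseDuplicates is hypothetical and is not part of MongoDB; it only mimics what $group and $match compute, and it assumes every document has a string key.

```javascript
// Hypothetical helper: imitates the $group + $match pipeline in memory.
// Assumes every document has a string `key` field.
function findCaseDuplicates(docs) {
  // $group stage: count documents per lowercased key.
  const counts = new Map();
  for (const doc of docs) {
    const id = doc.key.toLowerCase();
    counts.set(id, (counts.get(id) || 0) + 1);
  }

  // $match stage: keep only the groups with more than one document.
  const result = [];
  for (const [id, cnt] of counts) {
    if (cnt > 1) {
      result.push({ _id: id, cnt: cnt });
    }
  }
  return result;
}

const sample = [
  { key: "Find me" },
  { key: "find me" },
  { key: "Don't find me" },
  { key: "don't find me either" },
];

console.log(findCaseDuplicates(sample)); // [ { _id: 'find me', cnt: 2 } ]
```

One difference to keep in mind: JavaScript's toLowerCase() handles non-ASCII letters, whereas the MongoDB documentation describes $toLower as well-defined only for ASCII strings, so results may differ for keys containing other characters.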
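Note: the code above is for the mongo shell. You can do pretty much the same from JavaScript (using the mongodb driver). Quotes around $group and the other operator names are optional in JavaScript object literals, but they are required wherever the pipeline is written as JSON.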
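As a rough sketch of the driver variant, the same pipeline can be kept in a constant and passed to aggregate(). The names duplicatePipeline and findDuplicateKeys are hypothetical, and the collection argument is assumed to be a Collection object obtained from the official mongodb Node.js driver; connection setup is left out.

```javascript
// The same pipeline as in the shell; operator names are plain object keys.
const duplicatePipeline = [
  { $group: { _id: { $toLower: "$key" }, cnt: { $sum: 1 } } },
  { $match: { cnt: { $gt: 1 } } },
];

// Hypothetical helper. `collection` is assumed to be a Collection from the
// official mongodb driver, for example the result of db.collection(name).
async function findDuplicateKeys(collection) {
  // aggregate() returns a cursor; toArray() collects all result documents.
  return collection.aggregate(duplicatePipeline).toArray();
}
```

Each returned document has the lowercased key in _id and the number of matching documents in cnt, just as in the shell output.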