I have a list of dictionaries in Python 3.5.2 that I am attempting to "deduplicate". All of the dictionaries are unique, but there is a specific key I would like to deduplicate on, keeping the dictionary with the most non-null values.
For example, I have the following list of dictionaries:
d1 = {"id":"a", "foo":"bar", "baz":"bat"}
d2 = {"id":"b", "foo":"bar", "baz":None}
d3 = {"id":"a", "foo":"bar", "baz":None}
d4 = {"id":"b", "foo":"bar", "baz":"bat"}
l = [d1, d2, d3, d4]
I would like to filter l to just dictionaries with unique id keys, keeping the dictionary that has the fewest nulls. In this case the function should keep d1 and d4.
What I attempted was to create a new key,val pair for "value count" like so:
for d in l:
    d['val_count'] = len(set([v for v in d.values() if v]))
What I am stuck on now is how to filter my list of dicts down to unique ids, keeping for each id the dict whose val_count is greatest.
I am open to other approaches, but I am unable to use pandas for this project due to resource constraints.
Expected output:
l = [{"id":"a", "foo":"bar", "baz":"bat"},
{"id":"b", "foo":"bar", "baz":"bat"}]
I would use groupby and just pick the first one from each group:
1) First sort your list by key (to create the groups) and by descending count of non-null values, so that the dict with the fewest nulls comes first within each id:
>>> l2=sorted(l, key=lambda d: (d['id'], -sum(1 for v in d.values() if v)))
2) Then group the sorted list by id and take the first element from each group iterator (named d below) that groupby yields:
>>> from itertools import groupby
>>> [next(d) for _,d in groupby(l2, key=lambda _d: _d['id'])]
[{'id': 'a', 'foo': 'bar', 'baz': 'bat'}, {'id': 'b', 'foo': 'bar', 'baz': 'bat'}]
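The two steps above can be wrapped into a reusable function. This is only a sketch: the names non_null_count and dedupe_by_key are made up for illustration, and it assumes "null" means exactly None, so it tests `v is not None` rather than the truthiness test used above (which would also treat 0 and '' as null).

```python
from itertools import groupby

def non_null_count(d):
    # Assumption: only None counts as null; 0 and '' are treated as real values.
    return sum(1 for v in d.values() if v is not None)

def dedupe_by_key(dicts, key="id"):
    # Sort by the key, then by descending non-null count. sorted() is stable,
    # so dicts that tie on both keep their original relative order.
    ordered = sorted(dicts, key=lambda d: (d[key], -non_null_count(d)))
    # groupby needs its input sorted by the grouping key; the first dict
    # of each group is the one with the most non-null values.
    return [next(group) for _, group in groupby(ordered, key=lambda d: d[key])]

d1 = {"id": "a", "foo": "bar", "baz": "bat"}
d2 = {"id": "b", "foo": "bar", "baz": None}
d3 = {"id": "a", "foo": "bar", "baz": None}
d4 = {"id": "b", "foo": "bar", "baz": "bat"}
l = [d1, d2, d3, d4]

print(dedupe_by_key(l))
```

For the sample list this keeps d1 and d4, with the result ordered by id.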
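If you want an explicit 'tie breaker' that selects the first dict when several have the same null count, you can decorate each dict with its list index using enumerate. The index must come last in the sort key, after the count, otherwise it overrides the count and the earliest dict for each id always wins:
>>> l2=sorted(enumerate(l), key=lambda t: (t[1]['id'], -sum(1 for v in t[1].values() if v), t[0]))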
>>> [next(d)[1] for _,d in groupby(l2, key=lambda t: t[1]['id'])]
I doubt that additional step is actually necessary though, since Python's sort (both list.sort and sorted) is stable: dicts with the same id and the same non-null count keep their original list order, so on a tie the first version already returns the one that appeared first. Use the first version unless you are sure you need the second.
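Since you said you are open to other approaches, here is a different technique from the groupby one: a single pass that remembers the best dict seen so far for each id, with no sorting. Treat it as a sketch under a few assumptions: dedupe_single_pass is a made-up name, "null" is taken to mean exactly None, and OrderedDict is used because plain dicts do not guarantee insertion order on Python 3.5.

```python
from collections import OrderedDict

def dedupe_single_pass(dicts, key="id"):
    # Maps each key value to (non-null count, dict). OrderedDict keeps ids in
    # first-seen order, and replacing a value does not move its position.
    best = OrderedDict()
    for d in dicts:
        # Assumption: only None counts as null.
        count = sum(1 for v in d.values() if v is not None)
        # Strict ">" means the earliest dict wins when counts tie.
        if d[key] not in best or count > best[d[key]][0]:
            best[d[key]] = (count, d)
    return [d for _, d in best.values()]

d1 = {"id": "a", "foo": "bar", "baz": "bat"}
d2 = {"id": "b", "foo": "bar", "baz": None}
d3 = {"id": "a", "foo": "bar", "baz": None}
d4 = {"id": "b", "foo": "bar", "baz": "bat"}

print(dedupe_single_pass([d1, d2, d3, d4]))
```

Unlike the sorted version, the result follows the order in which each id first appears in the input rather than sorted id order, and the ids do not need to be comparable with each other.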