More efficient way to get unique first occurrence from a Python dict

I have a very large file that I'm parsing, extracting a key and a value from each line. I want to keep only one key per distinct value; that is, I'm removing the duplicate values.

So it would look like:

{
A:1
B:2
C:3
D:2
E:2
F:3
G:1
}

and it would output:

{E:2,F:3,G:1}

It's a bit confusing because I don't really care what the key is. So E in the above could be replaced with B or D, F could be replaced with C, and G could be replaced with A.

Here is the best way I have found to do it, but it gets extremely slow as the file grows.

mapp = {}
value_holder = []

for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.append(mydict[i])

It has to scan through value_holder every time :( Is there a faster way to do this?

asked Oct 19 '25 by jwillis0720


1 Answer

Yes, a trivial change makes it much faster:

value_holder = set()

(Well, you also have to change the append to add. But still pretty simple.)

Using a set instead of a list means each lookup is O(1) instead of O(N), so the whole operation is O(N) instead of O(N^2). In other words, if you have 10,000 lines, you're doing 10,000 hash lookups instead of 50,000,000 comparisons.

One caveat with this solution, and all of the others posted, is that it requires the values to be hashable. If they're not hashable, but they are comparable, you can still get O(N log N) instead of O(N^2) by using a sorted set (e.g., from the blist library). If they're neither hashable nor sortable… well, you'll probably want to find some way to generate something hashable (or sortable) to use as a "first check", and then only walk the "first check" matches for actual matches, which will get you to O(NM), where M is the average number of hash collisions.

You might want to look at how unique_everseen is implemented in the itertools recipes in the standard library documentation.
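For reference, the `unique_everseen` recipe from the itertools documentation is a generator that yields each element the first time its key is seen. Here is a sketch of that recipe, applied to this problem by keying the dict's items on their value:

```python
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    """Yield unique elements, preserving order. Remembers all elements ever seen."""
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

mydict = {'A': 1, 'B': 2, 'C': 3, 'D': 2, 'E': 2, 'F': 3, 'G': 1}

# Keep the first (key, value) pair seen for each distinct value.
mapp = dict(unique_everseen(mydict.items(), key=lambda kv: kv[1]))
print(mapp)  # {'A': 1, 'B': 2, 'C': 3}
```

Since dicts preserve insertion order in Python 3.7+, this keeps the *first* key for each value; under older Pythons the surviving key is arbitrary.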

Note that dictionaries didn't guarantee an order before Python 3.7, so in that case there's no way to pick the "first" duplicate; you'll just get one arbitrarily. In which case, there's another way to do this:

inverted = {v:k for k, v in d.iteritems()}
reverted = {v:k for k, v in inverted.iteritems()}

(This is effectively a form of the decorate-process-undecorate idiom without any processing.)
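In Python 3, `iteritems` becomes `items`. Run on the sample data from the question, the double inversion keeps the *last* key seen for each value, because later keys overwrite earlier ones in the first comprehension:

```python
d = {'A': 1, 'B': 2, 'C': 3, 'D': 2, 'E': 2, 'F': 3, 'G': 1}

inverted = {v: k for k, v in d.items()}   # later keys overwrite earlier ones
reverted = {v: k for k, v in inverted.items()}
print(reverted)  # {'G': 1, 'E': 2, 'F': 3}
```

That happens to match the `{E:2, F:3, G:1}` output the question shows, which is fine here since the asker doesn't care which key survives.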

But instead of building up the dict and then filtering it, you can make things better (simpler, and faster, and more memory-efficient, and order-preserving) by filtering as you read. Basically, keep the set alongside the dict as you go along. For example, instead of this:

mydict = {}
for line in f:
    k, v = line.split(None, 1)
    mydict[k] = v

mapp = {}
value_holder = set()

for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.add(mydict[i])

Just do this:

mapp = {}
value_holder = set()
for line in f:
    k, v = line.split(None, 1)
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)

In fact, you may want to consider writing a one_to_one_dict that wraps this up (or search PyPI modules and ActiveState recipes to see if someone has already written it for you), so then you can just write:

mapp = one_to_one_dict()
for line in f:
    k, v = line.split(None, 1)
    mapp[k] = v
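The answer doesn't give an implementation of that wrapper; a minimal sketch of what it might look like (the class name and its exact semantics are assumptions, not an existing library API) is:

```python
class OneToOneDict(dict):
    """Hypothetical wrapper: a dict that silently ignores assignments
    whose value has already been stored under some other key."""

    def __init__(self):
        super().__init__()
        self._seen_values = set()

    def __setitem__(self, key, value):
        # Only accept the first key seen for each distinct value.
        if value not in self._seen_values:
            self._seen_values.add(value)
            super().__setitem__(key, value)

mapp = OneToOneDict()
for k, v in [('A', 1), ('B', 2), ('D', 2), ('F', 3)]:
    mapp[k] = v
print(dict(mapp))  # {'A': 1, 'B': 2, 'F': 3}
```

A real version would also need to handle `update`, `setdefault`, and deletion (removing the value from `_seen_values`) to be safe as a general-purpose container.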
answered Oct 21 '25 by abarnert


