Join two defaultdicts in Python

Question

I parsed a huge database of bibliographic records (about 20 million records). Each record has a unique ID field, a set of authors, and a set of terms/keywords that describe the main content of the record. For example, a typical bibliographic record looks like this:

ID: 001
Author: author1
Author: author2
Term: term1
Term: term2

First, I create two defaultdicts to store authors and terms:

from collections import defaultdict

d1 = defaultdict(lambda: defaultdict(list))
d2 = defaultdict(lambda: defaultdict(list))

Next, I populate authors:

d1['id001'] = ['author1', 'author2'] 
d1['id002'] = ['author3'] 
d1['id003'] = ['author1', 'author4']

and keywords:

d2['id001'] = ['term1', 'term2']  
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']

The problem is how to join these two dictionaries to obtain a data object that links authors to terms directly:

author1|term1,term2,term4
author2|term1,term2
author3|term2,term3
author4|term4

I have two questions:

Is the proposed approach appropriate, or should I store/represent the data in some other way?
Could you roughly suggest how to join the two dictionaries?

jpp · Accepted Answer

This is one way. Note that, as demonstrated below, you do not need nested dictionaries or a defaultdict for your initial step; plain dictionaries are enough.

from collections import defaultdict

d1 = {}
d2 = {}

d1['id001'] = ['author1', 'author2'] 
d1['id002'] = ['author3'] 
d1['id003'] = ['author1', 'author4'] 

d2['id001'] = ['term1', 'term2']  
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']

res = defaultdict(list)

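# Only consider record IDs that appear in both dictionaries.
for rec_id in set(d1) & set(d2):
    for author in d1[rec_id]:
        res[author].extend(d2[rec_id])

# Convert to a plain dict with each author's terms sorted.
res = {k: sorted(v) for k, v in res.items()}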

# {'author1': ['term1', 'term2', 'term4'],
#  'author2': ['term1', 'term2'],
#  'author3': ['term2', 'term3'],
#  'author4': ['term4']}
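
One caveat with the list-based version: if an author appears on several records that share a term, `extend` adds that term once per record. A minimal sketch of a variant that collects terms in sets instead, assuming repeated terms are unwanted (the expected output in the question lists each term once). The record `id004` is a hypothetical addition to the sample data, included only to trigger the repeat:

```python
from collections import defaultdict

# Sample data from the question, plus a hypothetical record id004
# that repeats term1 for author1.
d1 = {
    'id001': ['author1', 'author2'],
    'id002': ['author3'],
    'id003': ['author1', 'author4'],
    'id004': ['author1'],
}
d2 = {
    'id001': ['term1', 'term2'],
    'id002': ['term2', 'term3'],
    'id003': ['term4'],
    'id004': ['term1'],
}

res = defaultdict(set)

# Dict key views support set intersection directly.
for rec_id in d1.keys() & d2.keys():
    for author in d1[rec_id]:
        res[author].update(d2[rec_id])  # a set keeps each term only once

res = {author: sorted(terms) for author, terms in res.items()}

print(res['author1'])  # ['term1', 'term2', 'term4']
```

With lists and `extend`, the same input would give `author1` the term `term1` twice.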

Jean-François Fabre · Answer

The key to problems like this is to build temporary dictionaries that are "properly oriented" from the existing ones. Once that is done, the logic is much clearer, and the complexity stays good because every step is a direct dict lookup.

Here's my solution:

First, create a dict author => ids from d1.

Then create the result (a dict author => terms): loop over the author => ids dict and populate the result with the flattened values of d2.

d1=dict()
d2=dict()

d1['id001'] = ['author1', 'author2']
d1['id002'] = ['author3']
d1['id003'] = ['author1', 'author4']

d2['id001'] = ['term1', 'term2']
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']

import collections

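# Invert d1: map each author to the list of record IDs they appear on.
authors_id = collections.defaultdict(list)
for rec_id, authors in d1.items():
    for author in authors:
        authors_id[author].append(rec_id)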

print(dict(authors_id)) # convert to dict for clearer printing


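# For each author, flatten the terms of every record they appear on.
authors_term = collections.defaultdict(list)
for author, rec_ids in authors_id.items():
    for rec_id in rec_ids:
        for term in d2[rec_id]:
            authors_term[author].append(term)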

print(dict(authors_term)) # convert to dict for clearer printing

Result:

{'author4': ['id003'], 'author3': ['id002'], 'author1': ['id001', 'id003'], 'author2': ['id001']}
{'author3': ['term2', 'term3'], 'author4': ['term4'], 'author1': ['term1', 'term2', 'term4'], 'author2': ['term1', 'term2']}
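
On the first question (whether two separate dictionaries are the right representation), one possible alternative is to skip the intermediate dictionaries and fill the author => terms mapping while parsing. This is only a sketch, under the assumption that the parser can yield each record's authors and terms together; `records` and `link_authors_to_terms` are hypothetical names, not part of the question's code:

```python
from collections import defaultdict

def link_authors_to_terms(records):
    """Build author => sorted unique terms from (id, authors, terms) tuples."""
    author_terms = defaultdict(set)
    for _rec_id, authors, terms in records:
        for author in authors:
            author_terms[author].update(terms)
    return {author: sorted(terms) for author, terms in author_terms.items()}

# Hypothetical stand-in for the parser's output, using the question's sample data.
records = [
    ('id001', ['author1', 'author2'], ['term1', 'term2']),
    ('id002', ['author3'], ['term2', 'term3']),
    ('id003', ['author1', 'author4'], ['term4']),
]

result = link_authors_to_terms(records)

# Print in the author|term,term format requested in the question.
for author in sorted(result):
    print(f"{author}|{','.join(result[author])}")
# author1|term1,term2,term4
# author2|term1,term2
# author3|term2,term3
# author4|term4
```

Because `records` can be any iterable, a generator that reads the database record by record would work here too, so only the final mapping has to be held in memory rather than two per-ID dictionaries.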

Tags: python, dictionary, defaultdict

Andrej
