I parsed a huge database of bibliographic records (about 20 million records). Each record has a unique ID field, a set of authors, and a set of terms/keywords that describe the main content of the record. For example, a typical bibliographic record looks like:
ID: 001
Author: author1
Author: author2
Term: term1
Term: term2
First, I create two defaultdicts to store authors and terms:
from collections import defaultdict
d1 = defaultdict(lambda: defaultdict(list))
d2 = defaultdict(lambda: defaultdict(list))
Next, I populate authors:
d1['id001'] = ['author1', 'author2']
d1['id002'] = ['author3']
d1['id003'] = ['author1', 'author4']
and keywords:
d2['id001'] = ['term1', 'term2']
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']
The problem is how to join these two dictionaries to obtain a data object that links authors directly to terms:
author1|term1,term2,term4
author2|term1,term2
author3|term2,term3
author4|term4
I have two questions:
This is one way. Note that, as demonstrated below, you do not need nested dictionaries or a defaultdict for your initial step.
from collections import defaultdict
d1 = {}
d2 = {}
d1['id001'] = ['author1', 'author2']
d1['id002'] = ['author3']
d1['id003'] = ['author1', 'author4']
d2['id001'] = ['term1', 'term2']
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']
res = defaultdict(list)
for ids in set(d1) & set(d2):
    for v in d1[ids]:
        res[v].extend(d2[ids])
res = {k: sorted(v) for k, v in res.items()}
# {'author1': ['term1', 'term2', 'term4'],
# 'author2': ['term1', 'term2'],
# 'author3': ['term2', 'term3'],
# 'author4': ['term4']}
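One caveat with the list-based version: if an author has the same term on more than one record, `extend` keeps the duplicates. A minimal sketch that collects terms in sets instead, assuming duplicates should be collapsed. The record `id004` is hypothetical and added only to show the effect; it is not in the question's data.

```python
from collections import defaultdict

# Sample data from the question, plus one hypothetical record (id004)
# that repeats a term author1 already has.
d1 = {
    'id001': ['author1', 'author2'],
    'id002': ['author3'],
    'id003': ['author1', 'author4'],
    'id004': ['author1'],
}
d2 = {
    'id001': ['term1', 'term2'],
    'id002': ['term2', 'term3'],
    'id003': ['term4'],
    'id004': ['term1'],
}

res = defaultdict(set)
# Only visit IDs present in both dictionaries.
for rec_id in d1.keys() & d2.keys():
    for author in d1[rec_id]:
        res[author].update(d2[rec_id])  # set.update drops repeated terms

# Convert each set to a sorted list for stable output.
res = {author: sorted(terms) for author, terms in res.items()}
print(res['author1'])  # ['term1', 'term2', 'term4']
```

With lists, `author1` would end up with `term1` twice here.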
The key to problems like this is to build temporary dictionaries that are "properly oriented" from the existing ones. Once that is done, the rest is much clearer (and the complexity is good thanks to fast dict lookups).
Here's my solution:
First, create a dict author => ids from d1.
Then create the result (a dict author => terms): loop over the author => ids dict and populate the result with the flattened values of d2.
d1=dict()
d2=dict()
d1['id001'] = ['author1', 'author2']
d1['id002'] = ['author3']
d1['id003'] = ['author1', 'author4']
d2['id001'] = ['term1', 'term2']
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']
import collections
authors_id = collections.defaultdict(list)
for k, v in d1.items():
    for a in v:
        authors_id[a].append(k)
print(dict(authors_id)) # convert to dict for clearer printing
authors_term = collections.defaultdict(list)
for author, ids in authors_id.items():
    for rec_id in ids:
        for term in d2[rec_id]:
            authors_term[author].append(term)
print(dict(authors_term)) # convert to dict for clearer printing
Result:
{'author4': ['id003'], 'author3': ['id002'], 'author1': ['id001', 'id003'], 'author2': ['id001']}
{'author3': ['term2', 'term3'], 'author4': ['term4'], 'author1': ['term1', 'term2', 'term4'], 'author2': ['term1', 'term2']}
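With around 20 million records, it may also be worth skipping d1 and d2 entirely and building the author => terms mapping in a single pass while parsing. A rough sketch, assuming the input is a stream of `ID:` / `Author:` / `Term:` lines as in the question and that each record starts with its `ID:` line. The function name `link_authors_to_terms` and the helper `_flush` are made up for illustration.

```python
from collections import defaultdict

def _flush(result, authors, terms):
    """Attach every term of the finished record to each of its authors."""
    for author in authors:
        result[author].update(terms)

def link_authors_to_terms(lines):
    """Build author => sorted terms from 'Field: value' lines in one pass."""
    result = defaultdict(set)
    authors, terms = [], []
    for line in lines:
        field, _, value = line.partition(':')
        field, value = field.strip(), value.strip()
        if field == 'ID':
            # A new record begins: store the previous one and reset.
            _flush(result, authors, terms)
            authors, terms = [], []
        elif field == 'Author':
            authors.append(value)
        elif field == 'Term':
            terms.append(value)
    _flush(result, authors, terms)  # do not forget the last record
    return {author: sorted(t) for author, t in result.items()}

sample = [
    'ID: 001', 'Author: author1', 'Author: author2', 'Term: term1', 'Term: term2',
    'ID: 002', 'Author: author3', 'Term: term2', 'Term: term3',
    'ID: 003', 'Author: author1', 'Author: author4', 'Term: term4',
]
linked = link_authors_to_terms(sample)
print(linked['author1'])  # ['term1', 'term2', 'term4']
```

Only one record's authors and terms are held at a time besides the result itself, so the per-ID dictionaries never need to exist in memory.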