I'm experiencing serious slow downs when working with dictionaries as the dictionary grows to a few thousands keys.
I'm processing a file with ~1,000,000 lines of data, I'm constructing a graph like data structure using dictionaries
here's my bottle neck function
def create_edge(node_a, node_b, graph):
    if node_a not in graph.keys():
        graph[node_a] = {node_b: 1}
    elif node_b in graph[node_a].keys():
        graph[node_a][node_b] += 1
    else:
        graph[node_a][node_b] = 1
create_edge will create and edge from node_a to node_b, or add 1 to the weight of an already existing edge between them.
Since my nodes are identified by a string unique id, I'm using dictionaries for storage, assuming that search if a key exist, and insert would take O(1) on average.
If I comment out create_edge I can process around 20,000 records per seconds, with create_edge as a part of my pipeline it's about 20 records per sec.
The first 100 records takes about 500ms to process.
When the dictionary size grows to around 10,000 - Processing 100 records take about 15,000ms, every record process invokes create_edge about 4 times on average - So 400 calls for create_edge takes 15 seconds when the dictionary size is 10,000.
First, do these runtimes make sense? Seems way to much to me, correct me if I'm wrong.
Second, suggestions of optimizing the dictionary usage for better run time would be greatly appreciated.
I'm expecting the dictionary size to be at least 100,000 to complete processing for the entire 1,000,000 records.
Edit: Conclusions
You were right on the money, did two noob mistakes here.. :)
The keys() calls increase complexity dramaticly, taking it from constant time to poly time (squared) per edge insertion, replacing if node in graph.keys()  with if node in graph produces a constant process time of 100 records in ~300ms.
Second mistake was virtualenv config which led me to believe im using python3 while I was actualy using python2.
python3 do optimizes the keys() code into a constant time search, which is good for run time but less for proper code style.
I performed a run time comparison after removing the keys() calls. 
# graph = {}
python version: 3.6.3
start time 11:44:56
Number of records: 1029493
graph created, 1231630 nodes
end time 11:50:35
total ~05:39
# graph = defaultdict(lambda : defaultdict(int))
python version: 3.6.3
start time 11:54:52
Number of records: 1029493
graph created, 1231630 nodes
end time  12:00:34
total ~05:42
# graph = {}
python version: 2.7.10
start time 12:03:25
Number of records: 1029493
graph created, 1231630 nodes
end time 12:09:40
total ~06:15
Can't u just use a defaultdict with defaultdict(int) as seen here: Python: defaultdict of defaultdict?
from collections import defaultdict
graph = defaultdict(lambda : defaultdict(int))
graph['a']['b'] += 1
graph['a']['b'] += 1
graph['a']['c'] += 1
graph
Returns:
defaultdict(<function __main__.<lambda>>,
            {'a': defaultdict(int, {'b': 2, 'c': 1})})
# equal to: {'a': {'b': 2, 'c': 1}}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With