Memory efficient way of keeping track of unique entries

Tags:

python

I have a folder of files totalling around 50 GB. Each file consists of line after line of JSON data, and each JSON object contains a user_id field.

I need to count the number of unique user IDs across all of the files (I only need the total count). What is the most memory-efficient and relatively quick way of counting these?

Of course, loading everything into a huge list probably isn't the best option. I tried pandas, but it took quite a while. I then tried simply writing the IDs to text files, but I thought I'd check whether I was missing something far simpler.

Alexander Hepburn asked Oct 21 '25 17:10


2 Answers

Since it was stated that the JSON context of the user_id does not matter, we can simply treat the JSON files as the plain text files they are.

GNU tools solution

I'd not use Python at all for this, but rather rely on the tools provided by GNU, and pipes:

cat *.json | sed -nE 's/.*"user_id"\s*:\s*"([0-9]+)".*/\1/p' | sort -un --parallel=4 | wc -l
  • cat *.json: Output the contents of all files to stdout
  • sed -nE 's/.*"user_id"\s*:\s*"([0-9]+)".*/\1/p': Look for lines containing "user_id": "{number}" and print only the captured number to stdout (the leading and trailing .* ensure nothing else from the line is printed)
  • sort -un --parallel=4: Sort the output numerically, discarding duplicates (i.e. output only unique values), using multiple (4) parallel jobs, and output to stdout
  • wc -l: Count the number of lines, and output to stdout

To determine whether the values are unique, we just sort them. You can speed up the sorting by specifying a higher number of parallel jobs, depending on your core count.

Python solution

If you want to use Python nonetheless, I'd recommend using a set and the re module (regular expressions):

import fileinput
import re

# Capture the numeric value of a "user_id": "<digits>" field anywhere on a line
r = re.compile(r'"user_id"\s*:\s*"([0-9]+)"')

s = set()
for line in fileinput.input():
    m = r.search(line)  # search, not match: the field is usually not at the start of the line
    if m:
        s.add(m.group(1))

print(len(s))

Run this using python3 <scriptname>.py *.json.
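Since memory is the main concern in the question, here is a small optional variant (a sketch, assuming the IDs are always purely numeric, as the regex above already requires): store the IDs in the set as ints rather than strings, which typically makes each entry a bit smaller.

import fileinput
import re

r = re.compile(r'"user_id"\s*:\s*"([0-9]+)"')

s = set()
for line in fileinput.input():
    m = r.search(line)
    if m:
        s.add(int(m.group(1)))  # ints typically take less memory than the equivalent digit strings

print(len(s))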

Jan Christoph Terasa answered Oct 24 '25 08:10


Since you only need the user_ids, load each .json file (as a data structure), extract any IDs, then destroy all references to that structure and any of its parts so that it's garbage-collected.

To speed up the process, you can handle several files in parallel; take a look at multiprocessing.Pool.map.
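A minimal sketch of that idea, assuming each line of a file is its own JSON object (as the question describes) and that the files sit directly in the working directory. It uses Pool.imap_unordered rather than Pool.map so the per-file ID sets can be merged as soon as each worker finishes; the pool size of 4 is just an example.

import glob
import json
from multiprocessing import Pool

def ids_in_file(path):
    # Parse one file line by line and return the set of user_ids found in it.
    ids = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            user_id = record.get("user_id")
            if user_id is not None:
                ids.add(user_id)
    # The parsed objects go out of scope when the function returns, so they
    # can be garbage-collected; only the small set of IDs is sent back.
    return ids

if __name__ == "__main__":
    unique_ids = set()
    with Pool(processes=4) as pool:
        # Workers parse whole files; the parent process only receives and
        # merges the per-file ID sets.
        for file_ids in pool.imap_unordered(ids_in_file, glob.glob("*.json")):
            unique_ids.update(file_ids)
    print(len(unique_ids))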

ivan_pozdeev answered Oct 24 '25 08:10


