Memory efficient way of keeping track of unique entries

Tags:

python

I have a folder of files totalling around 50 GB. Each file consists of line after line of JSON data, and each JSON object contains a user_id field.

I need to count the number of unique user IDs across all of the files (I only need the total count). What is the most memory-efficient and relatively quick way of counting these?

Of course, loading everything into a huge list probably isn't the best option. I tried pandas, but it took quite a while. I then tried simply writing the IDs to text files, but I thought I'd check whether I was missing something far simpler.

Alexander Hepburn asked Oct 21 '25 17:10


2 Answers

Since it was stated that the JSON context of the user_id does not matter, we can simply treat the JSON files as the plain text files they are.

GNU tools solution

I'd not use Python at all for this, but rather rely on the tools provided by GNU, and pipes:

cat *.json | sed -nE 's/.*"user_id"\s*:\s*"([0-9]+)".*/\1/p' | sort -un --parallel=4 | wc -l
  • cat *.json: Output the contents of all files to stdout
  • sed -nE 's/.*"user_id"\s*:\s*"([0-9]+)".*/\1/p': Look for lines containing "user_id": "{number}" and print only the captured number to stdout (the leading and trailing .* ensure nothing else from the line is printed)
  • sort -un --parallel=4: Sort the output numerically, discarding duplicates (i.e. output only unique values), using multiple (4) parallel jobs, and output to stdout
  • wc -l: Count the number of lines, and output to stdout

To determine whether the values are unique, we just sort them. You can speed up the sorting by specifying a higher number of parallel jobs, depending on your core count.

Python solution

If you want to use Python nonetheless, I'd recommend using a set and the re module (regular expressions):

import fileinput
import re

# Capture the numeric value of a "user_id": "<digits>" field anywhere on a line
r = re.compile(r'"user_id"\s*:\s*"([0-9]+)"')

s = set()
for line in fileinput.input():
    m = r.search(line)  # search, not match: the field is usually not at the start of the line
    if m:
        s.add(m.group(1))

print(len(s))

Run this using python3 <scriptname>.py *.json.
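Since memory is the main concern in the question, here is a small optional variant (a sketch, assuming the IDs are always purely numeric, as the regex above already requires): store the IDs in the set as ints rather than strings, which typically makes each entry a bit smaller.

import fileinput
import re

r = re.compile(r'"user_id"\s*:\s*"([0-9]+)"')

s = set()
for line in fileinput.input():
    m = r.search(line)
    if m:
        s.add(int(m.group(1)))  # ints typically take less memory than the equivalent digit strings

print(len(s))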

Jan Christoph Terasa answered Oct 24 '25 08:10


Since you only need the user_ids, load each .json file (as a data structure), extract any IDs, then destroy all references to that structure and any of its parts so that it's garbage-collected.

To speed up the process, you can handle several files in parallel; take a look at multiprocessing.Pool.map.
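A minimal sketch of that idea, assuming each line of a file is its own JSON object (as the question describes) and that the files sit directly in the working directory. It uses Pool.imap_unordered rather than Pool.map so the per-file ID sets can be merged as soon as each worker finishes; the pool size of 4 is just an example.

import glob
import json
from multiprocessing import Pool

def ids_in_file(path):
    # Parse one file line by line and return the set of user_ids found in it.
    ids = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            user_id = record.get("user_id")
            if user_id is not None:
                ids.add(user_id)
    # The parsed objects go out of scope when the function returns, so they
    # can be garbage-collected; only the small set of IDs is sent back.
    return ids

if __name__ == "__main__":
    unique_ids = set()
    with Pool(processes=4) as pool:
        # Workers parse whole files; the parent process only receives and
        # merges the per-file ID sets.
        for file_ids in pool.imap_unordered(ids_in_file, glob.glob("*.json")):
            unique_ids.update(file_ids)
    print(len(unique_ids))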

ivan_pozdeev answered Oct 24 '25 08:10


