I have a large file (50 GB) and I would like to count the number of occurrences of different lines in it. Normally I'd use
sort bigfile | uniq -c
but the file is large enough that sorting takes a prohibitive amount of time and memory. I could do
grep -cFx 'one possible line' bigfile
for each unique line in the file, but this would mean one pass over the file per possible line (n passes in total), which (although much more memory-friendly) takes even longer than the original.
Any ideas?
A related question asks about a way to find unique lines in a big file, but I'm looking for a way to count the number of instances of each -- I already know what the possible lines are.
Use awk:
awk '{c[$0]++} END {for (line in c) print c[line], line}' bigfile.txt
This is O(n) in time, and O(unique lines) in space.
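If you want the most frequent lines first, one option is to sort awk's output rather than its input; only the distinct lines get sorted, so this should stay cheap as long as there are few of them. A minimal sketch, where the function name count_lines is just a placeholder of mine:

```shell
# count_lines FILE
# Prints "count line" for every distinct line in FILE, highest count first.
# (count_lines is a hypothetical helper name, not a standard tool.)
count_lines() {
    # Single pass over the file: one counter per distinct line.
    awk '{ c[$0]++ } END { for (line in c) print c[line], line }' "$1" |
        # Sort the (small) summary: count descending, then line text.
        sort -k1,1nr -k2
}
```

You would call it as count_lines bigfile.txt. Memory use is still governed by the number of distinct lines, not by the size of the file.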
Here is a solution using jq 1.5. It is essentially the same as the awk solution, both in approach and performance characteristics, but the output is a JSON object representing the hash. (The program can be trivially modified to produce output in an alternative format.)
Invocation:
$ jq -nR 'reduce inputs as $line ({}; .[$line] += 1)' bigfile.txt
If bigfile.txt consisted of these lines:
a
a
b
a
c
then the output would be:
{
  "a": 3,
  "b": 1,
  "c": 1
}
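As an illustration of the alternative-format tweak, the sketch below prints plain "count line" pairs like the awk version does. The function name jq_count_lines is made up for this example; -r tells jq to emit raw strings instead of JSON-quoted ones.

```shell
# jq_count_lines FILE
# Same counting approach as above, but prints "count line" text
# instead of a JSON object. (jq_count_lines is a hypothetical name.)
jq_count_lines() {
    jq -nRr '
        reduce inputs as $line ({}; .[$line] += 1)   # build the count hash
        | to_entries[]                               # one {key, value} per distinct line
        | "\(.value) \(.key)"                        # format as "count line"
    ' "$1"
}
```

For the five-line sample above, this should print 3 a, 1 b and 1 c on separate lines.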