What options are there for making word counts on very large files?
I believe the whole file is on 1 line, which may be part of the problem as pointed out in one of the answers below.
In this case, I have an XML file of 1.7 GB and I am trying to count some things inside it quickly.
I found this post Count number of occurrences of a pattern in a file (even on same line) and the approach works for me up to a certain size.
Up to around 300 MB (roughly 40 000 occurrences) it was fine doing
cat file.xml | grep -o xmltag | wc -l
but above that size I get "grep: memory exhausted".
How many newlines are in your file.xml? If one of your lines is extremely long, that might explain why grep fails with "grep: memory exhausted".
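A quick way to check (note that wc -l counts newline characters, so a single-line file will report 0):
wc -l file.xml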
A solution to that is to introduce \n at places where it does not matter, say before every </:
cat big.xml | perl -e 'while(sysread(STDIN,$buf,32768)){ $buf=~s:</:\n</:g; syswrite(STDOUT,$buf); }'
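With newlines inserted, your original counting pipeline should work again; a sketch of the combined command:
cat big.xml | perl -e 'while(sysread(STDIN,$buf,32768)){ $buf=~s:</:\n</:g; syswrite(STDOUT,$buf); }' | grep -o xmltag | wc -l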
GNU Parallel can chop the big file into smaller chunks. Again you will need to find good chopping places that are not in the middle of a match. For XML a good place will often be between > and <:
parallel -a big.xml --pipepart --recend '>' --recstart '<' --block 10M grep -o xmltag
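GNU Parallel serializes the output of the chunks on stdout, so to get the total count you can pipe the combined matches through wc -l, for example:
parallel -a big.xml --pipepart --recend '>' --recstart '<' --block 10M grep -o xmltag | wc -l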
Even better is an end tag that represents the end of a record:
parallel -a big.xml --pipepart --recend '</endrecord>' --block 10M grep -o xmltag
Note that --pipepart is a relatively new option, so you need version 20140622 or later.
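To check whether your installed GNU Parallel is recent enough, something like --minversion should do (it exits non-zero if the installed version is older than the one given):
parallel --minversion 20140622 && echo "--pipepart is available"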