 

Count word occurrences in a very large file ("memory exhausted" when running grep -o foo | wc -l)

What options are there for making word counts on very large files?

I believe the whole file is on 1 line, which may be part of the problem as pointed out in one of the answers below.

In this case, I have a 1.7 GB XML file and am trying to count some things inside it quickly.

I found this post Count number of occurrences of a pattern in a file (even on same line) and the approach works for me up to a certain size.

Up to 300 MB or so (about 40,000 occurrences), this worked fine:

cat file.xml | grep -o xmltag | wc -l    

but above that size, I get "memory exhausted".
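One workaround (a sketch, not from the question itself) is to break the single giant line before grep ever sees it, for example by turning every > into a newline with tr. This assumes the pattern being counted (here the placeholder xmltag) never contains a > character:

```shell
# Split the one huge line on '>' so grep can work line by line.
# "xmltag" is a placeholder for the tag name you actually want to count,
# and must not itself contain '>'.
tr '>' '\n' < file.xml | grep -o xmltag | wc -l
```

Because tr streams byte by byte, this uses constant memory regardless of file size.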

asked Dec 06 '25 by user985366

1 Answer

How many newlines are in your file.xml? If one of your lines is extremely long, that might explain why grep fails with "grep: memory exhausted".
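As a quick check (a sketch; note that wc -L is a GNU coreutils extension), you can confirm the diagnosis by counting newlines and measuring the longest line:

```shell
# 0 (or very few) newlines means the whole file is effectively one line.
wc -l file.xml
# Length of the longest line (GNU coreutils only); a huge number confirms
# why line-oriented tools like grep run out of memory.
wc -L file.xml
```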

A solution to that is to introduce newlines (\n) at places where they do not matter. Say, before every </:

cat big.xml | perl -e 'while(sysread(STDIN,$buf,32768)){ $buf =~ s:</:\n</:g; syswrite(STDOUT,$buf); }'

GNU Parallel can chop the big file into smaller chunks. Again you will need to find good chopping places that are not in the middle of a match. For XML a good place will often be between > and <:

parallel -a big.xml --pipepart --recend '>' --recstart '<' --block 10M grep -o xmltag

Even better is an end tag that represents the end of a record:

parallel -a big.xml --pipepart --recend '</endrecord>' --block 10M grep -o xmltag
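Since grep -o prints one match per line, the per-chunk outputs can simply be concatenated and counted with wc -l to get the total (a sketch, assuming GNU Parallel is installed and xmltag/endrecord are placeholders for your actual tags):

```shell
# Each chunk is grepped in parallel; wc -l totals the matches across all chunks.
parallel -a big.xml --pipepart --recend '</endrecord>' --block 10M grep -o xmltag | wc -l
```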

Note that --pipepart is a relatively new option, so you need version 20140622 or later.

answered Dec 07 '25 by Ole Tange

