What options are there for making word counts on very large files?
I believe the whole file is on 1 line, which may be part of the problem as pointed out in one of the answers below.
In this case, I have an XML file of 1.7 GB and I am trying to count some things inside it quickly.
I found this post Count number of occurrences of a pattern in a file (even on same line) and the approach works for me up to a certain size.
Up to around 300 MB (roughly 40 000 occurrences) it was fine doing
cat file.xml | grep -o xmltag | wc -l
but above that size I get "grep: memory exhausted".
How many newlines are in your file.xml? If one of your lines is extremely long, that might explain why grep fails with "grep: memory exhausted".
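A quick way to check (note that wc -l counts newline characters, so a single-line file will report 0):
wc -l file.xml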
A solution to that is to introduce \n at places where it does not matter, say before every </:
cat big.xml | perl -e 'while(sysread(STDIN,$buf,32768)){ $buf=~s:</:\n</:g; syswrite(STDOUT,$buf); }'
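With newlines inserted, your original counting pipeline should work again; a sketch of the combined command:
cat big.xml | perl -e 'while(sysread(STDIN,$buf,32768)){ $buf=~s:</:\n</:g; syswrite(STDOUT,$buf); }' | grep -o xmltag | wc -l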
GNU Parallel can chop the big file into smaller chunks. Again you will need to find good chopping places that are not in the middle of a match. For XML a good place will often be between > and <:
parallel -a big.xml --pipepart --recend '>' --recstart '<' --block 10M grep -o xmltag
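GNU Parallel serializes the output of the chunks on stdout, so to get the total count you can pipe the combined matches through wc -l, for example:
parallel -a big.xml --pipepart --recend '>' --recstart '<' --block 10M grep -o xmltag | wc -l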
Even better is an end tag that represents the end of a record:
parallel -a big.xml --pipepart --recend '</endrecord>' --block 10M grep -o xmltag
Note that --pipepart is a relatively new option, so you need version 20140622 or later.
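To check whether your installed GNU Parallel is recent enough, something like --minversion should do (it exits non-zero if the installed version is older than the one given):
parallel --minversion 20140622 && echo "--pipepart is available"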