file.xml is a large 74G file, and I have to grep a single regular expression against it as fast as possible. I'm trying to do this with GNU parallel:
parallel --pipe --block 10M --ungroup LC_ALL=C grep -i "test.*pattern" < file.xml
How can I implement this with --pipepart, since it's faster than --pipe?
Does it get faster if I increase or decrease the block size (for example, 20M instead of 10M, or 10M instead of 20M)?
1.) The largest XML file I have is 11G, so YMMV, but parallel --pipepart LC_ALL=C grep -H -n 'searchterm' {} :::: file.xml was faster than the parallel --pipe --block 10M command from your question, and significantly faster than plain grep "searchterm" file.xml.
2.) I didn't specify a block size for the parallel --pipepart command above, but you can set one with the --block option. You'll need to try different block sizes yourself to see whether they speed up or slow down the search. On my system, --block -1 was fastest for this approach; with --pipepart, a negative value sets the number of blocks per job slot, so -1 gives each job slot a single block.
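One way to try different block sizes is a small timing loop. The following is only a rough sketch: FILE, PATTERN, the time_cmd helper, and the list of block sizes are placeholders I made up for illustration, and it uses the -a form of naming the input file rather than ::::.

```shell
#!/bin/sh
# Sketch: time the same --pipepart search with several --block values.
# FILE and PATTERN are placeholders; set them to your own file and search term.
# Keep PATTERN free of shell metacharacters, because parallel hands the
# command line to a shell again.
FILE="${FILE:-file.xml}"
PATTERN="${PATTERN:-searchterm}"

# Hypothetical helper: run a command, discard its output, and print the
# elapsed wall-clock time in whole seconds.
time_cmd() {
    tc_start=$(date +%s)
    "$@" > /dev/null 2>&1
    tc_end=$(date +%s)
    echo $(( tc_end - tc_start ))
}

# Only run the benchmark when parallel is installed and the file is readable.
if command -v parallel > /dev/null 2>&1 && [ -r "$FILE" ]; then
    for block in -1 -10 100M 1G; do
        secs=$(time_cmd parallel --pipepart -a "$FILE" --block "$block" \
            LC_ALL=C grep -c "$PATTERN")
        echo "--block $block: ${secs}s"
    done
fi
```

Whole-second resolution is coarse but should be enough for a file of this size. Later runs may benefit from whatever part of the file is already in the page cache, so it is worth repeating the loop a couple of times before drawing conclusions.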
As @tshiono mentioned in the comments, try ripgrep. It was the fastest option on my test XML file (quicker than grep, parallel + grep, or anything else I tried) and may prove to be a better solution for you overall.
EDIT
I tested @Ole Tange's suggested 'parallel + ripgrep' approach (parallel --pipepart --block -1 LC_ALL=C rg 'Glu299SerfsTer21' {} :::: ClinVarFullRelease_00-latest.xml) and it took about the same time as rg 'Glu299SerfsTer21' ClinVarFullRelease_00-latest.xml on my system. The difference was negligible on my file, although the 'parallel + rg' approach may still prove best for a much larger XML file. There are a number of potential reasons I didn't see the expected speedup, e.g. @Gordon Davisson's suggestions in his comment above, but you would need to conduct comprehensive benchmarking on your own system to figure out the best approach.
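If you want to run that comparison yourself, a rough sketch like the one below times both variants and also checks that they report the same number of matching lines. FILE, PATTERN, and the sum_counts helper are placeholders of mine, not part of either tool. The counts should agree for ordinary single-line patterns, on the assumption that --pipepart splits the file at line boundaries so every line lands in exactly one block.

```shell
#!/bin/sh
# Sketch: compare plain rg with parallel + rg on the same file.
# FILE and PATTERN are placeholders; keep PATTERN free of shell
# metacharacters, because parallel hands the command line to a shell again.
FILE="${FILE:-file.xml}"
PATTERN="${PATTERN:-searchterm}"

# Hypothetical helper: add up the per-block counts printed by "rg -c"
# (one number per line); prints 0 when there is no input at all.
sum_counts() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Only run when both tools are installed and the file is readable.
if command -v parallel > /dev/null 2>&1 &&
   command -v rg > /dev/null 2>&1 && [ -r "$FILE" ]; then
    start=$(date +%s)
    plain=$(rg -c -I "$PATTERN" "$FILE" | sum_counts)
    mid=$(date +%s)
    # "-" tells rg to read the block that parallel supplies on stdin.
    split=$(parallel --pipepart -a "$FILE" --block -1 \
        rg -c -I "$PATTERN" - | sum_counts)
    end=$(date +%s)
    echo "rg alone:      $plain matching lines in $(( mid - start ))s"
    echo "parallel + rg: $split matching lines in $(( end - mid ))s"
fi
```

If the two counts differ, the pattern probably depends on context across lines, and the timings are not comparing like with like.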
(Thanks Ole Tange for the suggestion and for creating such kick ass software)