I have a 100GB text file. The data in that file is in this format:
email||username||password_hash
For testing, I am working on a 6GB file that I split off from the bigger file.
I am running grep to match the lines and output them.
I used plain grep first, and it takes around 1 minute 22 seconds.
I then added options such as LC_ALL=C and -F, which only brought the time down to 1 minute 15 seconds; that is still not good for a 6GB file.
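Roughly, this is the fastest grep invocation I tried (the search string and file name here are just placeholders):

    LC_ALL=C grep -F 'someuser' sample_6gb.txt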
Then I tried ripgrep; it takes 27 seconds on my machine, which is still not good.
With ripgrep's -F option it drops to 14 seconds, but that is still not good enough.
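The corresponding ripgrep invocation (same placeholders):

    rg -F 'someuser' sample_6gb.txt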
I also tried ag (The Silver Searcher), but I found it does not work on files larger than 2 GB.
I need help choosing a command-line tool (or language) that gives better results, or some way to take advantage of the data format and search by column. For example, when I am searching by username, instead of matching the whole line I would match only on the second column. I tried that with awk, but it is even slower; my attempt is sketched below.
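This is roughly what the awk attempt looked like (the [|][|] character classes are just a way to treat || as a literal field separator; username and file name are placeholders):

    awk -F'[|][|]' '$2 == "someuser"' sample_6gb.txt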
If you have to do this just once: Use grep and wait until it finishes.
If searching for strings in a 100GB delimited text file is part of your regular process, then you'll have to change the process. Options are: load the data into a database instead of searching a flat text file (see the sketch below), or use map/reduce to spread the load across multiple machines and cores (Hadoop), ...
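As a minimal sketch of the database route, assuming the ||-delimited layout from the question, you could do a one-time import into SQLite and index the username column; after that, a username lookup is an indexed search instead of a full scan. The file, database, table, and column names below are made up for illustration.

    # one-time: convert "||" to tabs so sqlite3 can import the file
    # (assumes none of the fields ever contain a tab character)
    awk -F'[|][|]' -v OFS='\t' '{ print $1, $2, $3 }' sample_6gb.txt > sample_6gb.tsv

    # one-time: create the table, load the data, and index the username column
    sqlite3 creds.db 'CREATE TABLE creds (email TEXT, username TEXT, password_hash TEXT);'
    printf '.mode tabs\n.import sample_6gb.tsv creds\n' | sqlite3 creds.db
    sqlite3 creds.db 'CREATE INDEX idx_creds_username ON creds(username);'

    # every later lookup uses the index instead of scanning the whole file
    sqlite3 creds.db "SELECT * FROM creds WHERE username = 'someuser';"

The import and the index build are slow once, but they turn every subsequent search into a fast indexed lookup instead of a multi-minute scan.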