I have two files, for example:
File1:
partial
line3
someline2
File2:
this is line3
this is partial
typo artial
someline2
someline
Requirement: print every line of File2 that does not contain any of the strings listed in File1.
Expected result:
typo artial
someline
I tested with python but it is extremely slow. Also tested with grep and it is nearly as slow as python.
The files I am comparing can be up to 10GB in size. Memory on the server is not an issue, but I would rather not waste resources.
Testing results based on answers:
Files used for testing:
Using grep:
# time grep -v -f file1 file2 > file3
real 28m50.078s
user 27m13.984s
sys 1m36.068s
# wc -l file3
1947790 file3
Grep with -F:
# time grep -v -F -f file1 file2 > file3
real 0m1.441s
user 0m1.400s
sys 0m0.040s
# wc -l file3
1950655 file3
Using perl posted by Borodin:
# time ./clean.pl > file3
real 0m2.281s
user 0m2.176s
sys 0m0.104s
# wc -l file3
1950655 file3
To be honest, I did not expect fixed strings to make such a big difference for grep. So far grep wins this. I still have to test with 10GB files and time it, after making sure the results are correct. I will be back with an update.
Update
Perl wins this one, since I had to introduce some regex for a few special cases. For instance, I have a big file with domains and I want to exclude those from another file. That means I need domain$ as the regex; otherwise google.co would also match google.com, which is not OK. If you do not have that special case (I had it for some files only), grep is the obvious performance winner.
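For anyone who wants to try the anchored-domain case with grep alone, here is a minimal sketch. The file names domains.txt, hosts.txt, patterns.txt and filtered.txt and the sample lines are made up for illustration; they are not my real data. It escapes regex metacharacters in each domain and appends $ so that google.co no longer matches google.com. Note that this gives up -F, so I would expect it to be much slower than the fixed-string run on large files.

```shell
# Hypothetical sample data: domains to exclude, and the lines to filter.
printf '%s\n' 'google.co' > domains.txt
printf '%s\n' 'www.google.com' 'www.google.co' 'example.org' > hosts.txt

# Escape regex metacharacters in each domain, then anchor it at end of line.
# 'google.co' becomes 'google\.co$'.
sed -e 's/[][\.*^$]/\\&/g' -e 's/$/$/' domains.txt > patterns.txt

# -v keeps only the lines that match none of the anchored patterns.
grep -v -f patterns.txt hosts.txt > filtered.txt
cat filtered.txt
```

With this sample input, www.google.co is dropped while www.google.com and example.org are kept.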
You can use grep on a Linux system.
command
grep -v -f File1 File2
-v : select non-matching lines
-f : obtain PATTERN from FILE
You need to run the above command in a terminal.
output
typo artial
someline
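As a small sketch, the same command with -F added (the variant that ran in about 1.4 seconds instead of almost 29 minutes in the question's timings) can be tried on the sample data. The names File1, File2 and File3 follow the question; creating them with printf is only an assumption made here so that the example is self-contained.

```shell
# Recreate the two sample files from the question.
printf '%s\n' 'partial' 'line3' 'someline2' > File1
printf '%s\n' 'this is line3' 'this is partial' 'typo artial' 'someline2' 'someline' > File2

# -F: treat every pattern as a fixed string rather than a regular expression.
# -v: keep only the lines that match none of the patterns read with -f.
grep -v -F -f File1 File2 > File3
cat File3
```

With -F, characters such as . and * in File1 are matched literally. That is probably why the two grep runs in the question report different line counts (1947790 without -F, 1950655 with it), so it is worth deciding which behaviour you actually want before comparing timings.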
The simplest way is to build a regex pattern from all of the strings in file1.txt, and print only those lines in file2.txt that don't match the pattern.
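use strict;
use warnings 'all';

# Build one alternation pattern from every line of file1.txt.
my $re = do {
    open my $fh, '<', 'file1.txt' or die $!;
    my @data = <$fh>;
    chomp @data;
    # quotemeta escapes regex metacharacters so each string matches literally.
    my $re = join '|', map quotemeta($_), @data;
    qr/$re/;
};

# Print only the lines of file2.txt that match none of the strings.
open my $fh, '<', 'file2.txt' or die $!;
/$re/ or print while <$fh>;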
output
typo artial
someline