I have a list of strings I want to find within a file. This would be fairly simple if the strings in my list and in the file matched exactly. Unfortunately, the names contain typos and other variations. Here's an example of how some of these strings differ:
List        File
B-Arrestin  Beta-Arrestin
Becn-1      BECN 1
CRM-E4      CRME4
Note that each of those pairs should count as a match despite being different strings. I know that I could categorize every kind of variation and write a separate regex for each, but that is cumbersome enough that I might be better off looking for matches manually. I think the best solution for my problem would be some kind of expression that says:
"Match this string exactly but still count it as a match if there are X characters that do not match"
Does something like this exist? Is there another way to match strings that are not exactly the same but close?
As 200_success pointed out, you can do fuzzy matching with Text::Fuzzy, which computes the Levenshtein distance between bits of text. You will have to play with what maximum Levenshtein distance you want to allow, but if you do a case-insensitive comparison, the maximum distance in your sample data is three:
use strict;
use warnings;
use 5.010;
use Text::Fuzzy;
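my $max_dist = 3;    # largest edit distance still counted as a match
while (<DATA>) {
    chomp;
    # Split on the first run of whitespace only, so 'BECN 1' stays intact
    my ($string1, $string2) = split ' ', $_, 2;
    # Lowercase both sides to make the comparison case-insensitive
    my $tf = Text::Fuzzy->new(lc $string1);
    say "'$string1' matches '$string2'" if $tf->distance(lc $string2) <= $max_dist;
}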
__DATA__
B-Arrestin Beta-Arrestin
Becn-1 BECN 1
CRM-E4 CRME4
Output:
'B-Arrestin' matches 'Beta-Arrestin'
'Becn-1' matches 'BECN 1'
'CRM-E4' matches 'CRME4'
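If it helps to see what the distance actually measures, here is a rough pure-Perl sketch of plain Levenshtein distance, applied to the case in the question: a list of names checked against names read from a file. This is not how Text::Fuzzy is implemented, just the textbook algorithm. The names levenshtein and fuzzy_matches are made up for illustration, and the 'Actin' entry is an invented non-match added to show that unrelated names are rejected.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Textbook Levenshtein distance: the minimum number of single-character
# insertions, deletions and substitutions needed to turn $s into $t.
# Only two rows of the dynamic-programming table are kept at a time.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);    # distance from '' to each prefix of $t
    for my $i (1 .. scalar @s) {
        my @cur = ($i);             # distance from a prefix of $s to ''
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # delete a character from $s
                $cur[$j - 1] + 1,         # insert a character into $s
                $prev[$j - 1] + $cost,    # substitute, or free if equal
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: for each name, collect every candidate that is
# within $max_dist edits, comparing case-insensitively.
sub fuzzy_matches {
    my ($max_dist, $names, $candidates) = @_;
    my %found;
    for my $name (@$names) {
        $found{$name} = [
            grep { levenshtein(lc $name, lc $_) <= $max_dist } @$candidates
        ];
    }
    return \%found;
}

my @list = ('B-Arrestin', 'Becn-1', 'CRM-E4');
my @file = ('Beta-Arrestin', 'BECN 1', 'CRME4', 'Actin');    # 'Actin' is an invented non-match

my $found = fuzzy_matches(3, \@list, \@file);
for my $name (@list) {
    print "$name => ", join(', ', @{ $found->{$name} }), "\n";
}
```

Keep in mind that a fixed threshold of three is fairly loose for short names: three edits is half of a six-character string like 'Becn-1'. On real data you may want to scale the allowed distance with the length of the string.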