
Fuzzy string matching and regex

I have a vector of sentences such as:

example <- c("text text word1 text text word2 text text", ...)

and I'm trying to identify which sentences comply with the following rules:

  • the sentence contains both "word1" and "word2"
  • "word1" comes before "word2"
  • there are between zero and three words between "word1" and "word2"

This could be done with a normal regex. The problem is that "word1" and "word2" can contain typos (I expect a combined edit distance of at most 3 across both words). Examples of typos are "wrod1", "woord2", "wrd1", etc. I also want to match sentences in which these words contain typos within that distance constraint, so I was trying to use agrepl:

agrepl("(?:.*?)\\bword1\\b(?:\\s(?:\\w+\\s){0,3})\\bword2\\b(?:.*?)", example, fixed=FALSE, max=3)

However, I believe that the distance is being calculated with the whole sentence and not only with "word1" and "word2", and therefore I will almost never get any matches in this way. Any suggestions on how to fix this, or is agrepl/regex not the best tool for this problem?
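One possible way to confine the edit distance to the two target words is to split each sentence into tokens and compare every token against "word1" and "word2" with adist() from the utils package. The following is only a sketch under assumptions: the function name fuzzy_pair_match and its parameters are hypothetical, tokens are taken to be whitespace-separated, and "distance" is taken to mean plain Levenshtein distance summed over both words.

```r
# Sketch: TRUE if some token close to w1 is followed, with at most max_gap
# tokens in between, by a token close to w2, where the two edit distances
# added together do not exceed max_dist.
# fuzzy_pair_match and its arguments are hypothetical names for illustration.
fuzzy_pair_match <- function(sentence, w1 = "word1", w2 = "word2",
                             max_gap = 3, max_dist = 3) {
  tokens <- strsplit(trimws(sentence), "\\s+")[[1]]
  n <- length(tokens)
  if (n < 2) return(FALSE)
  d1 <- drop(adist(tokens, w1))  # Levenshtein distance of each token to w1
  d2 <- drop(adist(tokens, w2))  # Levenshtein distance of each token to w2
  for (i in seq_len(n - 1)) {
    # candidate positions for w2: 1 to max_gap + 1 tokens after position i
    js <- (i + 1):min(n, i + 1 + max_gap)
    if (any(d1[i] + d2[js] <= max_dist)) return(TRUE)
  }
  FALSE
}

# Made-up sentences for illustration
example <- c("text wrod1 text woord2 text",   # typos within the budget
             "text word1 a b c d word2",      # four words in between
             "wrd text wrod2")                # typos exceed the budget
vapply(example, fuzzy_pair_match, logical(1), USE.NAMES = FALSE)
# TRUE FALSE FALSE
```

Note that the placeholders "word1" and "word2" differ by a single character, so a token can count as a near match for either of them; target words that differ more from each other should be less prone to that confusion.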

asked Jan 18 '26 12:01 by drgxfs


1 Answer

This fits your rules; however, I'm not sure what your typos would look like. If you could show some examples, that would be great.

^(?=.*word1\s+(?:\S+\s+){0,3}word2.*$).*
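To try this pattern from R, a minimal sketch might look like the following. The sentences in example are made up for illustration. Backslashes have to be doubled inside an R string literal, and perl = TRUE is needed because R's default regex engine does not support lookahead.

```r
# Made-up test sentences for illustration
example <- c(
  "text text word1 text text word2 text text",  # two words in between
  "text word1 word2 text",                       # zero words in between
  "word1 a b c d word2",                         # four words in between
  "text word2 text word1 text"                   # wrong order
)

# Same pattern as above, with backslashes doubled for the R string literal
pattern <- "^(?=.*word1\\s+(?:\\S+\\s+){0,3}word2.*$).*"

matches <- grepl(pattern, example, perl = TRUE)
matches
# TRUE TRUE FALSE FALSE
```

As written, the pattern has no word boundaries, so it would also accept something like "sword1"; wrapping the two words in \\b may be worth considering. It only handles exact spellings and does not cover the typo requirement.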

answered Jan 21 '26 02:01 by zolo


