Fuzzy string matching and regex

Question

I have a vector of sentences such as:

example <- c("text text word1 text text word2 text text", ...)

and I'm trying to identify which sentences comply with the following rules:

the sentence contains both "word1" and "word2"
"word1" comes before "word2"
there are between zero and three words between "word1" and "word2"
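Ignoring typos for the moment, the three rules above can be sketched as a single regex in R. This is only an illustration: the `sentences` vector is hypothetical sample data, and the pattern assumes words are separated by whitespace and consist of `\w` characters.

```r
# Hypothetical sample sentences, for illustration only.
sentences <- c(
  "text text word1 text text word2 text text",  # two words in between
  "text word2 text word1 text",                 # wrong order
  "word1 a b c d word2"                         # four words in between
)

# word1, then zero to three intervening words, then word2.
pattern <- "\\bword1\\b(?:\\s+\\w+){0,3}\\s+word2\\b"

exact_hits <- grepl(pattern, sentences, perl = TRUE)
print(exact_hits)
```

Only the first sentence satisfies all three rules here; a misspelled "wrod1" would not be caught by this pattern.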

This could be done with a normal regex. However, the problem is that "word1" or "word2" can contain typos (I am expecting at most a distance of 3 for both words together). Examples of typos could be "wrod1", "woord2", "wrd1", etc. I also want to match the sentences that contain typos for these words within the distance constraint. Therefore I was trying to use agrepl:

agrepl("(?:.*?)\\bword1\\b(?:\\s(?:\\w+\\s){0,3})\\bword2\\b(?:.*?)", example, fixed = FALSE, max.distance = 3)

However, I believe that the distance is being calculated with the whole sentence and not only with "word1" and "word2", and therefore I will almost never get any matches in this way. Any suggestions on how to fix this, or is agrepl/regex not the best tool for this problem?

zolo · Accepted Answer

This fits your rules; however, I'm not sure what your typos would look like. If you could show some examples, that would be great.

^(?=.*word1\s+(?:\S+\s+){0,3}word2.*$).*
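The regex above handles the ordering and gap rules but not the typos. One possible way to cover them, sketched here under stated assumptions rather than as a definitive solution, is to split each sentence into words and compare every word to the two targets with `utils::adist` (Levenshtein distance), so the distance is measured per word instead of over the whole sentence. The helper name `fuzzy_pair_match` and its parameters are hypothetical. The targets "apple" and "banana" are stand-ins: the placeholders "word1" and "word2" are themselves only one edit apart, so each would count as a typo of the other.

```r
# Hypothetical helper: TRUE when a sentence has a near-match of `w1` followed,
# with at most `max_gap` words in between, by a near-match of `w2`, and the
# two edit distances sum to at most `max_dist`.
fuzzy_pair_match <- function(sentence, w1, w2, max_gap = 3, max_dist = 3) {
  # Assumption: words are separated by whitespace; punctuation is not stripped.
  tokens <- strsplit(trimws(sentence), "\\s+")[[1]]
  n <- length(tokens)
  if (n < 2) return(FALSE)

  # Levenshtein distance from every token to each target word.
  d1 <- as.vector(utils::adist(tokens, w1))
  d2 <- as.vector(utils::adist(tokens, w2))

  for (i in seq_len(n - 1)) {
    if (d1[i] > max_dist) next
    # Candidate positions for w2: directly after i, up to max_gap words later.
    js <- (i + 1):min(n, i + 1 + max_gap)
    if (any(d1[i] + d2[js] <= max_dist)) return(TRUE)
  }
  FALSE
}

# Hypothetical sample sentences, for illustration only.
examples <- c(
  "text text apple text text banana text text",  # exact match
  "text aple text bananna text",                 # one typo in each word
  "text banana text apple text",                 # wrong order
  "aple a b c d banana",                         # four words in between
  "appl bnn"                                     # combined distance of 4
)

result <- vapply(examples, fuzzy_pair_match, logical(1),
                 w1 = "apple", w2 = "banana", USE.NAMES = FALSE)
print(result)
```

Note that plain Levenshtein distance counts a transposition such as "wrod1" as two edits, so the budget of 3 is used up faster by swapped letters than by a single missing or extra letter.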

Tags: string-matching, regex, r, fuzzy-comparison

drgxfs
