I'm trying to figure out how to detect extra characters inserted into a spam word, such as:
pha.rmacy or vi*agra
Any ideas?
You could use a (dis)similarity metric, such as edit distance. For instance, the edit distance between vi.agra and viagra is 1.
You then treat a given word as a match for the spam word if the edit distance between them is below a certain threshold, say 2.
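As a rough sketch of that idea, here is a plain Levenshtein edit distance in Python with a threshold check. The function names `edit_distance` and `looks_like` are made up for illustration, and the default threshold of 2 is just the example value from above, not a tuned setting.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def looks_like(word: str, spam_word: str, threshold: int = 2) -> bool:
    """Hypothetical helper: True if word is within (strictly below)
    the threshold distance of spam_word, ignoring case."""
    return edit_distance(word.lower(), spam_word.lower()) < threshold


print(edit_distance("vi.agra", "viagra"))
print(looks_like("pha.rmacy", "pharmacy"))
```

With a threshold this low, short legitimate words that differ from a spam word by a single letter would also be flagged, so the threshold probably needs adjusting to the length of the word.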
But if you really want to use a regex, you can use something like /[^a-zA-Z0-9-\s]/ to strip punctuation from the word before comparing it. That approach, however, would fail to identify something like viZagra as being the same word as viagra, because the inserted character is a letter rather than punctuation.
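A minimal sketch of the strip-then-compare approach, using Python's `re` module. The helper name `strip_punctuation` is hypothetical, and the hyphen is moved to the end of the character class, which should match the same characters as the pattern above while avoiding any ambiguity about ranges.

```python
import re

# Anything that is not a letter, digit, whitespace, or hyphen.
NOISE = re.compile(r"[^a-zA-Z0-9\s-]")


def strip_punctuation(word: str) -> str:
    """Hypothetical helper: remove punctuation-like characters."""
    return NOISE.sub("", word)


for candidate in ["pha.rmacy", "vi*agra", "viZagra"]:
    cleaned = strip_punctuation(candidate).lower()
    print(candidate, cleaned, cleaned in {"pharmacy", "viagra"})
```

The last candidate shows the limitation: the inserted Z survives the cleanup, so the cleaned word does not equal viagra.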
Regular expressions do not seem like the appropriate tool for this. But since the question is interesting, here is an attempt at answering it anyway. A simple way would be to do something like this:
/v.?i.?a.?g.?r.?a/
This allows zero or one arbitrary character between each pair of adjacent letters.
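Writing that pattern by hand for every spam word gets tedious, so here is a sketch that builds it from the word itself. The helper name `loose_pattern` is invented for this example, and the case-insensitive flag is an added assumption, not part of the pattern above.

```python
import re


def loose_pattern(word: str) -> re.Pattern:
    """Hypothetical helper: build a regex that allows zero or one
    arbitrary character between adjacent letters of word."""
    # re.escape guards against words containing regex metacharacters.
    body = ".?".join(re.escape(ch) for ch in word)
    return re.compile(body, re.IGNORECASE)


pattern = loose_pattern("viagra")
print(pattern.pattern)  # v.?i.?a.?g.?r.?a

for text in ["vi*agra", "viZagra", "viagra", "v..iagra"]:
    print(text, bool(pattern.search(text)))
```

Note that two inserted characters in a row, as in v..iagra, are not matched, and that `.` also matches a space, so ordinary text such as "via gra" would be flagged too.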