I am aware that the definition of a word boundary is (?<!\w)(?=\w)|(?<=\w)(?!\w)
and I would like to (optionally) include the underscore in that definition as well.
One way of doing it is to modify the definition directly,
so the new one would be (_)?((?<!\w)(?=\w)|(?<=\w)(?!\w)),
but I would rather not use such a long expression.
An easier approach would be this:
if I could write a word boundary inside a character class, then adding an underscore would be very easy, e.g. [\b_]. The problem is that \b inside a character class, i.e. [\b], means the backspace character, not a word boundary.
Please tell me the solution, i.e. how to put \b inside a character class without losing its original meaning.
You may use lookarounds:
(?:\b|(?<=_))word(?=\b|_)
^^^^^^^^^^^^^    ^^^^^^^^
Here, (?:\b|(?<=_)) is a non-capturing group that matches either a word boundary or a position preceded by _, and (?=\b|_) is a positive lookahead that matches either a word boundary or a position followed by _.
Unfortunately, Python's re module does not allow (?<=\b|_), because a lookbehind pattern must be of fixed width (otherwise you get a "look-behind requires fixed-width pattern" error). That is why the alternation is placed outside the lookbehind.
A Python demo:
import re
rx = r"(?:\b|(?<=_))word(?=\b|_)"
s = "some_word_here and a word there"
print(re.findall(rx, s))  # ['word', 'word']
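To make the fixed-width limitation concrete, here is a small sketch that tries to compile the rejected lookbehind and then falls back to the working pattern. The sample string with "swordfish" and the variable names are only illustrative assumptions, and the behaviour shown is that of the standard-library re module.

```python
import re

# The two alternatives have different widths (\b is zero-width, _ is one
# character), so the standard-library re module rejects this lookbehind
# when the pattern is compiled.
try:
    re.compile(r"(?<=\b|_)word")
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("re.error:", exc)

# Moving the alternation outside the lookbehind keeps every lookbehind
# branch fixed-width, so this pattern compiles.
rx = re.compile(r"(?:\b|(?<=_))word(?=\b|_)")
matches = rx.findall("some_word_here and a word there, but not swordfish")
print(matches)
```

The "word" inside "swordfish" is skipped because it is preceded by a letter, which is neither a word boundary nor an underscore.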
An alternative solution is to use custom word boundaries, (?<![^\W_]) as the leading one and (?![^\W_]) as the trailing one:
rx = r"(?<![^\W_])word(?![^\W_])"
The (?<![^\W_]) negative lookbehind fails the match if the character immediately before the search word is a letter or digit, i.e. a word character other than _ (so it requires the start of the string, a non-word character, or _ before the word). Likewise, the (?![^\W_]) negative lookahead fails the match if the character immediately after the search word is a letter or digit (so it requires the end of the string, a non-word character, or _ after the word).
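As a rough illustration of how these custom boundaries behave, the sketch below wraps an arbitrary literal in them and checks a few inputs. The helper name custom_boundary_pattern and the sample strings are hypothetical, chosen only for this example.

```python
import re

def custom_boundary_pattern(word):
    # Hypothetical helper: wrap a literal word in the custom boundaries.
    # [^\W_] matches a letter or digit (a word character other than "_"),
    # so the negative lookarounds forbid a letter or digit on either side.
    return re.compile(r"(?<![^\W_])" + re.escape(word) + r"(?![^\W_])")

rx = custom_boundary_pattern("word")
samples = ["some_word_here", "a word there", "word", "swordfish", "word2", "_word_"]

# Map each sample to whether the pattern finds the word in it.
results = {s: bool(rx.search(s)) for s in samples}
for s, hit in results.items():
    print(f"{s!r}: {hit}")
```

Only "swordfish" and "word2" are rejected, since in both a letter or digit touches the search word; underscores, spaces, and the string edges all count as boundaries. re.escape is used so that a search word containing regex metacharacters is still treated literally.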