I have the following sample code, where I am trying to match all words that start and end with an underscore (either a single or a double one).
import re
test = ['abc text_ abc',
'abc _text abc',
'abc text_textUnderscored abc',
'abc :_text abc',
'abc _text_ abc',
'abc __text__ abc',
'abc _text_: abc',
'abc (-_-) abc']
test_str = ' '.join(test)
print(re.compile('(_\\w+\\b)').split(test_str))
I have already tried the regex above, but it matches too much (it should match only _text_ and __text__).
Output: ['abc text_ abc abc ', '_text', ' abc abc text', '_textUnderscored', ' abc abc :', '_text', ' abc abc ', '_text_', ' abc abc ', '__text__', ' abc abc ', '_text_', ': abc abc (-_-) abc']
Can you suggest a better approach (preferably with single regex pattern and usage of re.split method)?
The _ (underscore) character in the regular expression means that the zone name must have an underscore immediately following the alphanumeric string matched by the preceding brackets. The . (period) matches any character (a wildcard).
Inside a character class, \b represents the backspace character, for compatibility with Python's string literals. Outside a character class, \b matches the empty string at the beginning or end of a word, while \B matches the empty string only when it is not at the beginning or end of a word.
Regex doesn't treat the underscore as a special character: it matches itself literally, and it counts as a word character, so \w matches it.
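A small sketch of these points with Python's re module may help. The sample strings are made up for illustration and are not from the question.

```python
import re

# The underscore is a word character, so \w+ keeps 'text_text' in one piece.
words = re.findall(r'\w+', 'abc text_text (-_-)')
print(words)  # ['abc', 'text_text', '_']

# \b needs a word char on one side and a non-word char (or the string edge)
# on the other, so it matches before '_text' but not inside 'text_'.
boundary_starts = [m.start() for m in re.finditer(r'\b_', 'text_ _text')]
print(boundary_starts)  # [6]

# \B is the opposite: it matches only where there is no word boundary.
inner_starts = [m.start() for m in re.finditer(r'\B_', 'text_ _text')]
print(inner_starts)  # [4]

# Inside a character class, \b means the backspace character (\x08).
backspace_hits = re.findall(r'[\b]', 'a\x08b')
print(backspace_hits)  # ['\x08']
```

This is why the question's `_\w+\b` pattern also picks up `_textUnderscored`: nothing in it requires a boundary before the leading underscore.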
If you mean to match whole words that start and end with an underscore, i.e. chunks of word chars (letters, digits and underscores) that are neither preceded nor followed by another word char, of any length (even 1, a lone _), you may use
r'\b_(?:\w*_)?\b'
with re.findall. See the regex demo.
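Since the question asks for re.split, here is a minimal sketch of the same pattern used that way. It assumes the only change needed is wrapping the pattern in a capturing group so that re.split keeps the matches. The shortened `test_str` and the names `splitter`, `parts` and `matches` are illustrative.

```python
import re

# Shortened sample, not the full list from the question.
test_str = 'abc _text abc abc _text_ abc abc __text__ abc'

# The outer parentheses form a capturing group, so re.split returns
# the matched words between the surrounding pieces of text.
splitter = re.compile(r'(\b_(?:\w*_)?\b)')
parts = splitter.split(test_str)
print(parts)
# ['abc _text abc abc ', '_text_', ' abc abc ', '__text__', ' abc']

# With a single capturing group, the matches sit at the odd indices.
matches = parts[1::2]
print(matches)  # ['_text_', '__text__']
```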
If you do not want to match single-char words (i.e. _) you need to remove the optional non-capturing group, and use r'\b_\w*_\b'.
If you need to match words of at least 3 chars, also replace * (zero or more repetitions) with + (one or more repetitions): r'\b_\w+_\b'.
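To compare the three variants side by side, here is a short sketch. The `sample` string is made up so that it contains underscore words of length 1, 2, 3 and more.

```python
import re

sample = '_ __ _a_ _text_ __text__'

# Optional group: a lone underscore is accepted too.
optional_group = re.findall(r'\b_(?:\w*_)?\b', sample)
print(optional_group)  # ['_', '__', '_a_', '_text_', '__text__']

# No optional group, '*': at least two chars (both underscores required).
star = re.findall(r'\b_\w*_\b', sample)
print(star)  # ['__', '_a_', '_text_', '__text__']

# '+' instead of '*': at least one char between the two underscores.
plus = re.findall(r'\b_\w+_\b', sample)
print(plus)  # ['_a_', '_text_', '__text__']
```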
If you consider words as whole words only when they are at the start/end of the string or are preceded/followed by whitespace, replace \b...\b with (?<!\S)...(?!\S):
r'(?<!\S)_\w*_(?!\S)'
See another regex demo
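The difference between the two kinds of boundary shows up on inputs such as `_text_:` and `(-_-)`. The sketch below uses a shortened version of the question's string. The variable names are illustrative.

```python
import re

test_str = 'abc _text_ abc abc __text__ abc abc _text_: abc abc (-_-) abc'

# Word boundaries: punctuation counts as a boundary, so '_text_:' and
# the underscore inside '(-_-)' are matched as well.
word_boundary = re.findall(r'\b_(?:\w*_)?\b', test_str)
print(word_boundary)  # ['_text_', '__text__', '_text_', '_']

# Whitespace boundaries: only whitespace or the string edges may
# surround a match.
whitespace_boundary = re.findall(r'(?<!\S)_\w*_(?!\S)', test_str)
print(whitespace_boundary)  # ['_text_', '__text__']

# The same pattern with re.split; the capturing group keeps the matches.
parts = re.split(r'((?<!\S)_\w*_(?!\S))', test_str)
print(parts)
```

If only `_text_` and `__text__` should match in the question's data, the whitespace-boundary version appears to be the closer fit.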
Details
\b - a word boundary, there must be start of string or a non-word char right before
_ - an underscore
(?:\w*_)? - an optional non-capturing group matching 1 or 0 occurrences of
    \w* - 0+ word chars (letters, digits, _s)
    _ - an underscore
  (thanks to this optional group, even a lone _ word will be found)
\b - a word boundary, there must be end of string or a non-word char right after
(?<!\S) - left whitespace boundary
(?!\S) - right whitespace boundary
See the Python demo:
# test_str is the joined string from the question (requires import re)
rx = re.compile(r'\b_(?:\w*_)?\b')
print(rx.findall(test_str))
# => ['_text_', '__text__', '_text_', '_']