Regex to match words both starting and ending with underscore with Python 3

Tags: regex, python-3.x

I have the following sample code, in which I am trying to match all words that both start and end with an underscore (either a single or a double one).

import re

test = ['abc text_ abc',
        'abc _text abc',
        'abc text_textUnderscored abc',
        'abc :_text abc',
        'abc _text_ abc',
        'abc __text__ abc',
        'abc _text_: abc',
        'abc (-_-) abc']
test_str = ' '.join(test)
print(re.compile(r'(_\w+\b)').split(test_str))

I have already tried the regex above, but it matches too much (it should match only _text_ and __text__).

Output: ['abc text_ abc abc ', '_text', ' abc abc text', '_textUnderscored', ' abc abc :', '_text', ' abc abc ', '_text_', ' abc abc ', '__text__', ' abc abc ', '_text_', ': abc abc (-_-) abc']

Can you suggest a better approach (preferably with a single regex pattern and the re.split method)?

asked Mar 05 '19 20:03

azawalich

1 Answer

If you mean to match any chunks of word chars (letters, digits and underscores) that start and end with an underscore and are neither preceded nor followed by other word chars, whatever their length (even 1, _), you may use

r'\b_(?:\w*_)?\b'

with re.findall.

If you do not want to match single-char words (i.e. _), remove the optional non-capturing group and use r'\b_\w*_\b'.

If you need to match words of at least 3 chars, also replace * (zero or more repetitions) with + (one or more repetitions): r'\b_\w+_\b'.
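To make the difference between these three variants concrete, here is a small sketch run against a made-up sample string (the sample and the variable names are illustrative assumptions, not part of the question):

```python
import re

# Hypothetical sample: a lone underscore, a double underscore, and a 3-char word.
sample = '_ __ _a_'

any_length = re.compile(r'\b_(?:\w*_)?\b')  # words of 1+ chars
two_plus = re.compile(r'\b_\w*_\b')         # words of 2+ chars
three_plus = re.compile(r'\b_\w+_\b')       # words of 3+ chars

print(any_length.findall(sample))   # ['_', '__', '_a_']
print(two_plus.findall(sample))     # ['__', '_a_']
print(three_plus.findall(sample))   # ['_a_']
```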

If you consider words as whole words only when they are at the start/end of the string or are preceded/followed by whitespace, replace \b...\b with (?<!\S)...(?!\S):

r'(?<!\S)_\w*_(?!\S)'
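As a rough illustration of how the two kinds of boundaries behave differently, the sketch below runs both patterns on a made-up string where punctuation touches some of the candidates (the sample and names are assumptions for the demo):

```python
import re

# Hypothetical sample: punctuation touches two of the three candidates.
sample = 'a _x_: b (_y_) c _z_'

word_bounded = re.compile(r'\b_\w*_\b')
space_bounded = re.compile(r'(?<!\S)_\w*_(?!\S)')

# Word boundaries accept adjacent punctuation such as ':' or '('.
print(word_bounded.findall(sample))    # ['_x_', '_y_', '_z_']

# Whitespace boundaries only accept whitespace or the string start/end.
print(space_bounded.findall(sample))   # ['_z_']
```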

Details

\b - a word boundary, there must be start of string or a non-word char right before
_ - an underscore
(?:\w*_)? - an optional non-capturing group matching 1 or 0 occurrences of
- \w* - 0+ word chars (letters, digits, _s) (thanks to this optional group, even _ word will be found)
- _ - an underscore
\b - a word boundary, there must be end of string or a non-word char right after
(?<!\S) - left whitespace boundary
(?!\S) - right whitespace boundary
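
The same breakdown can be kept next to the pattern itself with re.VERBOSE. This is only a sketch of one way to document the regex; the sample string and variable names are made up for the comparison:

```python
import re

# Same pattern as r'\b_(?:\w*_)?\b', spelled out with re.VERBOSE so that
# each piece carries its own comment (whitespace in the pattern is ignored).
rx_verbose = re.compile(r"""
    \b          # word boundary: start of string or a non-word char before
    _           # leading underscore
    (?:         # optional group, so a lone underscore still matches
        \w*     #   zero or more word chars (letters, digits, underscores)
        _       #   trailing underscore
    )?
    \b          # word boundary: end of string or a non-word char after
    """, re.VERBOSE)

rx_compact = re.compile(r'\b_(?:\w*_)?\b')

sample = 'abc _text_ abc __text__ abc (-_-) abc'  # hypothetical sample
print(rx_verbose.findall(sample) == rx_compact.findall(sample))  # True
```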

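Python demo (using test_str from the question):

rx = re.compile(r'\b_(?:\w*_)?\b')
print(rx.findall(test_str))
# => ['_text_', '__text__', '_text_', '_']
rx_ws = re.compile(r'(?<!\S)_\w*_(?!\S)')
print(rx_ws.findall(test_str))
# => ['_text_', '__text__']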
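
Since the question prefers re.split, here is a minimal sketch of that route. It assumes the whitespace-boundary pattern is the one wanted and wraps the word in a capturing group so that re.split keeps the matches in its result. The shortened sample string is made up for the demo:

```python
import re

# Hypothetical, shortened sample in the spirit of test_str.
sample = 'abc _text_ abc __text__ abc _text_: abc'

# The capturing group makes re.split keep the matched words in the result.
parts = re.split(r'(?<!\S)(_\w*_)(?!\S)', sample)
print(parts)
# ['abc ', '_text_', ' abc ', '__text__', ' abc _text_: abc']

# With a single capturing group, the matches sit at the odd indices.
print(parts[1::2])   # ['_text_', '__text__']
```

If only the matches are needed, re.findall is the simpler choice. re.split is mainly useful when the surrounding text has to be kept as well.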

answered Oct 03 '22 13:10

Wiktor Stribiżew
