
Regex: including special characters in a pattern for re.finditer

I'm trying to get the start and end index of a word inside a string using re.finditer. For most words my pattern works fine, but for a word containing special characters my regex raises an error.

Problem:

I tried:

import re
a = " we have c++ and c#"
pattern = ['c#', 'c++']
regex = re.compile(r'\b(' + '|'.join(pattern) + r')\b')
out = [(m.start(0), m.end(0)) for m in regex.finditer(a)]

Current Output:

error: multiple repeat at position x

Expected Output :

[(9,12),(17,19)]

My pattern works fine in most cases, but I have a problem with words that contain special characters. I'm not very familiar with regex; can anyone please help? Thanks!

sai kumar asked Dec 07 '25 05:12



1 Answer

Code:

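import re
a = " we have c++ and c#"
# Escape each word; require a word boundary before it and whitespace or end of string after it
pattern = [r'\b{}(?=\s|$)'.format(re.escape(s)) for s in ['c#', 'c++']]
regex = re.compile('|'.join(pattern))
[(m.start(0), m.end(0)) for m in regex.finditer(a)]
# [(9, 12), (17, 19)]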

Details:

The first problem is special characters: + is a regex quantifier, so c++ is not read as literal text. You can escape the special characters manually

['c\\#', 'c\\+\\+']

or, to simplify, you can use re.escape, which does that work for you

[re.escape(s) for s in ['c#', 'c++']]
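As a quick check of what escaping changes, here is a minimal sketch using the strings from the question; the variable names are illustrative only. The unescaped pattern either raises the "multiple repeat" error or, on newer Python versions (3.11 and later, where `++` is parsed as a possessive quantifier), compiles to something that does not mean the literal text c++.

```python
import re

words = ['c#', 'c++']
text = ' we have c++ and c#'

# re.escape() backslash-escapes regex metacharacters so they match literally.
escaped = [re.escape(w) for w in words]
regex = re.compile('|'.join(escaped))
found = regex.findall(text)
print(found)  # ['c++', 'c#']

# Unescaped, '+' is a quantifier. Depending on the Python version this
# either raises re.error ("multiple repeat") or compiles to a pattern
# that does not match the literal text 'c++'.
try:
    unescaped_found = re.compile('|'.join(words)).findall(text)
except re.error as exc:
    unescaped_found = None
    print('unescaped pattern failed to compile:', exc)
```

Either way, only the escaped pattern finds both words as written.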

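The second problem is word boundaries: \b does not behave the same way next to special characters as it does next to alphanumeric characters (e.g. \bfoo\b). Because + and # are non-word characters, a trailing \b after c++ or c# matches only if a word character follows, so it fails before whitespace or at the end of the string.

To quote from the Python docs: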

\b word boundary

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
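The behaviour described in the quote can be seen directly. This is a small sketch with the question's string; the second input, 'c++x', is made up purely to show when the trailing \b does succeed.

```python
import re

text = ' we have c++ and c#'

# The leading \b works: it sits between the space and the word character 'c'.
# The trailing \b fails: '+' and the following ' ' are both non-word
# characters, so there is no word/non-word transition at that position.
with_trailing_b = re.search(r'\bc\+\+\b', text)
print(with_trailing_b)  # None

# The same trailing \b does match when a word character follows the '+'.
glued = re.search(r'\bc\+\+\b', 'c++x')
print(glued.span())  # (0, 3)
```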

To make this work, you can use a positive lookahead assertion in place of the trailing \b

r'\b{}(?=\s|$)'

It requires a whitespace character (\s) or the end of the string ($) immediately after your pattern, without including that character in the match
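Putting the two fixes together, the approach can be wrapped in a small function. This is only a sketch: `find_spans` is a hypothetical helper name, not something from the `re` module, and the second call uses a made-up sentence to show a limitation of this particular lookahead.

```python
import re

def find_spans(text, words):
    """Return (start, end) spans of each word in text (hypothetical helper)."""
    # Escape each word, require a word boundary before it and
    # whitespace or end of string after it.
    alternatives = [r'\b{}(?=\s|$)'.format(re.escape(w)) for w in words]
    regex = re.compile('|'.join(alternatives))
    return [m.span() for m in regex.finditer(text)]

spans = find_spans(' we have c++ and c#', ['c#', 'c++'])
print(spans)  # [(9, 12), (17, 19)]

# Limitation: punctuation directly after the word blocks the match,
# because the lookahead only accepts whitespace or end of string.
with_punctuation = find_spans('I like c++, and c#.', ['c#', 'c++'])
print(with_punctuation)  # []
```

If trailing punctuation should be allowed, a negative lookahead such as (?!\w) is a common alternative; that is a different technique from the one in this answer.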

Anurag Wagh answered Dec 08 '25 18:12



