 

Why can't I match the last part of my regular expression in python?

Tags:

python

regex

I want to match a sentence with an optional ending 'other (\w+)'. For example, the regular expression should match both of the following sentences and extract the word 'things' from the first:

  • The apple and other things.
  • The apple is big.

I wrote the regular expression below, but the result is (None,). If I remove the final ?, I get the right answer. Why?

>>> re.search('\w+(?: other (\\w+))?', 'A and other things').groups()
(None,)
>>> re.search('\w+(?: other (\\w+))', 'A and other things').groups()
('things',)
Yyao asked Dec 27 '25 23:12



2 Answers

If you use:

re.search(r'\w+(?: other (\w+))?', 'A and other things').group()

You will see what is happening. Since everything after \w+ is optional, the search matches just the first word, A.

As per official documentation:

.groups()

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern.

In your search call the optional capturing group did not take part in the match, so its entry in the tuple is None:

re.search(r'\w+(?: other (\w+))?', 'A and other things').groups()
(None,)
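
As a quick check (a minimal sketch; the variable names are only for illustration), you can look at the whole match, its position, and the groups side by side:

```python
import re

pattern = r'\w+(?: other (\w+))?'
m = re.search(pattern, 'A and other things')

whole = m.group()      # text of the entire match
span = m.span()        # (start, end) of the match in the input
captured = m.groups()  # group 1 did not participate, so it is None

print(whole, span, captured)  # A (0, 1) (None,)
```

The match ends after the first character, so the optional part was never reached.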

To solve your problem, you can use this alternation-based regex, which requires the word to be followed either by ' other (\w+)' or by the end of the string:

r'\w+(?: other (\w+)|$)'

Examples:

>>> re.search(r'\w+(?: other (\w+)|$)', 'A and other things').group()
'and other things'
>>> re.search(r'\w+(?: other (\w+)|$)', 'The apple is big').group()
'big'
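
If you only need the captured word, one way to wrap this is a small helper. This is a sketch: the name extract_other is made up for illustration, and the pattern is the one shown above.

```python
import re

# Pattern from this answer; the helper name is hypothetical.
PATTERN = re.compile(r'\w+(?: other (\w+)|$)')

def extract_other(sentence):
    """Return the word after ' other ', or None if there is no such ending."""
    m = PATTERN.search(sentence)
    # group(1) is None when the '$' branch matched instead of ' other (\w+)'
    return m.group(1) if m else None

print(extract_other('A and other things'))  # things
print(extract_other('The apple is big'))    # None
```

Note that with a trailing period, as in 'The apple is big.', the $ branch cannot match right after a word, so search() finds nothing and the helper still returns None.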
anubhava answered Dec 30 '25 17:12



The rule for regular expression searches in Python is that they produce the leftmost match. Yes, greedy quantifiers try to make the match longer if possible, but most importantly, as soon as a match succeeds at some starting position, the search stops and does not look at later positions.

In the first regular expression, the leftmost point where \w+ matches is A. The optional portion doesn't match there, but because it is optional the match still succeeds with just A, so it's done.

In the second regular expression, the parenthesized expression is mandatory, so A alone is not a match. Therefore, it continues looking. The \w+ then matches and, and the second \w+ matches things.
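
To see that the longer match is still there, just not first, here is a small sketch using re.finditer, which keeps scanning after each match instead of stopping at the first one:

```python
import re

pattern = re.compile(r'\w+(?: other (\w+))?')

# Collect (whole match, group 1) for every successive match in the string.
matches = [(m.group(), m.group(1)) for m in pattern.finditer('A and other things')]

print(matches)  # [('A', None), ('and other things', 'things')]
```

re.search only ever returns the first of these, which is why the optional version gives you None for the group.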


Note that for regular expressions in Python, especially those containing backslashes, it's a good idea to write them using r'raw strings'.
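
A short illustration of why this matters (a sketch; the \b example is mine, not from the question):

```python
import re

# '\\w+' in a normal string and r'\w+' in a raw string are the same pattern.
same = ('\\w+' == r'\w+')

# Without the r prefix, '\b' is a backspace character, not a word boundary.
plain = re.search('\bother\b', 'and other things')
raw = re.search(r'\bother\b', 'and other things')

print(same)         # True
print(plain)        # None
print(raw.group())  # other
```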

200_success answered Dec 30 '25 16:12



