Why does re.sub('.*?', '-', 'abc') return '-a-b-c-' instead of '-------'?

Tags: python, regex

This is the result from Python 2.7.

>>> re.sub('.*?', '-', 'abc')
'-a-b-c-'

I expected the result to be as follows.

>>> re.sub('.*?', '-', 'abc')
'-------'

But it's not. Why?

asked Dec 04 '25 17:12 by Daniel


1 Answer

The best explanation of this behaviour I know of comes from the documentation of the regex package on PyPI, which is intended to eventually replace re (although that has been the plan for a long time now).

Sometimes it’s not clear how zero-width matches should be handled. For example, should .* match 0 characters directly after matching >0 characters?

Most regex implementations follow the lead of Perl (PCRE), but the re module sometimes doesn’t. The Perl behaviour appears to be the most common (and the re module is sometimes definitely wrong), so in version 1 the regex module follows the Perl behaviour, whereas in version 0 it follows the legacy re behaviour.

Examples:

>>> import regex
# Version 0 behaviour (like re)
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'

# Version 1 behaviour (like Perl)
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'

(?V0) and (?V1) set the version flag inside the pattern. The second version 1 example is what you expected, and it is reportedly what PCRE does. Python's re is somewhat nonstandard here, and has probably been kept that way solely for backwards compatibility. I've found an example of something similar (with re.split).
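
To see where the matches actually fall with the standard re module, here is a minimal sketch that lists the span of every match alongside the result of the substitution. It uses only `re.compile`, `finditer` and `sub`; the variable names are just for illustration. Note that the re documentation states that since Python 3.7 empty matches adjacent to a previous non-empty match are replaced too, so the output depends on the interpreter version.

```python
import re
import sys

pattern = re.compile(r'.*?')
text = 'abc'

# (start, end) of every match the engine reports, in order.
# A span with start == end is a zero-width (empty) match.
spans = [m.span() for m in pattern.finditer(text)]

# sub() replaces each reported match with one '-'; characters that no
# match covers are copied through unchanged.
result = pattern.sub('-', text)

print(sys.version_info[:2], spans, repr(result))
```

On Python 2.7 this should report only the four empty matches (one before each character and one at the end), which is why 'a', 'b' and 'c' survive. On Python 3.7 or later each empty match should be followed by a one-character match, giving seven replacements and the '-------' the question expected.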

answered Dec 06 '25 09:12 by Veedrac