I'm trying to use Python's re module to capture specific digits from strings, like the '03' in ' video [720P] [DHR] _sp03.mp4 '.
What confuses me is this:
when I use '.*\D+(\d+).*mp4', it succeeds in capturing both digits, 03,
but when I use '.*\D*(\d+).*mp4', it captures only the last digit, 3.
I know Python's quantifiers are greedy by default, which means they try to match as much text as possible. Given that, I would expect * and + after the \D to behave the same way. So where am I wrong? What leads to this difference? Can anyone help explain it?
BTW: I used an online regex tester for Python: https://regex101.com/#python
What makes the difference is not the \D+ by itself but the first .*
In regex, .* is greedy: it first grabs as many characters as it can and gives characters back only as far as needed for the rest of the pattern to match.
So when you write
.*\D*(\d+).*mp4
The .* will match as much as it can. Broken down step by step, the match looks like this:
video [720P] [DHR] _sp03.mp4
|
.*
video [720P] [DHR] _sp03.mp4
 |
 .*
.....
video [720P] [DHR] _sp03.mp4
                      |
                      .*   That is, the 0 is also matched by the .*
video [720P] [DHR] _sp03.mp4
                       |
                       \D*   Since the quantifier is zero or more, it matches nothing here and consumes no characters before the 3
video [720P] [DHR] _sp03.mp4
                       |
                       (\d+)
video [720P] [DHR] _sp03.mp4
                        |
                        .*
video [720P] [DHR] _sp03.mp4
                         |
                         mp4
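The walkthrough above can be checked with a small sketch that wraps every piece of the pattern in its own capture group and prints the span each one covered. The extra groups and the variable names are assumptions added for illustration; the sample string is used here without the surrounding spaces.

```python
import re

text = "video [720P] [DHR] _sp03.mp4"

# Same pattern as above, with each piece in its own group so its span is visible.
m = re.match(r"(.*)(\D*)(\d+)(.*)mp4", text)

for i, piece in enumerate([".*", r"\D*", r"\d+", ".*"], start=1):
    # span(i) is the (start, end) index pair that group i covered
    print(piece, m.span(i), repr(m.group(i)))
```

An empty span for the \D* group (start equal to end) is what "matches nothing" means in practice.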
Now when we use \D+, the matching changes a bit, because the regex engine is forced to match at least one non-digit (\D+) before the digits (\d+). This consumes the p, which is the last non-digit before the digits.
That is,
.* will match as much as it can up to and including the s, so that \D+ can match at least one non-digit (the p) and \d+ can then match the 03 part.
video [720P] [DHR] _sp03.mp4
|
.*
video [720P] [DHR] _sp03.mp4
 |
 .*
.....
video [720P] [DHR] _sp03.mp4
                     |
                     \D+   The p, a non-digit. Forced to match at least once.
video [720P] [DHR] _sp03.mp4
                      |
                      (\d+)
video [720P] [DHR] _sp03.mp4
                       |
                       (\d+)
video [720P] [DHR] _sp03.mp4
                        |
                        .*
video [720P] [DHR] _sp03.mp4
                         |
                         mp4
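To see why the leading .* ends up in a different place for the two patterns, here is a rough simulation of its greedy behaviour: try the longest possible prefix first, then give back one character at a time until the rest of the pattern matches. The helper name first_split is hypothetical, and this is only a sketch of the idea, not how the re module is actually implemented.

```python
import re

text = "video [720P] [DHR] _sp03.mp4"

def first_split(rest_pattern, text):
    """Imitate a leading greedy '.*' by trying the longest prefix first."""
    rest = re.compile(rest_pattern)
    for end in range(len(text), -1, -1):
        # Does the rest of the pattern match right after this prefix?
        m = rest.match(text, end)
        if m:
            return text[:end], m.group(1)
    return None

# (prefix taken by '.*', digits captured)
star_result = first_split(r"\D*(\d+).*mp4", text)
plus_result = first_split(r"\D+(\d+).*mp4", text)
print(star_result)
print(plus_result)
```

With \D*, the first prefix that works already ends after the 0, because \D* is happy to match nothing. With \D+, that prefix fails, so the search keeps giving characters back until a non-digit is available.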
The problem is with \D*. '+' means one or more, and '*' means zero or more.
Because the pattern starts with '.*', which is greedy, it takes everything up to ' video [720P] [DHR] _sp0', whereas in the '\D+' case it stops at ' video [720P] [DHR] _s', leaving 'p' for \D+.
>>> import re
>>> a = " video [720P] [DHR] _sp03.mp4 "
>>> # raw strings keep \D and \d from being treated as string escapes
>>> p1 = re.compile(r'.*\D+(\d+).*mp4')
>>> p2 = re.compile(r'.*\D*(\d+).*mp4')
>>> p1.findall(a)
['03']
>>> p2.findall(a)
['3']
>>> a
' video [720P] [DHR] _sp03.mp4 '
>>> # put every piece in a group to see what each part consumed
>>> p3 = re.compile(r'(.*)(\D*)(\d+)(.*)mp4')
>>> p3.findall(a)
[(' video [720P] [DHR] _sp0', '', '3', '.')]
>>> p4 = re.compile(r'(.*)(\D+)(\d+)(.*)mp4')
>>> p4.findall(a)
[(' video [720P] [DHR] _s', 'p', '03', '.')]
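The "zero or more" versus "one or more" distinction can also be checked in isolation, away from the leading .*. This is a minimal sketch; the variable names are only for illustration.

```python
import re

# Starting at a digit, \D* succeeds with an empty match, while \D+ fails.
star_on_digit = re.match(r"\D*", "03")
plus_on_digit = re.match(r"\D+", "03")
print(repr(star_on_digit.group()))  # ''
print(plus_on_digit)                # None

# Starting at a non-digit, both consume the p.
star_on_p = re.match(r"\D*", "p03")
plus_on_p = re.match(r"\D+", "p03")
print(star_on_p.group(), plus_on_p.group())  # p p
```

Because \D* can always succeed without consuming anything, it never forces the greedy .* in front of it to give up the 0.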