
How to stop python Regular Expression being too greedy

Tags: python, regex

I'm trying to match (in Python) the show name and season/episode numbers from TV episode filenames in the format:

Show.One.S01E05.720p.HDTV.x264-CTU.mkv

and

Show.Two.S08E02.HDTV.XviD-LOL.avi

My regular expression:

(?P<show>[\w\s.,_-]+)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})

matches correctly on Show Two, giving me Show Two, 08 and 02. However, the 720 in Show One means I get back 7 and 20 for season/episode.

If I remove the ? after [XxEe] then it matches both types but I want that range to be optional for filenames where the episode identifier isn't included.

I've tried using ?? to stop the [XxEe] match being greedy, as described in the re module section of the Python docs, but this has no effect.

How can I capture the series name section and the season/episode section while ignoring the rest of the string?

ghickman asked Sep 03 '25 14:09


2 Answers

Make the first group non-greedy:

 import re
 p = re.compile(r'(?P<show>[\w\s.,_-]+?)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})')
 print(p.findall("Game.of.Thrones.S01E05.720p.HDTV.x264-CTU.mkv"))
 # [('Game.of.Thrones', '01', '05')]
 print(p.findall("Entourage.S08E02.HDTV.XviD-LOL.avi"))
 # [('Entourage', '08', '02')]

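Note the ? following the + in the first group.

Explanation:

The first group is greedy, so it consumes as much of the filename as it can and only backtracks to the last place where the rest of the pattern can still match, which is .720. Making it non-greedy (+?) has it stop at the first place where the rest of the pattern matches, which is .S01E05. (Not a really nice example, by the way; I would have changed the names, as they definitely sound a bit too Warezzz-y to be honest ;-) )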
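
To make the difference concrete, here is a minimal sketch that runs the question's original (greedy) pattern and the non-greedy one against the first filename from the question. The names TAIL, GREEDY, LAZY and the helper parts are made up for this illustration, and [\d] is written as the equivalent \d.

```python
import re

# Shared tail of both patterns: ".", optional S, season, optional X/E, episode.
TAIL = r"\.[Ss]?(?P<season>\d{1,2})[XxEe]?(?P<episode>\d{2})"

# Only the quantifier on the show group differs: + (greedy) vs +? (non-greedy).
GREEDY = re.compile(r"(?P<show>[\w\s.,_-]+)" + TAIL)
LAZY = re.compile(r"(?P<show>[\w\s.,_-]+?)" + TAIL)


def parts(pattern, filename):
    """Return (show, season, episode) or None if the pattern does not match."""
    m = pattern.match(filename)
    return m.group("show", "season", "episode") if m else None


name = "Show.One.S01E05.720p.HDTV.x264-CTU.mkv"
print(parts(GREEDY, name))  # ('Show.One.S01E05', '7', '20')
print(parts(LAZY, name))    # ('Show.One', '01', '05')
```

The greedy version swallows S01E05 into the show name because the last ".digits" position in the string is .720; the non-greedy version stops at the first one.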

Bruce answered Sep 05 '25 06:09


Try:

                    v
(?P<show>[\w\s.,_-]+?)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})
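
As a rough sketch of how this pattern might be used, the snippet below wraps it in a small function that returns the named groups as a dictionary. The names EPISODE_RE and parse_episode are hypothetical, and turning the dots in the show name into spaces is an assumption based on the "Show Two" output the question describes.

```python
import re

# Same pattern as above, with the non-greedy +? on the show group.
EPISODE_RE = re.compile(
    r"(?P<show>[\w\s.,_-]+?)\.[Ss]?(?P<season>\d{1,2})[XxEe]?(?P<episode>\d{2})"
)


def parse_episode(filename):
    """Return show/season/episode for a filename, or None if it does not match."""
    m = EPISODE_RE.match(filename)
    if m is None:
        return None
    return {
        # Assumption: dots in the show name stand for spaces.
        "show": m.group("show").replace(".", " "),
        "season": int(m.group("season")),
        "episode": int(m.group("episode")),
    }


print(parse_episode("Show.Two.S08E02.HDTV.XviD-LOL.avi"))
# {'show': 'Show Two', 'season': 8, 'episode': 2}
```

Because [Ss] and [XxEe] stay optional, a name without the identifiers, such as the made-up Show.Three.105.HDTV.avi, is still parsed (as season 1, episode 5).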

pricco answered Sep 05 '25 06:09