
Repetition in a regular expression in Python

I've got a file with lines like these, for example:

aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj

I need to extract every $...$ substring, so the expected result is:

 $bb$
 $ddd$
 $ggg$
 $iii$

My result:

$bb$
$ggg$

My solution:

m = re.search(r'$(.*?)$', line)
    if m is not None:
        print m.group(0)

Any ideas how to improve my regexp? I tried the * and + quantifiers, but I'm not sure how to put the final pattern together. I searched for a similar post, but couldn't find one :(

degath asked Sep 05 '25 03:09



2 Answers

You can use re.findall with the r'\$[^$]+\$' regex:

import re
line = """aaa$bb$ccc$ddd$eee
fff$ggg$hh$iii$jj"""
m = re.findall(r'\$[^$]+\$', line)
print(m)
# => ['$bb$', '$ddd$', '$ggg$', '$iii$']


Note that you need to escape the $ characters, and that the pattern must not contain a capturing group if re.findall is to return the whole $...$ substrings rather than only the text between the $ signs.
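
Since the question reads its input from a file, here is a minimal sketch of the same pattern applied line by line. The io.StringIO object only stands in for the real file, and the names DOLLAR_SPAN and extract_spans are made up for this illustration; they do not come from the question.

```python
import io
import re

# Same pattern as above, precompiled: $, one or more non-$ characters, $.
DOLLAR_SPAN = re.compile(r'\$[^$]+\$')

def extract_spans(lines):
    """Collect every $...$ substring from an iterable of lines."""
    spans = []
    for line in lines:
        spans.extend(DOLLAR_SPAN.findall(line))
    return spans

# Assumption: io.StringIO replaces the real file; pass an open file object instead.
fake_file = io.StringIO("aaa$bb$ccc$ddd$eee\nfff$ggg$hh$iii$jj\n")
result = extract_spans(fake_file)
print(result)  # ['$bb$', '$ddd$', '$ggg$', '$iii$']
```

Processing one line at a time also guarantees that a match never spans two lines.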

Pattern details:

  • \$ - a dollar symbol (literal)
  • [^$]+ - 1 or more symbols other than $
  • \$ - a literal dollar symbol.

NOTE: [^$] is a negated character class that matches any character except the one(s) listed in the class. Using it here speeds up matching: the lazy .*? pattern expands one character at a time and checks for the closing $ after each step, so it takes many more steps to return a match, whereas the negated class consumes everything up to the next $ in a single pass.
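
The two patterns are not fully interchangeable, though: by default . does not match a newline, while [^$] does. The short sketch below shows the difference on a made-up string (not from the question) in which the text between the two $ signs contains a line break.

```python
import re

# Hypothetical input: a newline sits between the two $ signs.
text = "x$a\nb$y"

lazy_dot = re.findall(r'\$.*?\$', text)                       # . stops at the newline
negated = re.findall(r'\$[^$]+\$', text)                      # [^$] also matches the newline
lazy_dotall = re.findall(r'\$.*?\$', text, flags=re.DOTALL)   # DOTALL lets . match it too

print(lazy_dot)     # []
print(negated)      # ['$a\nb$']
print(lazy_dotall)  # ['$a\nb$']
```

If the whole file is read into one string, this decides whether a $...$ pair may straddle two lines.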

A variation of the pattern returns only the text inside each $...$ pair:

re.findall(r'\$([^$]+)\$', line) 
               ^     ^

Note the added (...) capturing group: with it, re.findall returns only what the group captures, not the whole match.
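
To make the effect of groups on re.findall concrete, here is a small sketch of the cases. The variable names are illustrative only, and the two-group pattern exists purely to show the tuple form.

```python
import re

s = 'aaa$bb$ccc$ddd$eee'

whole = re.findall(r'\$[^$]+\$', s)          # no group: whole matches
inner = re.findall(r'\$([^$]+)\$', s)        # one group: that group's text
pairs = re.findall(r'(\$)([^$]+)\$', s)      # two groups: a tuple per match
grouped = re.findall(r'\$(?:[^$]+)\$', s)    # non-capturing group: whole matches

print(whole)    # ['$bb$', '$ddd$']
print(inner)    # ['bb', 'ddd']
print(pairs)    # [('$', 'bb'), ('$', 'ddd')]
print(grouped)  # ['$bb$', '$ddd$']
```

Use (?:...) when you need grouping for a quantifier or alternation but still want re.findall to return the full match.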

Wiktor Stribiżew answered Sep 07 '25 19:09



re.search finds only the first match. You probably want re.findall, which returns a list of strings, or re.finditer, which returns an iterator of match objects. Additionally, you must escape $ as \$, because an unescaped $ is an anchor that matches at the end of the string (and, with re.MULTILINE, at the end of each line).


Example:

>>> re.findall(r'\$.*?\$', 'aaa$bb$ccc$ddd$eee')
['$bb$', '$ddd$']
>>> re.findall(r'\$(.*?)\$', 'aaa$bb$ccc$ddd$eee')
['bb', 'ddd']
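
The two points made in this answer can be checked with a short script; treat it as a sketch, with variable names chosen only for illustration. It shows what the unescaped pattern from the question actually matches, and how re.finditer exposes positions as well as text.

```python
import re

line = 'aaa$bb$ccc$ddd$eee'

# Unescaped $ is an anchor, so the only match is the empty string at the end.
m = re.search(r'$(.*?)$', line)
print(repr(m.group(0)), m.span())  # '' (18, 18)

# finditer yields match objects, which carry start/end offsets.
found = [(mo.group(0), mo.start(), mo.end())
         for mo in re.finditer(r'\$.*?\$', line)]
print(found)  # [('$bb$', 3, 7), ('$ddd$', 10, 15)]
```

re.finditer is the better fit when you need the offsets, or when the input is large and you do not want the whole result list in memory at once.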

One more improvement would be to use [^$]* instead of .*?; the former means "zero or more characters other than $", which can avoid some pathological backtracking behaviour.
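
One thing to keep in mind when choosing between the * and + quantifiers: [^$]* also accepts an empty $$ pair, while [^$]+ requires at least one character between the $ signs. A quick sketch on a made-up input:

```python
import re

# Hypothetical input with nothing between the two $ signs.
text = 'a$$b'

star = re.findall(r'\$[^$]*\$', text)          # zero or more: matches the empty pair
plus = re.findall(r'\$[^$]+\$', text)          # one or more: no match here
star_inner = re.findall(r'\$([^$]*)\$', text)  # captured text is the empty string

print(star)        # ['$$']
print(plus)        # []
print(star_inner)  # ['']
```

Which behaviour is right depends on whether empty $$ pairs can occur in the file and whether they should count.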