Suppose I have the following strings:
s1=u'--FE(-)---'
s2=u'--FEM(-)---'
s3=u'--FEE(--)-'
and I want to match F,E,E,M and the content of the parentheses in different groups.
I have tried the following regular expression:
u'^.-([F])([EF]*)([E]+)[^FEM]?(M*)?(\\(.*\\))?.*$'
This expression gives the following groups and spans for the different strings:
s1 -> 'F',(2,3)   ,   '',(3,3)    ,    'E',(3,4)    ,    '',(5,5)    ,    None,(-1,-1)
s2 -> 'F',(2,3)   ,   '',(3,3)    ,    'E',(3,4)    ,    'M',(4,5)   ,    (-),(5,8)
s3 -> 'F',(2,3)   ,   'E',(3,4)   ,    'E',(4,5)    ,    '',(6,6)    ,    None,(-1,-1)
For s2, I get the wanted behaviour, a matching of the contents of the parentheses, but for s1 and s3 I don't.
How do I create a regular expression that will match the content of the parentheses even if I don't have a proper match for the group containing 'M's?
EDIT:
The answer by DWilches resolved the initial issue using the regular expression
'^.-(F)([EF]*)(E+)[^FEM]??(M*)(\(.*\)).*?$'
However, the parentheses group is also optional. The following short Python script clarifies the problem:
import re

s1=u'--FE(-)---'
s2=u'--FEM(-)--'
s3=u'--FEE(--)-'
s4=u'--FEE-M(---)--'
s5=u'--FE-M-(-)-'
s6=u'--FEM--'
s7=u'--FE-M--'
ll=[s1,s2,s3,s4,s5,s6,s7]

# rr1 requires the parentheses group; rr2 makes it optional with a trailing ?
rr1=re.compile(r'^.-(F)([EF]*)(E+)[^FEM]??(M*)[^FEM]??(\(.*\)).*?$')
rr2=re.compile(r'^.-(F)([EF]*)(E+)[^FEM]??(M*)[^FEM]??(\(.*\))?.*?$')

for s in ll:
    b=rr1.search(s)  # use rr2 here to produce the second output
    print(s)
    if b:
        print(" '%s' '%s' '%s' '%s' '%s' " % b.groups())
    else:
        print('No match')
    print('######')
For rr1, the output is:
--FE(-)---
 'F' '' 'E' '' '(-)' 
######
--FEM(-)--
 'F' '' 'E' 'M' '(-)' 
######
--FEE(--)-
 'F' 'E' 'E' '' '(--)' 
######
--FEE-M(---)--
 'F' 'E' 'E' 'M' '(---)' 
######
--FE-M-(-)-
 'F' '' 'E' 'M' '(-)' 
######
--FEM--
No match
######
--FE-M--
No match
######
This is correct for the first five strings, but not for the last two, because rr1 requires the parentheses.
rr2, which adds a ? after (\(.*\)), yields the following output instead:
--FE(-)---
 'F' '' 'E' '' '(-)' 
######
--FEM(-)--
 'F' '' 'E' 'M' '(-)' 
######
--FEE(--)-
 'F' 'E' 'E' '' '(--)' 
######
--FEE-M(---)--
 'F' 'E' 'E' '' 'None' 
######
--FE-M-(-)-
 'F' '' 'E' '' 'None' 
######
--FEM--
 'F' '' 'E' 'M' 'None' 
######
--FE-M--
 'F' '' 'E' '' 'None' 
######
This is correct for s1, s2, s3 and s6, but s4, s5 and s7 lose the M, and s4 and s5 also lose the parentheses.
Some modification is needed to yield the desired output: getting the M if it exists and the content of the parentheses if the parentheses exist.
It seems you need to use non-greedy operators:
^.-(F)([EF]*)(E+)[^FEM]??(M*)(\\(.*\\))?.*?$
Note that I added a ? after the final .*, and I also changed [^FEM]? to [^FEM]??.
In the first of your samples, the problem was that your [^FEM]? was eating up the ( and the last .* was then eating up the -) and everything after it, thus not leaving anything for (\\(.*\\))?
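The difference can be checked directly on the first sample. The following is a minimal sketch that compares the parentheses group (group 5) under the original pattern and the revised one; the names greedy and lazy are only illustrative labels, and the patterns are written as raw strings.

```python
import re

s1 = '--FE(-)---'

# Original pattern: greedy [^FEM]? and greedy trailing .*
greedy = re.compile(r'^.-([F])([EF]*)([E]+)[^FEM]?(M*)?(\(.*\))?.*$')
# Revised pattern: lazy [^FEM]?? and lazy trailing .*?
lazy = re.compile(r'^.-(F)([EF]*)(E+)[^FEM]??(M*)(\(.*\))?.*?$')

for name, pattern in (('greedy', greedy), ('lazy', lazy)):
    m = pattern.search(s1)
    # Group 5 is the parentheses group; span (-1, -1) means it did not take part
    print(name, m.group(5), m.span(5))
# greedy None (-1, -1)
# lazy (-) (4, 7)
```

With the greedy version the ( is already consumed by [^FEM]? before the parentheses group gets a chance, so group 5 stays empty even though the overall match succeeds.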
(I also removed some square brackets around single letters, but that was more to have a shorter regex)
With this regex I obtain the following results:
--FE(-)---    ->     'F'    ''     'E'    ''     '(-)'
--FEM(-)---   ->     'F'    ''     'E'    'M'    '(-)'
--FEE(--)-    ->     'F'    'E'    'E'    ''     '(--)'
BTW: if the parentheses are always present, you could also remove the ? at the end of (\\(.*\\))?. Note that without it the group becomes mandatory, so a string that has no parentheses will not match at all; the following .*? cannot stand in for it.
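For inputs where both the M and the parentheses may be missing, and a separator may sit before or after the M (such as --FEE-M(---)-- or --FE-M--), one possible approach is to keep the separators greedy but exclude ( from them, so they can never consume the opening parenthesis. This is only a sketch under that assumption: the class [^FEM(] and the helper name extract are not from the original patterns, and it presumes a separator is a single character that is never (.

```python
import re

# Assumption: an optional separator is any single character other than
# F, E, M or '('. Excluding '(' keeps a greedy separator from consuming
# the opening parenthesis.
PATTERN = re.compile(r'^.-(F)([EF]*)(E+)[^FEM(]?(M*)[^FEM(]?(\(.*\))?.*$')

def extract(s):
    """Return the five groups as a tuple, or None if s does not match."""
    m = PATTERN.search(s)
    return m.groups() if m else None

samples = ['--FE(-)---', '--FEM(-)--', '--FEE(--)-', '--FEE-M(---)--',
           '--FE-M-(-)-', '--FEM--', '--FE-M--']

for s in samples:
    print(s, extract(s))
```

On these seven strings, group 4 is 'M' whenever an M is present, and group 5 is None only for the two strings that have no parentheses.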