Python regex matching with optional prefix and suffix

Tags:

python

regex

I have a regular expression that matches parts of a string (specifically peptide sequences with modifications) and I want to use re.findall to get all parts of the string:

The sequence can start with an optional prefix, which is any string of non-capital characters followed by -.

The sequence can also end with an optional suffix, which is a - followed by a string of non-capital characters.

The rest of the sequence should be split at capital letters, each with an optional run of non-capital characters in front of it.

E.g.

"foo-ABcmCD-bar" -> ['foo-','A','B','cmC','D','-bar']

"DEF" -> ['','D','E','F','']

"WHATEVER-foo" -> ['', 'W', 'H', 'A', 'T', 'E', 'V', 'E', 'R', '-foo']

"cmC-foo" -> ['', 'cmC', '-foo']

"ac-cmC-foo" -> ['ac-', 'cmC', '-foo']

What I have is:

(?:(^(?:[^A-Z]+-)?)|((?:-[^A-Z]+)?$)|((?:[^A-Z]*)?[A-Z]))

Capturing group 1, (^(?:[^A-Z]+-)?), is supposed to catch the optional prefix or an empty string. Capturing group 2, ((?:-[^A-Z]+)?$), is supposed to catch the optional suffix or an empty string. Capturing group 3, ((?:[^A-Z]*)?[A-Z]), is supposed to catch any capital character in the rest of the string, together with any run of non-capital characters in front of it.

I get the optional prefix or empty string.

The suffix almost works, BUT if there is a suffix, the end of the line is matched twice: once with the suffix and once with an empty string.

>>> re.findall(r,"foo-ABC-bar")
['foo-', 'A', 'B', 'C', '-bar', '']
>>> re.findall(r,"ABC-bar")
['', 'A', 'B', 'C', '-bar', '']
>>> re.findall(r,"ABcmC")
['', 'A', 'B', 'cmC', '']

I.e. how do I get rid of the extra empty string or why is the $ matched twice?

example: https://regex101.com/r/koZPOD/1

Lutz asked Sep 08 '25 17:09


2 Answers

This question already has three answers, all about how to write a regex that works, but none about how to fix your regex. Yours is almost correct: it just needs a tiny modification.

[...] why is the $ matched twice?

This is the relevant part:

(?:-[^A-Z]+)?$

As you can see, you are matching an optional pattern before $, which is a zero-width assertion. So, after matching -bar in foo-A-bar, the engine proceeds to the position right after it (the end of the string), where it finds no -[^A-Z]+ but, again, $. Since the former is optional, this is recorded as another, empty match.
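To see the double match directly, here is a minimal sketch that runs only the suffix alternative on its own and lists each match together with its span. The variable names are illustrative and not from the question; a Python 3.7+ interpreter is assumed.

```python
import re

# The suffix alternative from the question, in isolation.
suffix = re.compile(r'(?:-[^A-Z]+)?$')

# finditer exposes the span of every match, including zero-width ones.
matches = [(m.span(), m.group()) for m in suffix.finditer('foo-A-bar')]
print(matches)
# [((5, 9), '-bar'), ((9, 9), '')]
```

The second entry is the empty match at position 9: the optional group matches nothing there, and $ still succeeds.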

[...] how do I get rid of the extra empty string[?]

We explicitly tell the engine to match $ iff it is preceded by either -[^A-Z]+ (i.e. has suffix), or something that is not [^A-Z] (i.e. no suffix):

(?:
  (^(?:[^A-Z]+-)?)
|
  ((?:-[^A-Z]+|(?<![^A-Z]))$)
|
  ((?:[^A-Z]*)?[A-Z])
)

Try it on regex101.com.

(regex101.com's Python flavor doesn't reflect the actual result so I used PCRE2 instead.)
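For checking this in Python itself, a rough sketch: the free-spacing layout above needs re.VERBOSE to compile, and since the pattern still has three capturing groups, re.findall would return tuples, so this uses finditer with group(0) to get the flat list. The names fixed and parts are placeholders for illustration only.

```python
import re

# The modified regex, laid out as above; re.VERBOSE ignores the
# whitespace and allows the trailing comments.
fixed = re.compile(r'''
    (?:
      (^(?:[^A-Z]+-)?)               # group 1: optional prefix
    |
      ((?:-[^A-Z]+|(?<![^A-Z]))$)    # group 2: suffix, or end not preceded by a non-capital
    |
      ((?:[^A-Z]*)?[A-Z])            # group 3: capital with optional leading modification
    )
''', re.VERBOSE)

def parts(s):
    # group(0) is the whole match, whichever group happened to capture it.
    return [m.group(0) for m in fixed.finditer(s)]

print(parts('foo-ABC-bar'))  # ['foo-', 'A', 'B', 'C', '-bar']
print(parts('ABcmC'))        # ['', 'A', 'B', 'cmC', '']
```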

Also, the outermost (?: ) and the (?: )? in (?:[^A-Z]*)? are unnecessary; you can remove them entirely. (?<![^A-Z]) can also be simplified as (?<=[A-Z]).

(^(?:[^A-Z]+-)?)
|
((?:-[^A-Z]+|(?<=[A-Z]))$)
|
([^A-Z]*[A-Z])

Try it on regex101.com.

Remove the capturing groups to make it .findall()-friendly:

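import re

# Inputs as they appear in the output below.
testcases = ['foo-ABcmCD-bar', 'DEF', 'WHATEVER-foo', 'foo-ABC-bar', 'ABC-bar', 'ABcmC', 'foo-abCD', 'abCD']
pattern = re.compile(r'^(?:[^A-Z]+-)?|(?:-[^A-Z]+|(?<=[A-Z]))$|[^A-Z]*[A-Z]')

for testcase in testcases:
  print(f'{testcase!r:<16}: {pattern.findall(testcase)}')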
'foo-ABcmCD-bar': ['foo-', 'A', 'B', 'cmC', 'D', '-bar']
'DEF'           : ['', 'D', 'E', 'F', '']
'WHATEVER-foo'  : ['', 'W', 'H', 'A', 'T', 'E', 'V', 'E', 'R', '-foo']
'foo-ABC-bar'   : ['foo-', 'A', 'B', 'C', '-bar']
'ABC-bar'       : ['', 'A', 'B', 'C', '-bar']
'ABcmC'         : ['', 'A', 'B', 'cmC', '']
'foo-abCD'      : ['foo-', 'abC', 'D', '']
'abCD'          : ['', 'abC', 'D', '']
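As a sanity check, the group-free pattern above can be run against the five examples given in the question. This is only a sketch; the dictionary name expected is made up for the check, and its values are copied from the question's "E.g." list.

```python
import re

pattern = re.compile(r'^(?:[^A-Z]+-)?|(?:-[^A-Z]+|(?<=[A-Z]))$|[^A-Z]*[A-Z]')

# Input -> wanted output, as listed in the question.
expected = {
    'foo-ABcmCD-bar': ['foo-', 'A', 'B', 'cmC', 'D', '-bar'],
    'DEF': ['', 'D', 'E', 'F', ''],
    'WHATEVER-foo': ['', 'W', 'H', 'A', 'T', 'E', 'V', 'E', 'R', '-foo'],
    'cmC-foo': ['', 'cmC', '-foo'],
    'ac-cmC-foo': ['ac-', 'cmC', '-foo'],
}

# Collect any input whose result differs from what the question asks for.
mismatches = {s: pattern.findall(s)
              for s, want in expected.items()
              if pattern.findall(s) != want}
print(mismatches)  # {}
```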
InSync answered Sep 10 '25 08:09


Maybe you could filter your regex results like this:

import re
strs = ["foo-ABcmCD-bar", "DEF", "WHATEVER-foo", "foo-AB(-16)CDE-bla", 'foo-abCD', 'abCD']
for s in strs:
    result = (
        [[z for z in y if z] for y in (re.findall(r'(?:(^(?:[^A-Z]+-)?)|((?:-[^A-Z]+)?$)|((?:[^A-Z]*)?[A-Z]))', s))])
    print(f'1st step: {result=}')
    # result = result[:-1] if len(result) > 2 and result[-2] and '-' in result[-2][0] else result
    result = result[:-1] if len(result) > 2 and result[-2] and result[-2][0].startswith('-') else result # (better)
    print(f'2nd step: {result=}')
    result = ['' if not x else x[0] for x in result]
    print(f'FINAL RESULT: 3rd step: {result=}')
    print('------------------')

1st step: result=[['foo-'], ['A'], ['B'], ['cmC'], ['D'], ['-bar'], []]
2nd step: result=[['foo-'], ['A'], ['B'], ['cmC'], ['D'], ['-bar']]
FINAL RESULT: 3rd step: result=['foo-', 'A', 'B', 'cmC', 'D', '-bar']
------------------
1st step: result=[[], ['D'], ['E'], ['F'], []]
2nd step: result=[[], ['D'], ['E'], ['F'], []]
FINAL RESULT: 3rd step: result=['', 'D', 'E', 'F', '']
------------------
1st step: result=[[], ['W'], ['H'], ['A'], ['T'], ['E'], ['V'], ['E'], ['R'], ['-foo'], []]
2nd step: result=[[], ['W'], ['H'], ['A'], ['T'], ['E'], ['V'], ['E'], ['R'], ['-foo']]
FINAL RESULT: 3rd step: result=['', 'W', 'H', 'A', 'T', 'E', 'V', 'E', 'R', '-foo']
------------------
1st step: result=[['foo-'], ['A'], ['B'], ['(-16)C'], ['D'], ['E'], ['-bla'], []]
2nd step: result=[['foo-'], ['A'], ['B'], ['(-16)C'], ['D'], ['E'], ['-bla']]
FINAL RESULT: 3rd step: result=['foo-', 'A', 'B', '(-16)C', 'D', 'E', '-bla']

As a one-liner (maybe more elegant - maybe not haha)

import re

strs = ["foo-ABcmCD-bar", "DEF", "WHATEVER-foo", "foo-AB(-16)CDE-bla", 'foo-abCD', 'abCD']
[['' if not x else x[0] for x in q ] for q in [result[:-1]  if len(result) > 2 and result[-2] and result[-2][0].startswith('-') else result for result in  [[[z for z in y if z] for y in (re.findall(r'(?:(^(?:[^A-Z]+-)?)|((?:-[^A-Z]+)?$)|((?:[^A-Z]*)?[A-Z]))', s))] for s in strs]]]


[['foo-', 'A', 'B', 'cmC', 'D', '-bar'],
 ['', 'D', 'E', 'F', ''],
 ['', 'W', 'H', 'A', 'T', 'E', 'V', 'E', 'R', '-foo'],
 ['foo-', 'A', 'B', '(-16)C', 'D', 'E', '-bla'],
 ['foo-', 'abC', 'D', ''],
 ['', 'abC', 'D', '']]
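The same filtering idea could also be wrapped in a small function, which may be easier to read than the one-liner. This is just a sketch: split_sequence and QUESTION_RE are made-up names, and it uses finditer with group(0) instead of filtering the tuples. Like the code above, it relies on the second-to-last piece starting with - to detect a suffix; it additionally checks that the last piece really is empty before dropping it.

```python
import re

# The unmodified regex from the question.
QUESTION_RE = re.compile(r'(?:(^(?:[^A-Z]+-)?)|((?:-[^A-Z]+)?$)|((?:[^A-Z]*)?[A-Z]))')

def split_sequence(s):
    # group(0) is the whole match, so no per-tuple filtering is needed.
    parts = [m.group(0) for m in QUESTION_RE.finditer(s)]
    # Drop the extra empty end-of-string match when a suffix precedes it.
    if len(parts) > 2 and parts[-1] == '' and parts[-2].startswith('-'):
        parts.pop()
    return parts

print(split_sequence('foo-AB(-16)CDE-bla'))
# ['foo-', 'A', 'B', '(-16)C', 'D', 'E', '-bla']
```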
Hans answered Sep 10 '25 10:09