I have a regular expression that matches parts of a string (specifically peptide sequences with modifications) and I want to use re.findall to get all parts of the string:
The sequence can start with an optional prefix: any string of non-capital characters followed by -.
And the sequence can end with an optional suffix: a - followed by a string of non-capital characters.
The rest of the sequence should be split by capital letters with an optional prefix for each.
E.g.
"foo-ABcmCD-bar"
-> ['foo-','A','B','cmC','D','-bar']
"DEF"
-> ['','D','E','F','']
"WHATEVER-foo"
-> ['', 'W', 'H', 'A', 'T', 'E', 'V', 'E', 'R', '-foo']
"cmC-foo"
-> ['', 'cmC', '-foo']
"ac-cmC-foo"
-> ['ac-', 'cmC', '-foo']
What I have is:
(?:(^(?:[^A-Z]+-)?)|((?:-[^A-Z]+)?$)|((?:[^A-Z]*)?[A-Z]))
Capturing group 1, (^(?:[^A-Z]+-)?), is supposed to catch the optional prefix or an empty string.
Capturing group 2, ((?:-[^A-Z]+)?$), is supposed to catch the optional suffix or an empty string.
Capturing group 3, ((?:[^A-Z]*)?[A-Z]), is supposed to catch any capital character in the rest of the string, which may have a substring of non-capital characters in front of it.
I get the optional prefix or an empty string.
The suffix almost works, BUT if there is a suffix, the end of the line is matched twice: once with the suffix and once with an empty string.
>>> re.findall(r,"foo-ABC-bar")
['foo-', 'A', 'B', 'C', '-bar', '']
>>> re.findall(r,"ABC-bar")
['', 'A', 'B', 'C', '-bar', '']
>>> re.findall(r,"ABcmC")
['', 'A', 'B', 'cmC', '']
I.e. how do I get rid of the extra empty string or why is the $ matched twice?
example: https://regex101.com/r/koZPOD/1
This question already has three answers, all about how to write a regex that works, but none about how to fix your regex. Yours is almost correct: it just needs a tiny modification.
[...] why is the $ matched twice?
This is the relevant part:
(?:-[^A-Z]+)?$
As you see, you are matching an optional pattern before $, a zero-width assertion. So, after matching -bar in foo-A-bar, the engine proceeds to the position right after it, where it finds no -[^A-Z]+ but, again, $. Since the former is optional, this is recorded as another (empty) match.
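To make the second match visible, here is a minimal sketch that prints the span of every match instead of just the text. It assumes Python 3.7 or later (where an empty match may directly follow a non-empty one) and uses the pattern from the question unchanged; the helper name match_spans is only for illustration.

```python
import re

# The question's original pattern, unchanged.
original = re.compile(r'(?:(^(?:[^A-Z]+-)?)|((?:-[^A-Z]+)?$)|((?:[^A-Z]*)?[A-Z]))')

def match_spans(pattern, text):
    """Return (start, end, matched_text) for every match, in order."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

spans = match_spans(original, "foo-A-bar")
for span in spans:
    print(span)
# (0, 4, 'foo-')
# (4, 5, 'A')
# (5, 9, '-bar')
# (9, 9, '')      <- the extra empty match: only $ matched, at the very end
```

The last span starts and ends at index 9, right after -bar, which is the position described above.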
[...] how do I get rid of the extra empty string[?]
We explicitly tell the engine to match $ iff it is preceded by either -[^A-Z]+ (i.e. has suffix), or something that is not [^A-Z] (i.e. no suffix):
(?:
(^(?:[^A-Z]+-)?)
|
((?:-[^A-Z]+|(?<![^A-Z]))$)
|
((?:[^A-Z]*)?[A-Z])
)
Try it on regex101.com.
(regex101.com's Python flavor doesn't reflect the actual result so I used PCRE2 instead.)
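If you want to check the result in Python itself, the following sketch compiles the same pattern with re.VERBOSE, so the layout above can be kept as-is. The names FIXED and split_with_fixed_pattern are made up for this example, and finditer with group(0) is used only to get flat strings while the capturing groups are still in place.

```python
import re

# The corrected pattern from above; re.VERBOSE ignores the whitespace
# and the "#" comments inside the pattern string.
FIXED = re.compile(r"""
    (^(?:[^A-Z]+-)?)              # group 1: optional prefix, or empty at the start
    |
    ((?:-[^A-Z]+|(?<![^A-Z]))$)   # group 2: suffix, or empty if no non-capital precedes the end
    |
    ((?:[^A-Z]*)?[A-Z])           # group 3: capital letter with optional leading modification
""", re.VERBOSE)

def split_with_fixed_pattern(text):
    """Return the matched pieces in order; group(0) is the whole match."""
    return [m.group(0) for m in FIXED.finditer(text)]

print(split_with_fixed_pattern("foo-ABC-bar"))  # ['foo-', 'A', 'B', 'C', '-bar']
print(split_with_fixed_pattern("ABcmC"))        # ['', 'A', 'B', 'cmC', '']
```

With a suffix present, the trailing empty string is gone; without one, the empty suffix is still reported once.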
Also, the outermost (?:...) and the (?:...)? in (?:[^A-Z]*)? are unnecessary; you can remove them entirely. (?<![^A-Z]) can also be simplified as (?<=[A-Z]).
(^(?:[^A-Z]+-)?)
|
((?:-[^A-Z]+|(?<=[A-Z]))$)
|
([^A-Z]*[A-Z])
Try it on regex101.com.
Remove the capturing groups to make it .findall()-friendly:
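import re
testcases = ['foo-ABcmCD-bar', 'DEF', 'WHATEVER-foo', 'foo-ABC-bar', 'ABC-bar', 'ABcmC', 'foo-abCD', 'abCD']
pattern = re.compile(r'^(?:[^A-Z]+-)?|(?:-[^A-Z]+|(?<=[A-Z]))$|[^A-Z]*[A-Z]')
for testcase in testcases:
    print(f'{testcase!r:<16}: {pattern.findall(testcase)}')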
'foo-ABcmCD-bar': ['foo-', 'A', 'B', 'cmC', 'D', '-bar']
'DEF'           : ['', 'D', 'E', 'F', '']
'WHATEVER-foo'  : ['', 'W', 'H', 'A', 'T', 'E', 'V', 'E', 'R', '-foo']
'foo-ABC-bar'   : ['foo-', 'A', 'B', 'C', '-bar']
'ABC-bar'       : ['', 'A', 'B', 'C', '-bar']
'ABcmC'         : ['', 'A', 'B', 'cmC', '']
'foo-abCD'      : ['foo-', 'abC', 'D', '']
'abCD'          : ['', 'abC', 'D', '']
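Why the groups get in the way of .findall(): when a pattern contains more than one capturing group, re.findall returns one tuple per match (one slot per group) rather than the matched text. A small sketch comparing the two forms on a made-up input, "A-b":

```python
import re

text = "A-b"  # hypothetical input, chosen only to keep the output short

# Same alternation, once with the three capturing groups and once without.
with_groups = re.findall(
    r'(^(?:[^A-Z]+-)?)|((?:-[^A-Z]+|(?<=[A-Z]))$)|([^A-Z]*[A-Z])', text)
without_groups = re.findall(
    r'^(?:[^A-Z]+-)?|(?:-[^A-Z]+|(?<=[A-Z]))$|[^A-Z]*[A-Z]', text)

print(with_groups)     # [('', '', ''), ('', '', 'A'), ('', '-b', '')]
print(without_groups)  # ['', 'A', '-b']
```

Note that the first tuple is all empty strings, so from the findall result alone you can no longer tell which group produced the empty prefix.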
Maybe you could filter your regex results like this:
import re
strs = ["foo-ABcmCD-bar", "DEF", "WHATEVER-foo", "foo-AB(-16)CDE-bla", 'foo-abCD', 'abCD']
for s in strs:
    # findall returns one tuple per match; keep only the non-empty group of each
    result = (
        [[z for z in y if z] for y in (re.findall(r'(?:(^(?:[^A-Z]+-)?)|((?:-[^A-Z]+)?$)|((?:[^A-Z]*)?[A-Z]))', s))])
    print(f'1st step: {result=}')
    # drop the trailing empty match if the element before it is a suffix
    # result = result[:-1] if len(result) > 2 and result[-2] and '-' in result[-2][0] else result
    result = result[:-1] if len(result) > 2 and result[-2] and result[-2][0].startswith('-') else result  # (better)
    print(f'2nd step: {result=}')
    # flatten: an empty list becomes '', otherwise take the single string
    result = ['' if not x else x[0] for x in result]
    print(f'FINAL RESULT: 3rd step: {result=}')
    print('------------------')
1st step: result=[['foo-'], ['A'], ['B'], ['cmC'], ['D'], ['-bar'], []]
2nd step: result=[['foo-'], ['A'], ['B'], ['cmC'], ['D'], ['-bar']]
FINAL RESULT: 3rd step: result=['foo-', 'A', 'B', 'cmC', 'D', '-bar']
------------------
1st step: result=[[], ['D'], ['E'], ['F'], []]
2nd step: result=[[], ['D'], ['E'], ['F'], []]
FINAL RESULT: 3rd step: result=['', 'D', 'E', 'F', '']
------------------
1st step: result=[[], ['W'], ['H'], ['A'], ['T'], ['E'], ['V'], ['E'], ['R'], ['-foo'], []]
2nd step: result=[[], ['W'], ['H'], ['A'], ['T'], ['E'], ['V'], ['E'], ['R'], ['-foo']]
FINAL RESULT: 3rd step: result=['', 'W', 'H', 'A', 'T', 'E', 'V', 'E', 'R', '-foo']
------------------
1st step: result=[['foo-'], ['A'], ['B'], ['(-16)C'], ['D'], ['E'], ['-bla'], []]
2nd step: result=[['foo-'], ['A'], ['B'], ['(-16)C'], ['D'], ['E'], ['-bla']]
FINAL RESULT: 3rd step: result=['foo-', 'A', 'B', '(-16)C', 'D', 'E', '-bla']
As a one-liner (maybe more elegant - maybe not haha)
import re
strs = ["foo-ABcmCD-bar", "DEF", "WHATEVER-foo", "foo-AB(-16)CDE-bla", 'foo-abCD', 'abCD']
[['' if not x else x[0] for x in q ] for q in [result[:-1] if len(result) > 2 and result[-2] and result[-2][0].startswith('-') else result for result in [[[z for z in y if z] for y in (re.findall(r'(?:(^(?:[^A-Z]+-)?)|((?:-[^A-Z]+)?$)|((?:[^A-Z]*)?[A-Z]))', s))] for s in strs]]]
[['foo-', 'A', 'B', 'cmC', 'D', '-bar'],
['', 'D', 'E', 'F', ''],
['', 'W', 'H', 'A', 'T', 'E', 'V', 'E', 'R', '-foo'],
['foo-', 'A', 'B', '(-16)C', 'D', 'E', '-bla'],
['foo-', 'abC', 'D', ''],
['', 'abC', 'D', '']]
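The same filtering idea might read more easily as a small function. This is only a sketch: it assumes the capturing groups are dropped from the question's pattern so that re.findall returns plain strings, and the names PIECES and split_and_trim are invented here.

```python
import re

# The question's pattern with the capturing groups removed, so findall
# returns the matched text directly (still with the double match at $).
PIECES = re.compile(r'^(?:[^A-Z]+-)?|(?:-[^A-Z]+)?$|[^A-Z]*[A-Z]')

def split_and_trim(text):
    """Split a sequence and drop the spurious '' that follows a real suffix."""
    parts = PIECES.findall(text)
    if len(parts) >= 2 and parts[-1] == '' and parts[-2].startswith('-'):
        parts.pop()
    return parts

for s in ["foo-ABcmCD-bar", "DEF", "foo-AB(-16)CDE-bla"]:
    print(split_and_trim(s))
# ['foo-', 'A', 'B', 'cmC', 'D', '-bar']
# ['', 'D', 'E', 'F', '']
# ['foo-', 'A', 'B', '(-16)C', 'D', 'E', '-bla']
```

The trailing empty string is kept when there is no suffix, so the list still always ends with the suffix slot.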