I want to remove all occurrences of a pattern in a Python string, where the pattern looks like "start-string blah, blah, blah end-string". This is a general problem I'd like to be able to handle. It is the same problem as "How can I remove a portion of text from a string whenever it starts with &*( and ends with )(*", but in Python rather than Java.
How would I solve the same problem in Python?
Assume the string looks like this,
'Bla bla bla <mark asd asd asd /> bla bla bla. Yadda yadda yadda <mark akls lkja /> yadda.'
The start of the block to remove is <mark and the end is />. So I do the following:
import re
mystring = "Bla bla bla <mark asd asd asd /> bla bla bla. Yadda yadda yadda <mark akls lkja /> yadda."
tags = "<mark", "/>"
re.sub('%s.*%s' % tags, '', mystring)
My desired output is
'Bla bla bla bla bla bla. Yadda yadda yadda yadda.'
But what I get is
'Bla bla bla yadda.'
So clearly the command is using the first instance of the opening string and the last occurrence of the end string.
How do I make it match the pattern twice and give me the desired output? This has to be easy but despite searches on "remove multiple occurrences regex Python" and the like I have not found an answer. Thanks.
You basically want to find anything between '<mark' and '/>' so you start with the pattern
r'<mark .* />'
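However, .* is greedy: it matches as much as possible, so the match runs from the first <mark to the last />. Adding a ? makes it non-greedy (.*?), so each match stops at the nearest />. Then use re.sub to replace each match with an empty string:
>>> re.sub(r'<mark .*? />', '', mystring)
'Bla bla bla  bla bla bla. Yadda yadda yadda  yadda.'
The spaces on both sides of each removed block stay, hence the double spaces. To also consume one trailing space and get exactly the desired output, use r'<mark .*? /> ?' as the pattern.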
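Since the question asks for a general solution, here is a minimal sketch that works for arbitrary start and end strings. The helper name remove_blocks is made up for illustration, and using re.DOTALL (so a block may span several lines) is an assumption about what you want. re.escape treats the delimiters as literal text, which matters for markers such as &*( and )(* that contain regex metacharacters.

```python
import re

def remove_blocks(text, start, end):
    """Remove every span from `start` through the nearest following `end`.

    Hypothetical helper for illustration; not part of the re module.
    """
    # re.escape makes the delimiters literal, so "&*(" or ")(*" are safe.
    # .*? is non-greedy; re.DOTALL (an assumption) lets a block span line breaks.
    pattern = re.escape(start) + r'.*?' + re.escape(end)
    return re.sub(pattern, '', text, flags=re.DOTALL)

mystring = "Bla bla bla <mark asd asd asd /> bla bla bla. Yadda yadda yadda <mark akls lkja /> yadda."
cleaned = remove_blocks(mystring, "<mark", "/>")

# Removing a block leaves the spaces on both sides of it,
# so collapse runs of two or more spaces into one.
cleaned = re.sub(r' {2,}', ' ', cleaned)
print(cleaned)
```

The final re.sub is only there to tidy the leftover whitespace; drop it if the original spacing should be preserved.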