I've hit a problem with a regular expression in Python (2.7.9).
I'm trying to strip out HTML <span> tags using a regex like so:
re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, re.S)
(The regex reads: <span, anything that's not a >, then a >, then a non-greedy match of anything, followed by </span>. re.S (re.DOTALL) is passed so that . matches newline characters.)
This seems to work unless there is a newline in the text. It looks like re.S (DOTALL) doesn't apply within a non-greedy match.
Here's the test code; remove the newline from text1 and the re.sub works. Put it back in, and the re.sub fails. Put the newline char outside the <span> tag, and the re.sub works.
#!/usr/bin/env python
import re
text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
print repr(text1)
text2 = re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
print repr(text2)
For comparison, I wrote a Perl script to do the same thing; the regex works as I expect here.
#!/usr/bin/perl
# qq{} interpolates \n as a real newline; single quotes would keep it as a literal backslash-n
$text1 = qq{<body id="aa">this is a <span color="red">test\n with newline</span></body>};
print "$text1\n";
$text1 =~ s/<span[^>]*>(.*?)<\/span>/$1/s;
print "$text1\n";
Any ideas?
Tested in Python 2.6.6 and Python 2.7.9
The 4th parameter of re.sub is count, not flags:
re.sub(pattern, repl, string, count=0, flags=0)
You need to use a keyword argument to specify the flags explicitly:
re.sub(r'<span[^>]*>(.*?)</span>', r'\1', input_text, flags=re.S)
                                                      ↑↑↑↑↑↑
Otherwise, re.S is interpreted as the replacement count (at most 16 replacements) instead of as the S (DOTALL) flag:
>>> import re
>>> re.S
16
>>> text1 = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'
>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, re.S)
'<body id="aa">this is a <span color="red">test\n with newline</span></body>'
>>> re.sub(r'<span[^>]*>(.*?)</span>', r'\1', text1, flags=re.S)
'<body id="aa">this is a test\n with newline</body>'
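If you'd rather not depend on remembering the keyword, there are a couple of other ways to attach DOTALL to the pattern itself. The following is a minimal sketch rather than the one right way; the names PATTERN, span_re, by_keyword, by_compiled, by_inline and unchanged are just illustrative. It uses the same test string as above and should run on both Python 2.7 and Python 3.

```python
import re

PATTERN = r'<span[^>]*>(.*?)</span>'
text = '<body id="aa">this is a <span color="red">test\n with newline</span></body>'

# re.S has the numeric value 16, so passing it positionally as the 4th
# argument is the same as asking for count=16 with no flags at all.
unchanged = re.sub(PATTERN, r'\1', text, count=int(re.S))

# Option 1: pass the flag by keyword.
by_keyword = re.sub(PATTERN, r'\1', text, flags=re.S)

# Option 2: compile the pattern first; the flags go to re.compile(),
# so the count/flags mix-up cannot happen in the sub() call.
span_re = re.compile(PATTERN, re.S)
by_compiled = span_re.sub(r'\1', text)

# Option 3: put the inline flag (?s) at the very start of the pattern.
by_inline = re.sub(r'(?s)' + PATTERN, r'\1', text)

print(repr(unchanged))
print(repr(by_keyword))
print(repr(by_compiled))
print(repr(by_inline))
```

The first result is the input unchanged, because without DOTALL the . cannot cross the newline and nothing matches; the other three all strip the span.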