I have the piece of HTML below and need to extract only the text between
<p>Current</p> and <p>Archive</p>.
The HTML chunk looks like this:
<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>
so the desired output should look like File1, File2, File3.
This is what I've tried so far:
import re
m = re.compile('<p>Current</p>(.*?)<p>Archive</p>').search(text)
but it doesn't work as expected.
Is there a simple way to extract the text between specified HTML tags in Python?
If you insist on using regex, you can combine it with list slicing like so:
chunk="""<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>"""
import re
# find every run of text between > and <; the lazy +? keeps each match as short as possible
# ("." does not match a newline, so the line breaks between tags are skipped)
found = re.findall(r">(.+?)<", chunk)
# keep only the items after "Current" and before "Archive"
found[:] = found[found.index("Current") + 1:found.index("Archive")]
print(found)  # Python 3 syntax; remove the parentheses for Python 2.7
Output:
['File1', 'File2', 'File3']
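The attempt in the question most likely fails because "." does not match newline characters by default, so (.*?) cannot span the several lines between the two <p> tags. A minimal sketch of a fix, assuming the markup always looks like the sample above (the names section and names are only illustrative): pass re.DOTALL to capture the section, then pull the link texts out of it.

```python
import re

chunk = """<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>"""

# re.DOTALL lets "." match newlines too, so the lazy group can span several lines
section = re.search(r"<p>Current</p>(.*?)<p>Archive</p>", chunk, re.DOTALL)

# collect the text of each <a> element inside the captured section
# (empty list if the two markers were not found)
names = re.findall(r"<a\b[^>]*>(.*?)</a>", section.group(1), re.DOTALL) if section else []

print(", ".join(names))
```

For the sample chunk this prints File1, File2, File3. Regex is fragile on real-world HTML (nested tags, different attribute order, extra whitespace), so for anything beyond a fixed snippet like this one a proper HTML parser is probably the safer choice.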