I have some HTML that looks like this:
<h1>Title</h1>
//a random amount of p/uls or tagless text
<h1> Next Title</h1>
I want to copy all of the HTML from the first h1 up to the next h1. How can I do this?
This is the straightforward BeautifulSoup approach when the second h1 tag is a sibling of the first:
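html = u""
for tag in soup.find("h1").next_siblings:
    # Stop at the next h1; text nodes have no tag name and fall through.
    if tag.name == "h1":
        break
    else:
        html += unicode(tag)  # Python 2; use str(tag) on Python 3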
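For anyone on Python 3, here is a minimal self-contained sketch of the same sibling walk. It assumes the bs4 package and the built-in "html.parser" backend; the function name html_until_next_h1 and the sample markup are made up for illustration and are not from the answer above.

```python
from bs4 import BeautifulSoup


def html_until_next_h1(markup):
    """Return the markup that sits between the first <h1> and its next sibling <h1>."""
    soup = BeautifulSoup(markup, "html.parser")
    first = soup.find("h1")
    if first is None:
        return ""
    parts = []
    for node in first.next_siblings:
        # Bare text nodes are not tags, so they never compare equal to "h1"
        # and are collected along with the p/ul elements.
        if getattr(node, "name", None) == "h1":
            break
        parts.append(str(node))
    return "".join(parts)


# Hypothetical sample input mirroring the question's layout.
sample = (
    "<h1>Title</h1><p>one</p>loose text"
    "<ul><li>two</li></ul><h1>Next Title</h1><p>three</p>"
)
print(html_until_next_h1(sample))
```

Collecting the pieces in a list and joining them once avoids repeated string concatenation. Like the original loop, this only works when both h1 tags share the same parent.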
I had the same problem. I'm not sure whether there is a better solution, but what I did was use a regular expression to find the positions of the two nodes I was looking for. Once I have those, I extract the HTML between the two indices and create a new BeautifulSoup object from it.
Example:
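import re
from bs4 import BeautifulSoup

# Lazily match from the first heading to the start of the next <h1>.
m = re.search(r'<h1>Title</h1>.*?<h1>', html, re.DOTALL)
s = m.start()
e = m.end() - len('<h1>')  # exclude the second <h1> tag itself
target_html = html[s:e]
new_bs = BeautifulSoup(target_html, 'html.parser')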
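A possible variation on the regex idea, sketched with the standard library only. Instead of subtracting the length of the second tag, it uses a lookahead so the match stops just before the next h1, and it also accepts a heading that is the last one in the document. The helper name slice_from_heading and the sample string are hypothetical.

```python
import re


def slice_from_heading(html, title):
    """Return the markup from <h1>title</h1> up to, but not including, the next <h1>.

    Returns None when no heading with that title is found.
    """
    pattern = (
        r"<h1>\s*" + re.escape(title) + r"\s*</h1>"  # the starting heading
        r".*?"                                       # as little as possible...
        r"(?=<h1[\s>]|\Z)"                           # ...up to the next <h1 or end of input
    )
    m = re.search(pattern, html, re.DOTALL)
    return m.group(0) if m else None


# Hypothetical sample input mirroring the question's layout.
page = "<h1>Title</h1>\n<p>a</p>\ntext\n<h1> Next Title</h1>\n<p>b</p>"
print(slice_from_heading(page, "Title"))
```

The result can be passed to BeautifulSoup as in the example above. As with any regex over HTML, this assumes the h1 tags are written plainly, without attributes on the starting heading.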