
Using BeautifulSoup to grab all the HTML between two tags

I have some HTML that looks like this:

<h1>Title</h1>

<!-- an arbitrary number of p/ul elements or bare text with no tag -->

<h1> Next Title</h1>

I want to copy all of the HTML from the first h1 up to the next h1. How can I do this?

asked Sep 06 '25 03:09 by passsky


2 Answers

This is the straightforward BeautifulSoup way, which works when the second h1 tag is a sibling of the first:

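# Assumes `soup` is an existing BeautifulSoup object (Python 3, bs4).
html = ""
for node in soup.find("h1").next_siblings:
    # Stop at the next <h1>; bare text nodes have no tag name.
    if getattr(node, "name", None) == "h1":
        break
    html += str(node)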
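
For reference, here is a minimal self-contained sketch of the same sibling walk, run against markup shaped like the question's. The function name `html_after_heading` and the sample string are made up for illustration, and it assumes Python 3 with the `bs4` package and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup


def html_after_heading(soup, tag_name="h1"):
    """Return the markup between the first `tag_name` element and the next one.

    Hypothetical helper: only works when both headings are siblings.
    """
    first = soup.find(tag_name)
    if first is None:
        return ""
    parts = []
    for node in first.next_siblings:
        # Text nodes have no tag name, so fall back to None for them.
        if getattr(node, "name", None) == tag_name:
            break
        parts.append(str(node))
    return "".join(parts)


# Sample input mirroring the question: tags and bare text between two h1s.
sample = (
    "<h1>Title</h1><p>First</p>loose text"
    "<ul><li>Item</li></ul><h1>Next Title</h1><p>Later</p>"
)
soup = BeautifulSoup(sample, "html.parser")
between = html_after_heading(soup)
print(between)
```

As written, the result excludes the first heading itself. If the heading is wanted too, prepending `str(soup.find("h1"))` to the result should do it.
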
answered Sep 07 '25 20:09 by Hunting


I had the same problem. There may be a better solution, but what I did was use a regular expression to find the indices of the two nodes I'm looking for. With those, I extract the HTML between the two indices and create a new BeautifulSoup object from it.

Example:

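import re
from bs4 import BeautifulSoup

# Match lazily from the first heading up to the next opening <h1>.
m = re.search(r'<h1>Title</h1>.*?<h1>', html, re.DOTALL)
s = m.start()
e = m.end() - len('<h1>')  # leave out the next heading's opening tag
target_html = html[s:e]
new_bs = BeautifulSoup(target_html, 'html.parser')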
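
A slightly more general variant of the regex slicing might look like the sketch below. The helper name `slice_from_heading` is hypothetical, and two things differ from the snippet above and should be read as assumptions: the title is passed through `re.escape`, and a lookahead stops the match before the next opening tag (or at the end of the string), so no length arithmetic is needed. It uses only the standard library, and the returned string can then be handed to BeautifulSoup as before:

```python
import re


def slice_from_heading(html, title, tag="h1"):
    """Return the source from the heading with `title` up to the next `tag`.

    Hypothetical helper; returns None when no such heading is found.
    """
    pattern = r"<{0}>\s*{1}\s*</{0}>.*?(?=<{0}[\s>]|\Z)".format(
        tag, re.escape(title)
    )
    m = re.search(pattern, html, re.DOTALL)
    return m.group(0) if m else None


# Sample input shaped like the question's markup.
html = "<h1>Title</h1>\n<p>First</p>\nloose text\n<h1> Next Title</h1>\n<p>Later</p>"

first_section = slice_from_heading(html, "Title")
last_section = slice_from_heading(html, "Next Title")
missing = slice_from_heading(html, "No Such Title")
print(first_section)
```

Like any regex over HTML, this depends on the headings being written exactly as `<h1>` with no attributes, so it is best treated as a quick workaround rather than a robust parser.
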
answered Sep 07 '25 21:09 by maltman