BeautifulSoup: grab all content of all <ul> after each header

Question

I have a long, long list of headers which are followed by lists:

<h2>Header1</h2>
<ul>
<li>A</li>
<li>B</li>
<li>C</li>
</ul>
<h2>Header2</h2>
<ul>
<li>D</li>
<li>E</li>
<li>F</li>
</ul>

Et cetera. What is the most compact way to grab each list, together with its corresponding header, using BeautifulSoup?

So ideally the result would be a dictionary, looking like:

{
'header1': ['A','B','C'],
'header2': ['D','E','F'],
}

QuantStats · Accepted Answer

You can try this as a starting point and optimize it once you get the idea.

import bs4

txt = '''\
<h2>Header1</h2>
<ul>
<li>A</li>
<li>B</li>
<li>C</li>
</ul>
<h2>Header2</h2>
<ul>
<li>D</li>
<li>E</li>
<li>F</li>
</ul>
'''

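# Name the parser explicitly; omitting it makes bs4 guess and emit a warning.
soup = bs4.BeautifulSoup(txt, 'html.parser')

output = dict()

# Collect the header texts in document order.
key = []

for h2 in soup.find_all('h2'):
  key.append(h2.get_text(strip=True))

# One list of <li> tags per <ul>, also in document order.
vec = [ul.find_all('li') for ul in soup.find_all('ul')]

# Pair the i-th header with the i-th list (assumes one <ul> per <h2>).
for i in range(len(vec)):
  output[key[i]] = []
  for li in vec[i]:
    output[key[i]].append(li.get_text(strip=True))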

print(output)

Output

{'Header1': ['A', 'B', 'C'], 'Header2': ['D', 'E', 'F']}

Edit: shorter and neater code

from bs4 import BeautifulSoup

txt = '''\
<h2>Header1</h2>
<ul>
<li>A</li>
<li>B</li>
<li>C</li>
</ul>
<h2>Header2</h2>
<ul>
<li>D</li>
<li>E</li>
<li>F</li>
</ul>
'''

soup = BeautifulSoup(txt, 'html.parser')
output = dict()
# Pair each <h2> with the <ul> at the same position.
# This assumes exactly one <ul> per header, in the same order.
for header, ul in zip(soup.find_all('h2'), soup.find_all('ul')):
  key = header.get_text(strip=True)
  output[key] = [item.get_text(strip=True) for item in ul.find_all('li')]

print(output)

The output is the same:

{'Header1': ['A', 'B', 'C'], 'Header2': ['D', 'E', 'F']}
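
Both versions above pair headers and lists purely by position, so they quietly misalign if some header has no list. As a hedged alternative (not part of the original answer), here is a minimal sketch that walks from each `<h2>` to its next sibling `<ul>`. The function name `lists_by_header` and the extra `Empty` header in the sample are made up for illustration, and it assumes the headers and lists are siblings under the same parent.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus a hypothetical header with no list.
html_doc = """\
<h2>Header1</h2>
<ul>
<li>A</li>
<li>B</li>
<li>C</li>
</ul>
<h2>Empty</h2>
<h2>Header2</h2>
<ul>
<li>D</li>
<li>E</li>
<li>F</li>
</ul>
"""


def lists_by_header(markup):
    """Map each <h2> text to the <li> texts of the <ul> that follows it."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for h2 in soup.find_all("h2"):
        ul = h2.find_next_sibling("ul")
        # Only accept the <ul> if no other <h2> sits between it and this
        # header; otherwise the list belongs to a later header.
        if ul is not None and ul.find_previous_sibling("h2") is h2:
            result[h2.get_text(strip=True)] = [
                li.get_text(strip=True) for li in ul.find_all("li")
            ]
        else:
            result[h2.get_text(strip=True)] = []
    return result


result = lists_by_header(html_doc)
print(result)
```

With this sample, the `Empty` header should map to an empty list instead of stealing the list that belongs to `Header2`. If you want lowercase keys as in the question, calling `.lower()` on the header text would do it.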


Tags: python, beautifulsoup

David Pekker
