I have a long, long list of headers which are followed by lists:
<h2>Header1</h2>
<ul>
<li>A</li>
<li>B</li>
<li>C</li>
</ul>
<h2>Header2</h2>
<ul>
<li>D</li>
<li>E</li>
<li>F</li>
</ul>
Et cetera. What is the most compact way to grab each header together with the list that follows it, using BeautifulSoup?
So ideally the result would be a dictionary, looking like:
{
'header1': ['A','B','C'],
'header2': ['D','E','F'],
}
You can try this as a start and optimize once you get the idea.
import bs4
txt = '''\
<h2>Header1</h2>
<ul>
<li>A</li>
<li>B</li>
<li>C</li>
</ul>
<h2>Header2</h2>
<ul>
<li>D</li>
<li>E</li>
<li>F</li>
</ul>
'''
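# Name the parser explicitly so bs4 does not have to guess one
soup = bs4.BeautifulSoup(txt, 'html.parser')
output = dict()
key = []
# Collect the text of every <h2>; these become the dictionary keys
for h in soup.find_all('h2'):
    key.append(h.get_text())
# One list of <li> tags per <ul>, in document order
vec = [j.find_all('li') for j in soup.find_all('ul')]
# Pair the i-th header with the i-th list
for i in range(len(vec)):
    output[key[i]] = []
    for j in vec[i]:
        output[key[i]].append(j.get_text())
print(output)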
Output
{'Header1': ['A', 'B', 'C'], 'Header2': ['D', 'E', 'F']}
Edited: Shorter and neater code
from bs4 import BeautifulSoup
txt = '''\
<h2>Header1</h2>
<ul>
<li>A</li>
<li>B</li>
<li>C</li>
</ul>
<h2>Header2</h2>
<ul>
<li>D</li>
<li>E</li>
<li>F</li>
</ul>
'''
soup = BeautifulSoup(txt, 'html.parser')
output = dict()
headers = soup.find_all('h2')
lists = soup.find_all('ul')  # query once instead of on every iteration
# Pair each header with the list at the same position
for header, ul in zip(headers, lists):
    key = header.get_text()
    output[key] = [item.get_text() for item in ul.find_all('li')]
print(output)
The output is the same:
{'Header1': ['A', 'B', 'C'], 'Header2': ['D', 'E', 'F']}
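Both versions above pair headers and lists by position, which assumes every h2 has exactly one ul. As a hedged alternative, here is a minimal sketch that starts from each h2 and looks for the next ul sibling with find_next_sibling. The names html and lists_by_header are made up for this illustration. It assumes the headers and lists are siblings in the markup, as in the sample; note that a header with no list of its own would pick up a later header's list.

```python
from bs4 import BeautifulSoup

html = """\
<h2>Header1</h2>
<ul>
<li>A</li>
<li>B</li>
<li>C</li>
</ul>
<h2>Header2</h2>
<ul>
<li>D</li>
<li>E</li>
<li>F</li>
</ul>
"""


def lists_by_header(soup):
    """Map the text of each <h2> to the item texts of the next <ul> sibling."""
    result = {}
    for h2 in soup.find_all("h2"):
        # find_next_sibling skips the whitespace text nodes between the tags
        ul = h2.find_next_sibling("ul")
        if ul is None:
            continue  # no list follows this header at all
        result[h2.get_text(strip=True)] = [
            li.get_text(strip=True) for li in ul.find_all("li")
        ]
    return result


soup = BeautifulSoup(html, "html.parser")
output = lists_by_header(soup)
print(output)
```

If lowercase keys are wanted, as in the dictionary shown in the question, call .lower() on the header text before using it as the key.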