I have a nested table structure. I am using the below code for parsing the data.
for row in table.find_all("tr")[1:][:-1]:
for td in row.find_all("td")[1:]:
dataset = td.get_text()
The problem here is when there are nested tables like in my case there are tables inside <td></td>
so these are parsed again after parsing initially as I am using find_all(tr)
and find_all(td)
. So how can I avoid parsing the nested table as it is parsed already?
Input:
<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>
Expected Output:
1 2
3 4
5
11 22
6
But what I am getting is:
1 2
3 4
5
11 22
11 22
6
That is, the inner table is parsed again.
Specs:
beautifulsoup4==4.6.3
Data order should be preserved and content could be anything including any alphanumeric characters.
Using a combinations of bs4 and re, you might achieve what you want.
I am using bs4 4.6.3
from bs4 import BeautifulSoup as bs
import re
html = '''
<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
ans = []
for x in soup.findAll('td'):
if x.findAll('td'):
for y in re.split('<table>.*</table>', str(x)):
ans += re.findall('\d+', y)
else:
ans.append(x.text)
print(ans)
For each td
we test if this is a nest td
. If so, we split on table and take everything and match with a regex every number.
Note this working only for two depths level, but adaptable to any depths
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With