I'm using BeautifulSoup 4 with Python 3.7. I have the following HTML ...
<tr>
<td class="info"><div class="title">...</div></td>
</tr>
<tr class="ls">
<td colspan="3">Less similar results</td>
</tr>
<tr>
<td class="info"><div class="title">...</div></td>
</tr>
I would like to extract the DIVs with class="title", but only the ones that occur before the table row whose TD text is "Less similar results". Right now I have this:
elts = soup.find("td", class_="info").find_all("div", class_="title")
But this returns all DIVs with that class, including ones that occur after the element I want to screen for. How do I refine my search to include only results before that particular TD?
You can use the CSS selector tr:not(tr:has(td:contains("Less similar results")) ~ *) div.title:
data = '''<tr>
<td class="info"><div class="title">THIS YOU WANT ...</div></td>
</tr>
<tr class="ls">
<td colspan="3">Less similar results</td>
</tr>
<tr>
<td class="info"><div class="title">THIS YOU DON'T WANT ...</div></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select('tr:not(tr:has(td:contains("Less similar results")) ~ *) div.title'))
Prints:
[<div class="title">THIS YOU WANT ...</div>]
What does it mean?
tr:not(tr:has(td:contains("Less similar results")) ~ *) div.title
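It selects every <div> with class title inside a <tr> that is not a following sibling of the <tr> containing a <td> with the text "Less similar results". In other words, every row after the marker row is excluded, so only the titles before it remain. Note that Soup Sieve 2.1 and later spell :contains() as :-soup-contains() and emit a deprecation warning for the old name.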
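If the selector feels too dense, the same "everything before the marker row" idea can be written without CSS. The following is only a sketch under a few assumptions: the rows are wrapped in a <table> (added here so the snippet parses the same way with any parser), the marker cell's text is exactly "Less similar results", and the helper names titles_before_marker_loop and titles_before_marker_previous are made up for illustration. It uses the built-in html.parser instead of lxml so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the <table> wrapper is an assumption.
html = """<table>
<tr><td class="info"><div class="title">THIS YOU WANT ...</div></td></tr>
<tr class="ls"><td colspan="3">Less similar results</td></tr>
<tr><td class="info"><div class="title">THIS YOU DON'T WANT ...</div></td></tr>
</table>"""

MARKER = "Less similar results"


def titles_before_marker_loop(soup, marker_text):
    """Walk the rows in document order and stop at the marker row."""
    titles = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if any(td.get_text(strip=True) == marker_text for td in cells):
            break  # everything from here on is "less similar"
        titles.extend(row.find_all("div", class_="title"))
    return titles


def titles_before_marker_previous(soup, marker_text):
    """Locate the marker cell, then collect matching divs that precede it."""
    marker = next(
        (td for td in soup.find_all("td")
         if td.get_text(strip=True) == marker_text),
        None,
    )
    if marker is None:
        # No marker in the document: nothing to cut off, keep every title.
        return soup.find_all("div", class_="title")
    # find_all_previous walks backwards, so reverse to restore document order.
    return list(reversed(marker.find_all_previous("div", class_="title")))


soup = BeautifulSoup(html, "html.parser")
loop_result = [div.get_text() for div in titles_before_marker_loop(soup, MARKER)]
prev_result = [div.get_text() for div in titles_before_marker_previous(soup, MARKER)]
print(loop_result)
print(prev_result)
```

Both helpers return only the first title for this sample. The loop version is probably easier to adapt if the cut-off condition becomes more complicated, while the find_all_previous version avoids iterating over rows that come after the marker.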
Further reading:
CSS Selector Reference