Use BeautifulSoup to find a table after a header?

Question

I am trying to scrape some data off a website. The data I want is in a table, but the page contains multiple tables and none of them has an ID. My idea was to find the header just above the table I am looking for and use it as an anchor.

I have been stuck on this for a while, so as a last resort I wanted to ask whether anyone knows how to use BeautifulSoup to find the table. A snippet of the HTML is provided below. Thanks in advance :)

The table I am interested in is the one directly below <h2>Mine neaste vagter</h2>:

        <h2>Min aktuelle vagt</h2>
        
        
            <div>
                <a href='/shifts/detail/595212/'>Flere detaljer</a>
            <p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
            <p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>

            

            

            
                <h2>Masker</h2>
                <table class='list'>
                    <tr><th>Type</th><th>Fra</th><th>&nbsp;</th><th>Til</th></tr>
                    
                    <tr>
                        <td>Fri egen regningD</td>
                        <td>07:00</td>
                        <td>&nbsp;-&nbsp;</td>
                        <td>11:00</td>
                    </tr>
                    
                </table>
            
            </div>
        
    <hr>
    
    
    
    
    
    


    
    
    
    
    
    




    
        <h2>Mine neaste vagter</h2>
        <table class='list'>
            <tr>
                <th class="alignleft">Dato</th>
                <th class="alignleft">Rolle</th>
                <th class="alignleft">Tidsrum</th>
                <th></th>
                <th class="alignleft">Bytte</th>
                <th class="alignleft" colspan='2'></th>
            </tr>
            
                <tr class="rowA separator">
                    
                        <td>
                            <h3>12/6</h3>
                        </td>
                    
                    <td>Kundeservice</td>
                    <td>18:00 &rarr; 21:30 (3.5 t)</td>
                    <td style="max-width: 20em;"></td>

                    <td>
                      
                        <a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
                          Byt denne vagt
                        </a>
                      
                    </td>
                    
                    <td><a href="/shifts/detail/595390/">Detaljer</td>
                      
                      <td>
                        
                          &nbsp;
                        
                    </td>
                </tr>

MendelG · Accepted Answer

Here are two approaches to find the correct <table>:

1. Since the table you want is the last one in the HTML, you can use find_all() and index the result with [-1] to get the last table:

print(soup.find_all("table", class_="list")[-1])

2. Find the h2 element by its text, then use the find_next() method to find the table:

print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
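Both approaches can be tried side by side on a small sample. The sketch below is only an illustration: the HTML is a trimmed-down stand-in for the page in the question, the `find_table_after_heading` helper and the `rows` extraction are hypothetical additions (not part of the answer above), and `html.parser` is used just to avoid needing lxml.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (an assumption, not the full HTML).
html = """
<h2>Min aktuelle vagt</h2>
<div>
  <h2>Masker</h2>
  <table class="list">
    <tr><th>Type</th><th>Fra</th></tr>
    <tr><td>Fri egen regning</td><td>07:00</td></tr>
  </table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class="list">
  <tr><th>Dato</th><th>Rolle</th><th>Tidsrum</th></tr>
  <tr><td><h3>12/6</h3></td><td>Kundeservice</td><td>18:00 - 21:30 (3.5 t)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach 1: rely on position and take the last table with class "list".
last_table = soup.find_all("table", class_="list")[-1]


def find_table_after_heading(soup, heading_text):
    """Hypothetical helper: first <table> after the <h2> containing heading_text, or None."""
    heading = soup.find(lambda tag: tag.name == "h2" and heading_text in tag.text)
    if heading is None:
        return None
    return heading.find_next("table")


# Approach 2: locate the heading by its text, then move to the next table.
table = find_table_after_heading(soup, "Mine neaste vagter")

# Flatten the table into a list of rows, each a list of cell texts.
rows = []
if table is not None:
    for tr in table.find_all("tr"):
        rows.append([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])

print(rows)
```

The first approach breaks if the page ever gains another table after the one you want, while the second keeps working as long as the heading text stays the same, which is why the `None` check on the heading is worth keeping.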

QHarr · Answer

You can use :-soup-contains (or :contains in older versions; that spelling is deprecated in newer Soup Sieve releases) to target the <h2> by its text, and then use find_next to move to the table:

from bs4 import BeautifulSoup as bs

html = '''your html'''
soup = bs(html, 'lxml')
print(soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table'))

This assumes that the HTML shown is what your access method actually returns.
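If the heading is not in the returned HTML, select_one returns None and the chained find_next call raises an AttributeError. A minimal sketch of a guarded version follows; the sample HTML and the `find_table_after` helper are assumptions made for illustration, `html.parser` stands in for lxml, and `:-soup-contains` assumes a reasonably recent Soup Sieve. The last line shows a CSS-only variant using the adjacent sibling combinator, which only matches when the table is the very next sibling element of the heading.

```python
from bs4 import BeautifulSoup

# Minimal sample with two heading/table pairs (an assumption, not the real page).
html = """
<h2>Masker</h2>
<table class="list"><tr><td>07:00</td></tr></table>
<h2>Mine neaste vagter</h2>
<table class="list"><tr><td>Kundeservice</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")


def find_table_after(soup, heading_text):
    """Hypothetical helper: first <table> after the matching <h2>, or None if no match."""
    # heading_text is interpolated as-is, so it must not contain double quotes.
    heading = soup.select_one(f'h2:-soup-contains("{heading_text}")')
    if heading is None:
        return None
    return heading.find_next("table")


table = find_table_after(soup, "Mine neaste vagter")
missing = find_table_after(soup, "No such heading")

# CSS-only variant: the <table> that immediately follows the heading as a sibling.
sibling_table = soup.select_one('h2:-soup-contains("Mine neaste vagter") + table')

print(table)
print(missing)
```

A `None` result from the helper is a useful hint that the page content may be loaded differently from what the browser shows.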

Tags:

python

python-3.x

beautifulsoup

web-scraping

ThomasHoej
