
Use BeautifulSoup to find a table after a header?

I am trying to scrape some data off a website. The data I want is in a table, but the page contains multiple tables and none of them has an ID. My idea was to find the header just above the table I am looking for and use it as an anchor.

This has really troubled me, so as a last resort I wanted to ask whether someone knows how to use BeautifulSoup to find the table. A snippet of the HTML is provided below. Thanks in advance :)

The table I am interested in is the one right beneath <h2>Mine neaste vagter</h2>:

        <h2>Min aktuelle vagt</h2>
        
        
            <div>
                <a href='/shifts/detail/595212/'>Flere detaljer</a>
            <p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
            <p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>

            

            

            
                <h2>Masker</h2>
                <table class='list'>
                    <tr><th>Type</th><th>Fra</th><th>&nbsp;</th><th>Til</th></tr>
                    
                    <tr>
                        <td>Fri egen regningD</td>
                        <td>07:00</td>
                        <td>&nbsp;-&nbsp;</td>
                        <td>11:00</td>
                    </tr>
                    
                </table>
            
            </div>
        
    <hr>

        <h2>Mine neaste vagter</h2>
        <table class='list'>
            <tr>
                <th class="alignleft">Dato</th>
                <th class="alignleft">Rolle</th>
                <th class="alignleft">Tidsrum</th>
                <th></th>
                <th class="alignleft">Bytte</th>
                <th class="alignleft" colspan='2'></th>
            </tr>
            
                <tr class="rowA separator">
                    
                        <td>
                            <h3>12/6</h3>
                        </td>
                    
                    <td>Kundeservice</td>
                    <td>18:00 &rarr; 21:30 (3.5 t)</td>
                    <td style="max-width: 20em;"></td>

                    <td>
                      
                        <a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
                          Byt denne vagt
                        </a>
                      
                    </td>
                    
                    <td><a href="/shifts/detail/595390/">Detaljer</td>
                      
                      <td>
                        
                          &nbsp;
                        
                    </td>
                </tr>

ThomasHoej asked Oct 22 '25 16:10


2 Answers

Here are two approaches to find the correct <table>:

  1. Since the table you want is the last one in the HTML shown, you can call find_all() and index the result with [-1] to get the last table:

    print(soup.find_all("table", class_="list")[-1])

  2. Find the h2 element by its text, and then use the find_next() method to find the table that follows it:

    print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
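
To see both approaches side by side, here is a minimal sketch run against a shortened stand-in for the HTML in the question. The `html` string is an assumption (a trimmed copy of the page with the same layout), and `html.parser` is used only so the example needs no extra parser package.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (assumption: same layout).
html = """
<h2>Masker</h2>
<table class='list'>
  <tr><th>Type</th><th>Fra</th></tr>
  <tr><td>Fri egen regningD</td><td>07:00</td></tr>
</table>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
  <tr><th>Dato</th><th>Rolle</th><th>Tidsrum</th></tr>
  <tr><td><h3>12/6</h3></td><td>Kundeservice</td><td>18:00 - 21:30</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach 1: take the last table with class "list".
last_table = soup.find_all("table", class_="list")[-1]

# Approach 2: locate the <h2> by its text, then take the next <table>
# in document order.
header = soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text)
next_table = header.find_next("table")

# Read the header cells of the table found by approach 2 as plain text.
columns = [th.get_text(strip=True) for th in next_table.find_all("th")]
print(last_table is next_table, columns)
```

Approach 1 only works while the wanted table stays last on the page; approach 2 keeps working if more tables are added after it, as long as the heading text does not change.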

MendelG answered Oct 25 '25 05:10


You can use :-soup-contains (or the older :contains alias, which is deprecated in recent Soup Sieve versions) to target the <h2> by its text, and then use find_next to move to the table:

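    from bs4 import BeautifulSoup as bs

    html = '''your html'''  # placeholder: the page source you fetched
    soup = bs(html, 'lxml')  # 'lxml' needs the lxml package installed
    # Select the <h2> by its text, then move to the next <table> in the document
    table = soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
    print(table)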

This assumes that the HTML shown is what you actually receive from whatever access method you are using.
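
If the heading is not in the response, `select_one` returns `None` and the chained `find_next` call raises an `AttributeError`. Below is a minimal sketch of a guarded variant that also reads the table rows into lists. The helper name `table_after_heading` and the sample `html` are hypothetical, made up for illustration; `html.parser` is used to avoid the lxml dependency, and `:-soup-contains` needs a reasonably recent Soup Sieve.

```python
from bs4 import BeautifulSoup


def table_after_heading(soup, heading_text):
    """Return the rows (lists of cell text) of the first table after a
    matching <h2>, or an empty list if the heading or table is missing."""
    heading = soup.select_one(f'h2:-soup-contains("{heading_text}")')
    if heading is None:
        return []
    table = heading.find_next("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Collect header and data cells alike, stripped of whitespace.
        cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical sample input with the same shape as the page in the question.
html = """
<h2>Masker</h2>
<table class='list'><tr><th>Type</th></tr><tr><td>Fri</td></tr></table>
<h2>Mine neaste vagter</h2>
<table class='list'>
  <tr><th>Dato</th><th>Rolle</th></tr>
  <tr><td><h3>12/6</h3></td><td>Kundeservice</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = table_after_heading(soup, "Mine neaste vagter")
print(rows)
```

Returning an empty list instead of raising is one possible design choice; raising a descriptive error may suit a scraper better if a missing table should stop the run.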

QHarr answered Oct 25 '25 07:10