 

BeautifulSoup: How to get href links from a pseudo-element/class

I am trying to parse https://www.tandfonline.com/toc/icbi20/current for the titles of all articles. The HTML is divided into Volumes and Issues, and each Volume has one Issue per month, so Volume 36 has 12 Issues. The current Volume (37) has 4 Issues so far, and I would like to step through each Issue and get each article's name.

[screenshot: the journal page's Volume/Issue navigation]

To accomplish this and automate the search, I need to fetch the href links for each Issue. Initially I chose the parent div's id (id='tocList'):

import requests
from bs4 import BeautifulSoup, SoupStrainer

# Fetch the journal's table-of-contents page
chronobiology = requests.get("https://www.tandfonline.com/toc/icbi20/current")
chrono_coverpage = chronobiology.content

# Restrict parsing to the element with id='tocList'
issues = SoupStrainer(id='tocList')
issues_soup = BeautifulSoup(chrono_coverpage, 'html.parser', parse_only=issues)
for issue in issues_soup:
    print(issue)

This returns a bs4 object, but only with the href links from the Volume div. What's worse is that this div should encompass both the Volume div and the Issue div.
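For reference, a quick way to see exactly which links the strainer captured (a small diagnostic sketch built on the soup above):

# List every href the strainer kept; per the behavior described above,
# only Volume-level links show up here, with no Issue links.
for a in issues_soup.find_all('a', href=True):
    print(a['href'])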

So I decided to reduce my search space and make it more specific, choosing the div containing the Issue href links (class_='issues'). [screenshot: the div with class='issues' in the inspector]

This time Jupyter thinks for a bit but returns NOTHING. Just blank. Zippo. Yet if I ask what type of "nothing" has been returned, Jupyter reports that it is a string. I just don't know what to make of this. [screenshot: the empty output and its type in the notebook]
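A quick check makes the "string" result less mysterious (a diagnostic sketch; class_='issues' is taken from the screenshot). If the strainer matches no tags, the resulting soup likely contains only stray whitespace text nodes, which BeautifulSoup represents as NavigableString, a str subclass:

from bs4 import BeautifulSoup, SoupStrainer

issues = SoupStrainer(class_='issues')
issues_soup = BeautifulSoup(chrono_coverpage, 'html.parser', parse_only=issues)

print(issues_soup.find_all(True))  # [] -> no tags matched the strainer
for node in issues_soup:
    # Any leftover nodes are NavigableString (a str subclass), which would
    # explain why the "nothing" reports its type as a string.
    print(repr(node), type(node))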

So, my first question: why does the Issue div not respond to the parsing? The same thing happens when I run print(BeautifulSoup(chrono_coverpage, 'html.parser').prettify()): the Issue div does not appear, even though Inspect Element on the HTML page shows it immediately beneath the final Volume span:

[screenshot: the prettified HTML output with the Issue div missing]

So I suspect that it must be JavaScript-driven rather than plain HTML. Or maybe the class='open' has something to do with it.
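One way to test that suspicion (a diagnostic sketch; the URL fragment is copied from an Issue link visible in the inspector) is to search the raw response for one of the Issue hrefs:

import requests

chronobiology = requests.get("https://www.tandfonline.com/toc/icbi20/current")

# True would mean the Issue links ship in the static HTML;
# False suggests JavaScript injects them after the page loads.
print('/toc/icbi20/37/4?nav=tocList' in chronobiology.text)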

Any clarification would be kindly appreciated. Also, how would one parse JavaScript-generated links to get them?

Pablo Rodriguez asked Sep 14 '25


1 Answer

Okay, so I've "resolved" the issue though I need to fill in some theoretical gaps:

Firstly, this snippet holds the key to the start of the solution:

[screenshot: the inspector showing the ::before pseudo-element inside <div class="container">, followed by the div holding the Issue links]

As can be seen, the <div class="container"> is immediately followed by a ::before pseudo-element, and the links I am interested in are contained inside a div immediately beneath this pseudo-element. That div is then closed off by the ::after pseudo-element.

Firstly, I realized that my problem was that I needed to select a pseudo-element. I found this to be quite impossible with BeautifulSoup's soup.select(), since BeautifulSoup uses Soup Sieve, which "aims to allow users to target XML/HTML elements with CSS selectors. It implements many pseudo-classes [...]."

The last part of the paragraph states:

"Soup Sieve also will not match anything for pseudo classes that are only relevant in a live, browser environment, but it will gracefully handle them if they've been implemented;"

So this got me thinking: I had no idea what "pseudo classes that are only relevant in a live browser environment" means. But the docs also say that, had they been implemented, BS4 should be able to handle them. And since I can definitely see the div elements containing my href links of interest using the Inspect tool, I thought they must be implemented.
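To make that quoted sentence concrete, here is a minimal sketch of how Soup Sieve treats a live-only pseudo-class such as :hover (it parses fine but can never match in a static document):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="#">link</a>', 'html.parser')

# :hover is only meaningful in a live browser, so Soup Sieve
# gracefully matches nothing rather than raising an error.
print(soup.select('a:hover'))  # []

# Pseudo-elements such as ::before are different: they are generated by CSS
# and never become part of the DOM, so no parser can select them.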

The first part of that phrase got me thinking: "But do I need a live browser for this to work?"

So that brought me to Selenium's web driver:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")
chronobiology_content = driver.page_source
chronobiology_soup = BeautifulSoup(chronobiology_content, 'html.parser')
chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
[Out]: []

Clearly this result made me sad, because I thought I had understood what was going on. But then I thought that if I clicked one of the Issues in the still-open browser, it might work (to be honest, desperation led me to that thought).

Well, surprise surprise: it worked. After clicking on "Issue 4" and re-running the script, I got what I was looking for:

[screenshot: the select call returning the Issue links after clicking "Issue 4"]

UNANSWERED QUESTIONS

1 - Apparently these pseudo-elements only "exist" once they have been interacted with; otherwise the code doesn't recognize that they are there. Why?

2 - What code must be run to perform an initial click and activate these pseudo-elements, so the code can automatically open these links and parse the information I want (the titles of the articles)?

UPDATE

Question 2 is answered using Selenium's ActionChains:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")

# Hover over the issues container so the page's JavaScript renders the links
action = ActionChains(driver)
action.move_to_element(driver.find_element(
    By.XPATH, '//*[@id="tocList"]/div/div/div[3]/div[2]/div')).perform()

# Parse the page source only after the hover, once the links are in the DOM
chronobiology_soup = BeautifulSoup(driver.page_source, 'html.parser')

chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
[Out]: 
[<div class="loi-issues-scroller">
 <a class="open" href="/toc/icbi20/37/4?nav=tocList">Issue<span>4</span></a>
 <a class="" href="/toc/icbi20/37/3?nav=tocList">Issue<span>3</span></a>
 <a class="" href="/toc/icbi20/37/2?nav=tocList">Issue<span>2</span></a>
 <a class="" href="/toc/icbi20/37/1?nav=tocList">Issue<span>1</span></a>
 </div>]

The only downside is that one must stay on the page so that Selenium's ActionChains.perform() can actually hover over and click the element, but at least this step is now automated.
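For completeness, here is a fuller sketch that automates the whole pipeline: hover to reveal the Issue links, visit each one, and collect the article titles. The CSS selectors, and in particular '.art_title' for the titles, are assumptions based on the markup shown above; verify them in the inspector before relying on this:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")
wait = WebDriverWait(driver, 10)

# Hover over the issues container so the site's JavaScript renders the links,
# then wait until at least one Issue link actually exists in the DOM.
scroller = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '#tocList div.issues')))
ActionChains(driver).move_to_element(scroller).perform()
links = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, '#tocList div.issues a')))
issue_urls = [a.get_attribute('href') for a in links]

for url in issue_urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # '.art_title' is an assumed selector for article titles -- check it.
    for title in soup.select('.art_title'):
        print(title.get_text(strip=True))

driver.quit()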

If someone could answer question 1, that would be great.

Pablo Rodriguez answered Sep 17 '25