
JSoup doesn't load the whole HTML [duplicate]

I want to scrape a website, but when I connect to it using Jsoup.connect(url), only part of the page is loaded.

When I saved the page as HTML, I saw that one part of the page contains only a loader icon, so I concluded that this part is loaded afterwards from some other source.

The funny thing is that "Inspect element" shows the missing HTML, while "View page source" doesn't. The HTML loaded by jsoup is basically the same as what "View page source" shows.

Is there a way to get around this and load the whole page as it is displayed in the browser?

The page in question is this: https://www.oddsportal.com/tennis/australia/atp-australian-open-2017/results/page/1/

Let me know if I can provide any additional information.

===============

EDIT: I am connecting to the URL like this:

Document doc = null;

try {
    doc =  Jsoup.connect(url).get();
} catch (IOException e) {
    e.printStackTrace();
}

I select this div using a CSS selector:

Elements tournamentTable = doc.select("div[id=tournamentTable]");

The content of tournamentTable is <div id="tournamentTable"></div>

wdc asked Oct 23 '25 19:10


1 Answer

It seems the content of the element with id=tournamentTable is generated dynamically by JavaScript. jsoup only parses the HTML the server returns and does not execute JavaScript, so you'd have to use a library like HtmlUnit that does. For example:

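// HtmlUnit 2.x packages; HtmlUnit 3.x uses org.htmlunit.* instead
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true); // enable JavaScript
webClient.getOptions().setThrowExceptionOnScriptError(false); // keep going even if a script throws
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(5000); // important! call after getPage: waits up to 5 s for background scripts

DomElement tournamentTable = page.getElementById("tournamentTable");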
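
If you want to confirm the diagnosis before pulling in a headless browser, you can check whether the HTML the server returns holds only an empty placeholder for the element. The following is a minimal sketch using only the JDK; the class and method names (PlaceholderCheck, isEmptyPlaceholder) are made up for illustration and are not part of jsoup or HtmlUnit, and the regex is a rough diagnostic, not a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or HtmlUnit: a quick diagnostic only.
final class PlaceholderCheck {

    // Returns true when rawHtml contains <div id="..."> for the given id with
    // nothing but whitespace before its closing tag, i.e. an empty placeholder
    // that a script is expected to fill in later.
    static boolean isEmptyPlaceholder(String rawHtml, String id) {
        Pattern empty = Pattern.compile(
                "<div\\b[^>]*\\bid\\s*=\\s*[\"']" + Pattern.quote(id) + "[\"'][^>]*>\\s*</div>",
                Pattern.CASE_INSENSITIVE);
        return empty.matcher(rawHtml).find();
    }

    public static void main(String[] args) {
        // What the question reports for the served HTML
        String served = "<html><body><div id=\"tournamentTable\"></div></body></html>";
        // Made-up stand-in for what the browser shows after scripts have run
        String rendered = "<html><body><div id=\"tournamentTable\">"
                + "<table><tr><td>row</td></tr></table></div></body></html>";

        System.out.println(isEmptyPlaceholder(served, "tournamentTable"));   // true
        System.out.println(isEmptyPlaceholder(rendered, "tournamentTable")); // false
    }
}
```

If the check returns true for the string you get from Jsoup.connect(url).get().outerHtml(), the content is most likely filled in by a script, and a JavaScript-capable client such as HtmlUnit is needed.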
Krzysztof Atłasik answered Oct 26 '25 08:10