
Retrieve rendered HTML DOM in pure Java

I know there are already some similar questions here, but I do not want to build a browser in Java. I only want to see the fully generated (or "rendered") source code, as if I were looking at the generated DOM in the browser. Does anybody know a tool for that?

I had a look at Cobra and HtmlUnit, but they don't seem to be able to render more complex websites correctly, especially if there are AJAX calls adding content to the site after it has loaded. I really need a tool that does the same as a browser does, without actually displaying the page. Do I have to remote-control a browser in the end?

Does anybody have experience with that?

A very similar question, but without any satisfying answers, can be found here.

morja asked Nov 14 '22 10:11



1 Answer

I don't believe a library exists that scrapes the content added by asynchronous calls after the page has loaded.

My recommendation is:

  1. Get the HTML of a page using Cobra or a similar library.
  2. Parse the source for AJAX requests. (For example, an AJAX call will typically have a URL parameter and a "data" JSON string that you can reuse for your own request.)
  3. For each AJAX call, make another request to the URL parameter you captured.
  4. Append the result from each AJAX call to the source of your HTML from the original page.
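
The four steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested scraper: it uses the JDK's own `java.net.http.HttpClient` (Java 11+) instead of Cobra, it assumes the page makes jQuery-style `$.ajax({ url: "..." })` calls, and it only issues plain GET requests without the "data" payload. The class name `AjaxInliner`, its method names, and the regular expression are all made up for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AjaxInliner {

    // Assumption: AJAX calls look like $.ajax({ ..., url: "/path", ... }).
    // Other libraries (fetch, XMLHttpRequest, axios, ...) need other patterns,
    // and a nested object placed before "url" will defeat this simple regex.
    private static final Pattern AJAX_URL = Pattern.compile(
            "\\$\\.ajax\\s*\\(\\s*\\{[^}]*?\\burl\\s*:\\s*[\"']([^\"']+)[\"']");

    // Step 2: scan the page source for AJAX URLs.
    public static List<String> extractAjaxUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = AJAX_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    // Steps 1 and 3: plain HTTP GET returning the response body as text.
    public static String fetch(HttpClient client, URI uri)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Step 4: append every AJAX response to the original page source.
    // The result is a text dump for inspection, not a well-formed DOM.
    public static String inlineAjaxResponses(String pageUrl)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        URI base = URI.create(pageUrl);
        String html = fetch(client, base);
        StringBuilder result = new StringBuilder(html);
        for (String url : extractAjaxUrls(html)) {
            // Resolve relative URLs such as "/api/items" against the page URL.
            result.append("\n<!-- response from ").append(url).append(" -->\n")
                  .append(fetch(client, base.resolve(url)));
        }
        return result.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            System.out.println(inlineAjaxResponses(args[0]));
            return;
        }
        // Offline demo of step 2 only, so no network access is needed.
        String sample =
                "<script>$.ajax({ type: 'GET', url: '/api/items', data: {page: 1} });</script>";
        System.out.println(extractAjaxUrls(sample)); // prints [/api/items]
    }
}
```

Run without arguments, it only demonstrates the URL extraction on a hard-coded snippet. Passing a page URL as the first argument runs all four steps against that page.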

It's not a perfect solution, and it will not help in scenarios that require the user to trigger an event. Also, your code for capturing the AJAX URLs will differ depending on which JavaScript library the website uses to make its async calls.

Hope that helps.

bsimic answered Nov 16 '22 04:11
