I'm trying to use jsoup to log in to a site and then scrape information, but I'm running into a problem. I can log in successfully and create a Document from index.php, but I cannot get other pages on the site. I know I need to save a cookie after I post and then send it when I open another page on the site. But how do I do this? The following code lets me log in and get index.php:
    Document doc = Jsoup.connect("http://www.example.com/login.php")
            .data("username", "myUsername",
                  "password", "myPassword")
            .post();

I know I can use Apache HttpClient to do this but I don't want to.
What It Is. jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
Jsoup parses the source code as delivered from the server (or in this case loaded from file). It does not invoke client-side actions such as JavaScript or CSS DOM manipulation.
When you log in to the site, it is probably setting an authorised session cookie that needs to be sent on subsequent requests to maintain the session.
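To make that exchange concrete, here is a minimal sketch of what happens at the HTTP level, using only the JDK's java.net.HttpCookie rather than jsoup. The class name SessionCookieSketch, the cookie name SESSIONID, and the value abc123 are made up for illustration; a real site will use its own names, and jsoup does this bookkeeping for you.

```java
import java.net.HttpCookie;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only; not part of jsoup.
class SessionCookieSketch {

    // Turn the value of a Set-Cookie response header into name -> value pairs.
    // Attributes such as Path or HttpOnly are parsed but not kept here.
    static Map<String, String> parseSetCookie(String setCookieValue) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (HttpCookie cookie : HttpCookie.parse(setCookieValue)) {
            cookies.put(cookie.getName(), cookie.getValue());
        }
        return cookies;
    }

    // Build the value of the Cookie request header for the next request.
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // Assumed example of what a login response might send back.
        Map<String, String> cookies = parseSetCookie("SESSIONID=abc123; Path=/; HttpOnly");
        System.out.println("Cookie: " + toCookieHeader(cookies));
    }
}
```

If the second request does not carry that Cookie header, the server has no way to tie it to the login, which is why the other pages come back as if you were logged out.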
You can get the cookie like this:
    Connection.Response res = Jsoup.connect("http://www.example.com/login.php")
        .data("username", "myUsername", "password", "myPassword")
        .method(Connection.Method.POST)
        .execute();

    Document doc = res.parse();
    String sessionId = res.cookie("SESSIONID"); // you will need to check what the right cookie name is

And then send it on the next request like:
    Document doc2 = Jsoup.connect("http://www.example.com/otherPage")
        .cookie("SESSIONID", sessionId)
        .get();

    // This will get you the response.
    Connection.Response res = Jsoup
        .connect("loginPageUrl")
        .data("loginField", "[email protected]", "passField", "pass1234")
        .method(Connection.Method.POST)
        .execute();

    // This will get you the cookies.
    Map<String, String> loginCookies = res.cookies();

    // And this is the easiest way I've found to remain in session.
    Document doc = Jsoup.connect("urlYouNeedToBeLoggedInToAccess")
        .cookies(loginCookies)
        .get();
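If the site sets further cookies on later pages, it may help to keep one map for the whole session and merge each response's cookies into it. The following is only a sketch of that idea in plain Java; the SessionCookies class and its method names are hypothetical, and the jsoup calls in the comments are the same res.cookies() and .cookies(...) calls used above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for the cookies of one logged-in session; not part of jsoup.
class SessionCookies {
    private final Map<String, String> cookies = new LinkedHashMap<>();

    // Merge the cookies from a response; a newer value replaces an older one.
    // With jsoup this would be called as: session.update(res.cookies());
    void update(Map<String, String> responseCookies) {
        cookies.putAll(responseCookies);
    }

    // Copy of the current cookies to send with the next request.
    // With jsoup: Jsoup.connect(url).cookies(session.asMap()).get();
    Map<String, String> asMap() {
        return new LinkedHashMap<>(cookies);
    }

    public static void main(String[] args) {
        SessionCookies session = new SessionCookies();
        session.update(Map.of("SESSIONID", "abc123")); // assumed cookie from the login response
        session.update(Map.of("csrf", "t0k3n"));       // assumed cookie from a later page
        System.out.println(session.asMap().keySet());
    }
}
```

Calling update after every execute() keeps the map current even when the server rotates the session id partway through.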