
Javascript: Download and interact with another page

I want to do some basic scripting, and I'm trying to do it in JavaScript. Basically, I want to download a Wikiquote page and scrape it.

What's the best way to do this? How do I get the page? I tried to do it via jQuery.get()

$.get('http://en.wikiquote.org/wiki/Last_words', function(data) { console.log(data); })

But the log is simply some error object and the console displays

XMLHttpRequest cannot load http://en.wikiquote.org/wiki/Last_words. Origin null is not allowed by Access-Control-Allow-Origin. en.wikiquote.org/wiki/Last_words

GET http://en.wikiquote.org/wiki/Last_words undefined (undefined)

So I guess I'm not taking the correct approach. What should I be doing?

Also, once I DO download the file, what tools are available for me to traverse it? XPath? RegEx? Is there a way to generate a DOM model from it and attach jQuery?

An interesting possibility would be to somehow just open a tiny pop-up which downloads the page, and then run my script to scrape the page and return data. I am aware this sounds a lot like script injection. Is it even possible to do this in a friendly manner?

George Mauer asked Nov 19 '25 08:11


2 Answers

Assuming you are limiting yourself to JavaScript running in the browser, and to documents that are not on the same host as the page running the script, you can't, at least not without cooperation from the other server.

The Same Origin security policy makes this impossible. Without it, a webpage could request data from any site (including LAN sites) that the user can access, with their IP address, their cookies, and anything else that might be used for authentication. (All your banking are belong to us).

Quentin answered Nov 21 '25 22:11


Wikiquote exposes an API. You can use JSONP to make a request to the API and get the data all pre-parsed and ready to go:

document.body.appendChild(document.createElement("script")).src = 
    "http://en.wikiquote.org/w/api.php?action=query&titles=Last_words" +
        "&prop=revisions&rvlimit=1&rvprop=content&format=json&callback=handleQuote";

function handleQuote(quote)
{
    // quote is the response from wikiquote
}

Note that the response is returned as wiki markup, not HTML. You'll have to do some parsing to get HTML, if that's what you're after. Edit: Use action=parse&page=Last_words to get HTML.

You can preview the JSON response by changing the format argument from json to jsonfm and pasting the URL into your browser:
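As a rough sketch of what handleQuote might do with the response: the helpers below assume the API's default JSON layout, in which query results are keyed by page ID and the text itself sits under a "*" key. The function names are made up for illustration, and the layout is an assumption worth checking against a live response before relying on it.

```javascript
// Hypothetical helpers; the response layout is assumed, not confirmed by the answer.

// For action=query&prop=revisions&rvprop=content: returns the wiki markup of
// the first page in the response, or null if it cannot be found.
function extractWikiMarkup(response) {
    var pages = response && response.query && response.query.pages;
    if (!pages) {
        return null;
    }
    // Pages are keyed by page ID, which is not known in advance.
    var ids = Object.keys(pages);
    if (ids.length === 0) {
        return null;
    }
    var revisions = pages[ids[0]].revisions;
    if (!revisions || revisions.length === 0) {
        return null;
    }
    var text = revisions[0]["*"];
    return typeof text === "string" ? text : null;
}

// For action=parse: returns the rendered HTML, or null if it cannot be found.
function extractParsedHtml(response) {
    var text = response && response.parse && response.parse.text;
    var html = text && text["*"];
    return typeof html === "string" ? html : null;
}
```

Inside handleQuote, something like extractParsedHtml(quote) would then give a string to hand to whatever does the scraping. Returning null instead of throwing keeps the callback simple when the page is missing or the API reports an error.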

Wiki markup:
http://en.wikiquote.org/w/api.php?action=query&titles=Last_words&prop=revisions&rvlimit=1&rvprop=content&format=jsonfm&callback=handleQuote

HTML:
http://en.wikiquote.org/w/api.php?action=parse&page=Last_words&format=jsonfm&callback=handleQuote
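If you end up building several of these URLs, a small helper can keep the parameters readable. This is only a sketch: buildApiUrl and API_ENDPOINT are names invented here, and URLSearchParams is assumed to be available, which is not the case in older browsers.

```javascript
// Hypothetical helper: builds an API URL from a plain object of parameters.
var API_ENDPOINT = "http://en.wikiquote.org/w/api.php";

function buildApiUrl(params) {
    // URLSearchParams percent-encodes names and values where needed
    // and keeps the parameters in insertion order.
    return API_ENDPOINT + "?" + new URLSearchParams(params).toString();
}

// Same request as the wiki markup URL above, but with format=json.
var markupUrl = buildApiUrl({
    action: "query",
    titles: "Last_words",
    prop: "revisions",
    rvlimit: "1",
    rvprop: "content",
    format: "json",
    callback: "handleQuote"
});

// Same request as the HTML URL above, but with format=json.
var htmlUrl = buildApiUrl({
    action: "parse",
    page: "Last_words",
    format: "json",
    callback: "handleQuote"
});
```

Either URL can then be assigned to the src of a script element, as in the JSONP snippet earlier in this answer.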

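Edit: I really only answered half (or less) of your question. As for how to interact with the data once you've got it, jQuery makes it simple. If you pass an HTML string into $(), jQuery constructs the elements for you. Then, you can access them via jQuery or DOM methods. Note that .find() only searches descendants, so wrap the HTML in a container element first if the elements you want might be at the top level of the string: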

var paragraphs = $(someHTML).find("p");

A simple way to get the HTML from any domain via JavaScript is to make your ajax request to a page on your own server that requests the document for you. You could write a generic handler (.ashx) page, with something like:

public void ProcessRequest(HttpContext context)
{
    // Read the target URL from the query string of the current request.
    string url = context.Request.QueryString["url"];
    if (Uri.IsWellFormedUriString(url, UriKind.Absolute))
    {
        // Fetch the remote document server-side and relay it to the caller.
        using (var client = new WebClient())
        {
            context.Response.Write(client.DownloadString(url));
        }
    }
}

And then call it with jQuery:

var url = encodeURIComponent("http://en.wikiquote.org/wiki/Last_words");
$.get("fetch.ashx?url=" + url, function (response)
{
    var $response = $(response);
});
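One caveat with a generic proxy like this: it will fetch any well-formed URL on the caller's behalf, including addresses on the server's own network. A common mitigation is to check the target against a list of permitted hosts before fetching. The sketch below shows that check in JavaScript only to illustrate the logic; isAllowedTarget and the host list are hypothetical, and in practice the same test would need to run server-side inside the handler.

```javascript
// Hypothetical check: accept only http(s) URLs whose host is explicitly allowed.
function isAllowedTarget(rawUrl, allowedHosts) {
    var parsed;
    try {
        parsed = new URL(rawUrl);
    } catch (e) {
        return false; // not a well-formed absolute URL
    }
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
        return false; // rejects file:, ftp:, and other schemes
    }
    // URL normalizes the hostname to lower case, so compare against lower-case entries.
    return allowedHosts.indexOf(parsed.hostname) !== -1;
}

var allowedHosts = ["en.wikiquote.org"]; // example list, adjust as needed
var ok = isAllowedTarget("http://en.wikiquote.org/wiki/Last_words", allowedHosts);
```

An allow list is stricter than trying to block known-bad addresses, which is why it is used here.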

Edit: Newer browsers do support some cross-domain data retrieval through JavaScript by implementing Cross-Origin Resource Sharing (CORS). Firefox and Chrome support CORS via XMLHttpRequest. IE8 and IE9 support CORS with XDomainRequest. The catch is that the server also has to support CORS. In short, the server must include an Access-Control-Allow-Origin response header whose value is either * or the requesting page's origin in order for the client to process the response. And sadly, it appears Wikiquote is not sending that header in its response. Here's a hefty article on the internals of CORS.
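To make the header rule concrete, here is a deliberately simplified model of the decision the browser makes for a plain request without credentials. canReadResponse is an invented name and this is not how any browser is actually implemented; it only mirrors the rule that the header must be present and must be either the wildcard or the page's own origin.

```javascript
// Simplified model of the CORS check for a request sent without credentials.
// pageOrigin: the origin of the page making the request, e.g. "http://example.com"
// allowOriginHeader: the Access-Control-Allow-Origin value, or undefined if absent
function canReadResponse(pageOrigin, allowOriginHeader) {
    if (allowOriginHeader === undefined || allowOriginHeader === null) {
        return false; // no header: the response is withheld from the script
    }
    var value = String(allowOriginHeader).trim();
    return value === "*" || value === pageOrigin;
}

// The situation in the question: a page with origin "null" and no header in the response.
var questionCase = canReadResponse("null", undefined);
```

This is why the request in the question fails even though the server happily returns the page: the browser receives the response but refuses to hand it to the script.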

gilly3 answered Nov 21 '25 22:11


