I want to extract the JavaScript inside a script tag.
This is the script tag:
<script>
$(document).ready(function(){
$("#div1").click(function(){
$("#divcontent").load("ajax.content.php?p=0&cat=1");
});
$("#div2").click(function(){
$("#divcontent").load("ajax.content.php?p=1&cat=1");
});
});
</script>
I have an array of ids like ['div1', 'div2'], and I need to extract the URL that each one loads.
So if I call a function:
getUrlOf('div1');
it should return ajax.content.php?p=0&cat=1
If you're using a newer version of cheerio (1.0.0-rc.2), you'll need to use .html() instead of .text():
const cheerio = require('cheerio');
const $ = cheerio.load('<script>script one</script> <script> script two</script>');
// For the first script tag
console.log($('script').html());
// For all script tags
console.log($('script').map((idx, el) => $(el).html()).toArray());
https://github.com/cheeriojs/cheerio/issues/1050
With Cheerio, it is very easy to get the text of the script tag:
const cheerio = require('cheerio');
const $ = cheerio.load("the HTML of the webpage you are scraping");
// If there's only one <script>
console.log($('script').text());
// If there are multiple scripts (elem is a raw node, so wrap it with $() before calling .text())
$('script').each((idx, elem) => console.log($(elem).text()));
From here, you're really just asking "how do I parse a generic block of JavaScript and extract a list of links". I agree with Patrick above in the comments: you probably shouldn't. Can you craft a regex that will let you find each link in the script and deduce the page it links to? Yes. But if anything about this page changes, your script will very likely break immediately: the author of the page might switch to inline <a> tags, refactor the code, use live events, etc.
Just be aware that relying on the exact contents of this script tag will make your application very brittle -- even more brittle than page scraping generally is.
Here's an example of a loose but effective regex:
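let html = "incoming html";
// The g flag matters: without it, exec() restarts at index 0 every time and the loop never ends
let regex = /\$\("(#.+?)"\)\.click(?:.|\n)+?\.load\("(.+?)"/g;
let match;
while ((match = regex.exec(html)) !== null) {
  console.log(match[1] + ': ' + match[2]);
}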
In case you are new to regex: this expression contains two capture groups, in parens (the first is the div id, the second is the link text), as well as a non-capturing group in the middle, which exists only to make sure the regex will continue through a line break. I say it's "loose" because the match it is looking for looks like this:
$("***").click***ignored chars***.load("***"
So, depending on how much JavaScript there is and how similar it is, you might have to tighten it up to avoid false positives.
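To tie this back to the getUrlOf('div1') call in the question, here is a rough sketch of how the regex approach could feed a lookup. It assumes you already have the script text as a string (for example from $('script').html()). buildUrlMap is a made-up helper name, and the pattern is a slightly tightened variant of the one above: it uses [^"]+ so a capture cannot run past a closing quote, and it captures the id without the leading #.

```javascript
// Hypothetical helper (not part of cheerio): builds an id -> URL lookup
// from the text of the script. Assumes every handler follows the
// $("#id").click(... .load("url") ...) shape shown in the question.
function buildUrlMap(scriptText) {
  // Group 1: the element id (without '#'); group 2: the URL passed to .load()
  const pattern = /\$\("#([^"]+)"\)\.click[\s\S]+?\.load\("([^"]+)"/g;
  const urls = new Map();
  for (const match of scriptText.matchAll(pattern)) {
    urls.set(match[1], match[2]);
  }
  return urls;
}

// The script body from the question, inlined so the sketch runs on its own.
const scriptText = `
$(document).ready(function(){
$("#div1").click(function(){
$("#divcontent").load("ajax.content.php?p=0&cat=1");
});
$("#div2").click(function(){
$("#divcontent").load("ajax.content.php?p=1&cat=1");
});
});
`;

const urlMap = buildUrlMap(scriptText);

// Returns undefined for ids that have no matching click handler.
const getUrlOf = (id) => urlMap.get(id);

console.log(getUrlOf('div1')); // ajax.content.php?p=0&cat=1
console.log(getUrlOf('div2')); // ajax.content.php?p=1&cat=1
```

Building the map once and looking ids up afterwards avoids re-running the regex for every id in your array. The brittleness warning still applies: if the page author changes how the handlers are written, the map will simply come back empty.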