
cheerio find a text in a script tag

I want to extract the JavaScript inside a script tag.

This is the script tag:

<script>
  $(document).ready(function(){

    $("#div1").click(function(){
      $("#divcontent").load("ajax.content.php?p=0&cat=1");
    });

    $("#div2").click(function(){
      $("#divcontent").load("ajax.content.php?p=1&cat=1");
    });

  });
</script>

I have an array of ids like ['div1', 'div2'], and I need to extract the URL that each one loads. So if I call a function:

getUrlOf('div1');

it should return ajax.content.php?p=0&cat=1

yozawiratama asked Nov 17 '25 16:11


2 Answers

If you're using a newer version of cheerio (1.0.0-rc.2), you'll need to use .html() instead of .text() to read the contents of a script tag:

const cheerio = require('cheerio');
const $ = cheerio.load('<script>script one</script>  <script>  script two</script>');

// For the first script tag
console.log($('script').html());

// For all script tags
console.log($('script').map((idx, el) => $(el).html()).toArray());

https://github.com/cheeriojs/cheerio/issues/1050

ryanbraganza answered Nov 19 '25 08:11


With Cheerio, it is very easy to get the text of the script tag:

const cheerio = require('cheerio');
const $ = cheerio.load("the HTML of the webpage you are scraping");

// If there's only one <script>
console.log($('script').text());

// If there are multiple scripts: elem is a raw node, so wrap it with $() first
$('script').each((idx, elem) => console.log($(elem).text()));

From here, you're really just asking "how do I parse a generic block of javascript and extract a list of links". I agree with Patrick above in the comments: you probably shouldn't. Can you craft a regex that will let you find each link in the script and deduce the page it links to? Yes. But if anything about this page changes, your script will very likely break immediately. The author of the page might switch to inline <a> tags, refactor the code, use live events, etc.

Just be aware that relying on the exact contents of this script tag will make your application very brittle -- even more brittle than page scraping generally is.

Here's an example of a loose but effective regex:

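let html = "incoming html";
// The g flag makes exec() resume after the previous match; without it,
// the loop below would return the first match forever.
let regex = /\$\("(#.+?)"\)\.click(?:.|\n)+?\.load\("(.+?)"/g;
let match;

while ((match = regex.exec(html)) !== null) {
    console.log(match[1] + ': ' + match[2]);
}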

In case you are new to regex: this expression contains two capture groups, in parens (the first is the div id, the second is the link text), as well as a non-capturing group in the middle, which exists only to make sure the regex will continue through a line break. I say it's "loose" because the match it is looking for looks like this:

  • $("***").click***ignored chars***.load("***"

So, depending on how much javascript there is and how similar it is, you might have to tighten it up to avoid false positives.
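
To tie this back to the getUrlOf call in the question, here is a minimal sketch of one way to tighten the pattern by anchoring it to a single id. It is an illustration, not a definitive implementation: getUrlOf here takes the script text as an extra first argument, escapeRegExp and sampleScript are hypothetical names introduced for the example, and the script text is assumed to have already been pulled out of the page (for instance with cheerio, as shown in the answers above). It assumes each click handler contains a .load() call; if one does not, the lazy match will run on into the next handler and return that handler's URL.

```javascript
// Hypothetical helper: escape characters that are special inside a RegExp,
// so an id such as "a.b" is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper (not part of cheerio): returns the URL passed to
// .load() after the click handler registered for `id`, or null if the
// pattern is not found in `scriptText`.
function getUrlOf(scriptText, id) {
  const pattern = new RegExp(
    '\\$\\(["\']#' + escapeRegExp(id) + '["\']\\)\\.click' + // $("#id").click
    '[\\s\\S]+?' +                                           // anything, lazily, across lines
    '\\.load\\(["\']([^"\']+)["\']'                          // .load("url")
  );
  const match = pattern.exec(scriptText);
  return match ? match[1] : null;
}

// Sample input mirroring the script tag from the question.
const sampleScript = `
  $(document).ready(function(){
    $("#div1").click(function(){
      $("#divcontent").load("ajax.content.php?p=0&cat=1");
    });
    $("#div2").click(function(){
      $("#divcontent").load("ajax.content.php?p=1&cat=1");
    });
  });
`;

// Look up the URL for each id in the array from the question.
console.log(['div1', 'div2'].map((id) => getUrlOf(sampleScript, id)));
```

Building the pattern per id keeps the surrounding match as loose as before but removes the need to loop over every handler, and an id that is absent from the script simply yields null.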

Elliot Nelson answered Nov 19 '25 06:11