 

Get html source code from a website and then get an element from the html file

I want to get HTML code of a website and then get a certain element from that HTML file.

There are tools that can fetch HTML, like AJAX and jQuery, but I am using Node and want to do it entirely in JavaScript. Also, I have no idea how to extract a certain element from the result.

I have done this in Python, but I need it in JavaScript. For simplicity, let's take the website https://example.com. This is the body of the website's HTML:

<body>
<div>
    #Some Stuff 
</div>
</body>

I want to get the div by its class; let's say the <div> is <div class="test"> to make it easier.

Finally, I want to get the content of <div class="test">, like this:

<div class="test">
    #Some Stuff 
</div>

Thanks in advance.

Weirdo914 asked Sep 06 '25

1 Answer

For Node.js there are two native fetching modules: http and https. If you're looking to scrape with a Node.js application, you should probably use https to get the page's HTML, then parse it with an HTML parser; I'd recommend cheerio. Here's an example:

// native Node.js module
const https = require('https')
// don't forget to `npm install cheerio` to get the parser!
const cheerio = require('cheerio')

// custom fetch for Node.js
// fetch helper for Node.js (https.get only performs GET requests)
const fetch = url => new Promise((resolve, reject) => {
    https.get(
        url,
        res => {
            const dataBuffers = []
            res.on('data', data => dataBuffers.push(data.toString('utf8')))
            res.on('end', () => resolve(dataBuffers.join('')))
        }
    ).on('error', reject)
})

const scrapeHtml = url => new Promise((resolve, reject) => {
  fetch(url)
  .then(html => {
    const cheerioPage = cheerio.load(html)
    // cheerioPage is now a loaded html parser with a similar interface to jQuery
    // FOR EXAMPLE, to find a table with the id productData, you would do this:
    const productTable = cheerioPage('table#productData')

    // .find() performs further jQuery-like searches inside the element:
    const productRows = productTable.find('tr')

    // now we have a reference to every row in the table. The object
    // returned from a cheerio search is array-like, so we loop over it
    // by index and wrap each raw element with cheerioPage() to read its text:
    const productsTextData = []
    for (let i = 0; i < productRows.length; i++) {
      productsTextData.push(cheerioPage(productRows[i]).text().trim())
    }
    resolve(productsTextData)
  })
  .catch(reject)
})

scrapeHtml(/*URL TO SCRAPE HERE*/)
.then(data => {
  // expect the data returned to be an array of text from each 
  // row in the table from the html we loaded. Now we can do whatever
  // else you want with the scraped data. 
  console.log('data: ', data)
})
.catch(err => console.log('err: ', err))

Happy scraping!

Jacob Penney answered Sep 08 '25