I am trying to scrape some information from Wikipedia pages with my Node.js
app, using jsdom. Here is an example of what I'm doing:
jsdom.env({
  url: "https://en.wikipedia.org/wiki/Bill_Gates",
  features: {
    FetchExternalResources: ['script'],
    ProcessExternalResources: ['script'],
    SkipExternalResources: false,
  },
  done: function (err, window) {
    if (err) {
      console.log("Error: ", err);
      return;
    }
    var paras = window.document.querySelectorAll('p');
    console.log("Paras: ", paras);
  }
});
The weird thing is that querySelectorAll('p') returns a NodeList of empty elements:
Paras: NodeList {
'0': HTMLParagraphElement {},
'1': HTMLParagraphElement {},
'2': HTMLParagraphElement {},
'3': HTMLParagraphElement {},
'4': HTMLParagraphElement {},
'5': HTMLParagraphElement {},
'6': HTMLParagraphElement {},
'7': HTMLParagraphElement {},
...
'62': HTMLParagraphElement {} }
Any idea on what could be the problem? Thanks!
EDIT:
I got the same result when replacing window.document.querySelectorAll('p') with window.document.getElementsByTagName('p').
The elements are not empty; console.log just doesn't show their contents.
You have to read a property from each element to see its data (textContent, for example).
Try this:
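// `dom` is assumed to be a JSDOM instance; inside the `done` callback
// from the question, use `window` directly instead of `dom.window`.
Array.prototype.slice.call(dom.window.document.getElementsByTagName("p")).forEach(p => {
  console.log(p.textContent);
});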
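As a rough illustration of why the logged elements look empty, here is a minimal sketch that does not use jsdom at all. FakeParagraph is a hypothetical stand-in for a DOM element, not a jsdom class; the assumption is that, like DOM elements, it exposes its text only through a getter on the prototype rather than as an own property. Node's console.log formats objects with util.inspect, which by default lists only own enumerable properties, so the text does not appear until you read the property yourself.

```javascript
const util = require("util");

// Hypothetical stand-in for a DOM element (not part of jsdom): the text is
// kept in a private field and exposed only via a getter on the prototype.
class FakeParagraph {
  #text;

  constructor(text) {
    this.#text = text;
  }

  get textContent() {
    return this.#text;
  }
}

const paras = [
  new FakeParagraph("First paragraph"),
  new FakeParagraph("Second paragraph"),
];

// console.log formats objects with util.inspect, which only lists own
// enumerable properties by default, so the paragraph text is not shown.
const inspected = util.inspect(paras[0]);
console.log(inspected);

// Reading the property explicitly gives the actual content.
const texts = paras.map((p) => p.textContent);
console.log(texts);
```

The same idea applies to the NodeList in the question: convert it to an array (or loop over it) and read textContent from each element instead of logging the elements themselves.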