
Unable to parse page text, getting "ReferenceError: ReadableStream is not defined"

I am currently trying to create a utility that parses annotations from a PDF. I can load the PDF file and the annotation objects without problems, but I also need to obtain the text related to those annotations (underlined, highlighted, etc.).

The trouble starts when I call the getTextContent() method, which fails. Below is the function where this happens:

/**
 * @param pdf The PDF document obtained upon `pdfjs.getDocument(pdf).promise` success.
 */
function getAllPages(pdf) {
  return new Promise((resolve, reject) => {
    let allPromises = [];
    for (let i = 0; i < pdf.numPages; i++) {
      const pageNumber = i + 1; // note: pages are 1-based
      const page = pdf.getPage(pageNumber)
        .then((pageContent) => {

          // testing with just one page to see what's up
          if (pageNumber === 1) {
            try {
              pageContent.getTextContent()
                .then((txt) => {
                  // THIS NEVER OCCURS
                  console.log('got text');
                })
                .catch((error) => {
                  // THIS IS WHERE THE ERROR SHOULD BE CAUGHT
                  console.error('in-promise error', error)
                });
            } catch (error) {
              // AT LEAST IT SHOULD BE CAUGHT HERE
              console.log('try/catch error:', error);
            }
          }
        })
        .catch(reject);

      allPromises.push(page);
    }
    Promise.all(allPromises)
      .then(() => {
        allPagesData.sort(sortByPageNumber);
        resolve(allPagesData);
      })
      .catch(reject);
  });
}

When calling pageContent.getTextContent(), which should return a promise, the error "ReferenceError: ReadableStream is not defined" is caught by the catch block of the surrounding try/catch, not by the promise's catch().

This is odd, because I would have expected pageContent.getTextContent().catch() to catch it. I also don't know how to resolve the error itself.

Any help is appreciated.
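For context, the behaviour described above is consistent with an error that is thrown synchronously, before any promise is returned. In that case there is no promise for .catch() to attach to, so only the surrounding try/catch sees it. The sketch below illustrates this with a hypothetical stand-in function (throwsBeforePromise is not part of pdfjs-dist); it assumes, without confirming from the library source, that getTextContent() references ReadableStream before it creates its promise.

```javascript
// Hypothetical stand-in for a function that throws before it can return a promise.
function throwsBeforePromise() {
  // Simulates referencing a global that the runtime does not define.
  throw new ReferenceError('ReadableStream is not defined');
}

// Calling it directly: the throw is synchronous, so .catch() is never attached.
function callDirectly() {
  try {
    throwsBeforePromise().catch(() => 'promise catch');
    return 'no error';
  } catch (error) {
    return 'try/catch';
  }
}

// Calling it inside a .then() callback turns the throw into a rejection,
// which the promise chain's .catch() can handle.
function callInsideThen() {
  return Promise.resolve()
    .then(() => throwsBeforePromise())
    .then(() => 'no error')
    .catch(() => 'promise catch');
}

console.log(callDirectly()); // handled by the try/catch branch
callInsideThen().then((where) => console.log(where)); // handled by .catch()
```

Wrapping the call in a .then() callback (or an async function) only changes where the error is reported; it does not make ReadableStream available.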

asked Oct 17 '25 18:10 by jansensan


2 Answers

I have noticed that requiring pdfjs-dist directly causes the error.

Use pdfjs-dist/es5/build/pdf.js instead:

const pdfjs = require('pdfjs-dist/es5/build/pdf.js');

Update: the path has since changed to:

const pdfJs = require('pdfjs-dist/legacy/build/pdf')
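As a rough sketch of how the text extraction from the question could look once the library loads, the function below collects the text of every page. It relies only on numPages, getPage() and getTextContent() as used in the question, plus the items[].str shape of the text content; the name getAllPagesText and the './example.pdf' path are placeholders for illustration.

```javascript
// Sketch: collect the text of every page of an already-loaded PDF document.
// `pdf` is the object resolved by `pdfjs.getDocument(src).promise`.
function getAllPagesText(pdf) {
  const pagePromises = [];
  // Pages are 1-based.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    pagePromises.push(
      pdf.getPage(pageNumber)
        // Returning the inner promise lets Promise.all wait for the text.
        .then((page) => page.getTextContent())
        .then((textContent) => ({
          pageNumber,
          // Assumes each item carries its text in `str`.
          text: textContent.items.map((item) => item.str).join(' '),
        }))
    );
  }
  // Promise.all keeps the input order, so no sorting is needed afterwards.
  return Promise.all(pagePromises);
}

// Usage (requires pdfjs-dist to be installed):
// const pdfjs = require('pdfjs-dist/legacy/build/pdf');
// pdfjs.getDocument('./example.pdf').promise
//   .then(getAllPagesText)
//   .then((pages) => console.log(pages))
//   .catch((error) => console.error(error));
```

Because each step returns its promise, a rejection anywhere in the chain reaches the final .catch() without wrapping everything in new Promise().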


answered Oct 19 '25 08:10 by Shihab


After a recent change in the package, the only path that worked for me was:

const pdfJs = require('pdfjs-dist/legacy/build/pdf')
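Since the working path seems to depend on which pdfjs-dist version is installed, one option is to try the paths mentioned in the answers in turn. This is only a sketch: loadFirstAvailable and CANDIDATE_PATHS are hypothetical names, and the helper treats any load failure as a reason to try the next path.

```javascript
// Candidate entry points taken from the answers above, newest first.
const CANDIDATE_PATHS = [
  'pdfjs-dist/legacy/build/pdf',
  'pdfjs-dist/es5/build/pdf.js',
];

// Hypothetical helper: returns the first module that loads,
// or rethrows the last error if none of the paths work.
function loadFirstAvailable(paths, requireFn = require) {
  let lastError = new Error('no candidate paths given');
  for (const path of paths) {
    try {
      return requireFn(path);
    } catch (error) {
      lastError = error; // remember why this path failed, then try the next one
    }
  }
  throw lastError;
}

// Usage (requires pdfjs-dist to be installed):
// const pdfjs = loadFirstAvailable(CANDIDATE_PATHS);
```

Pinning the pdfjs-dist version and using a single path is simpler when the version is known; the fallback mainly helps code that has to run against several versions.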
answered Oct 19 '25 07:10 by Josias da Paixao junior