Is there a limit on the number of PDF pages that can be OCRed using AWS Textract?

I am OCRing image-based PDFs using AWS Textract.

Each of my PDFs has 60+ pages,

but when I try to OCR a file, only the first 4 pages are processed.

Is there a limit on the number of pages in a PDF file for AWS Textract?

I found this: https://docs.aws.amazon.com/textract/latest/dg/limits.html

but it does not mention any limit on the number of pages.

Does anyone know whether there is a page limit for PDFs?

And if so, how can I OCR the whole 60+ page file?

asmgx asked Sep 11 '25 21:09


2 Answers

The hard limits for Textract are 1000 pages or 500 MB for PDFs.

I think your problem is related to the batched (paginated) response of Textract. Check whether the "NextToken" key in the JSON output is populated; if so, make another request with that token, and repeat until no token is returned.
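As a rough sketch of that loop, assuming the job was started with the asynchronous text detection API and has already finished successfully: the helper name `collect_all_blocks` is made up for illustration, and `client` is assumed to be a boto3 Textract client.

```python
def collect_all_blocks(client, job_id):
    """Follow NextToken until every Block for the job has been fetched.

    `client` is assumed to be a boto3 Textract client and `job_id` the
    JobId of an asynchronous job that has already completed.
    """
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            # Ask for the next batch of results for the same job.
            kwargs["NextToken"] = next_token
        response = client.get_document_text_detection(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            # No token in the response means this was the last batch.
            return blocks
```

With boto3 installed, this could be called as `collect_all_blocks(boto3.client("textract"), job_id)`. If only the first response is read, the remaining pages are silently missing, which would match the "only the first 4 pages" symptom.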

ale answered Sep 15 '25 01:09


For asynchronous operations, JPEG and PNG files have a limit of 10 MB in memory. PDF and TIFF files have a limit of 500 MB in memory. PDF and TIFF files have a limit of 3,000 pages.

Are you getting four files in response for the 60+ page document? It could very well be that the results for all 60+ pages are contained in those four output files. Note that Textract asynchronous job responses are saved at 1,000 Blocks per file, not one page per file.
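One way to check this is to group the returned Blocks by their page number rather than by output file. The following is only a sketch: the helper name `lines_by_page` is hypothetical, and it assumes `blocks` is the combined list of Block dictionaries loaded from all the output files.

```python
from collections import defaultdict


def lines_by_page(blocks):
    """Group the text of LINE blocks by the page each block reports."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE":
            # Each Block reports the page it was detected on in "Page".
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return dict(pages)
```

If `len(lines_by_page(blocks))` comes out close to the number of pages in the PDF (pages with no detected text will not appear), the four files do cover the whole document.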

For reference:

  1. Block format: https://docs.aws.amazon.com/textract/latest/dg/API_Block.html
  2. Documentation of Amazon Textract set quotas (limits which are non-configurable): https://docs.aws.amazon.com/textract/latest/dg/limits-document.html

Rohan Kumar answered Sep 15 '25 00:09