Is there any limit on number of pdf pages to be OCRed using AWS Textract?

Question

I am OCRing image-based PDFs using AWS Textract.

Each PDF has 60+ pages,

but when I try to OCR a PDF file, only the first 4 pages of each file are processed.

Is there a limit on the number of pages in a PDF file for AWS Textract?

I found this: https://docs.aws.amazon.com/textract/latest/dg/limits.html

but it does not mention any limit on the number of pages.

Does anyone know if there is a limit on the number of PDF pages?

If so, how can I OCR the whole 60+ page file?

ale · Accepted Answer

The hard limits for Textract asynchronous operations are 3,000 pages or 500 MB for PDFs.

I think your problem is related to Textract's paginated responses. Check whether the "NextToken" key in the JSON output is populated; if it is, make another request with that token to get the next batch of results, and repeat until no token is returned.
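As a rough sketch of that loop, assuming the job was started with the asynchronous text-detection API and has already finished: the function name `fetch_all_blocks` is made up for illustration, and `client` is expected to be a boto3 Textract client (or anything exposing the same `get_document_text_detection` call).

```python
def fetch_all_blocks(client, job_id):
    """Collect every Block of a finished job by following NextToken.

    `client` is assumed to behave like boto3.client("textract");
    `job_id` is assumed to come from start_document_text_detection.
    """
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id, "MaxResults": 1000}
        if next_token:
            # Ask for the batch that follows the previous response.
            kwargs["NextToken"] = next_token
        response = client.get_document_text_detection(**kwargs)

        status = response.get("JobStatus")
        if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            raise RuntimeError(f"Job {job_id} is not finished: {status}")

        blocks.extend(response.get("Blocks", []))

        # No NextToken in the response means this was the last batch.
        next_token = response.get("NextToken")
        if not next_token:
            return blocks
```

With boto3 this would be called roughly as `fetch_all_blocks(boto3.client("textract"), job_id)`. If only the first response is read, the result stops wherever the first batch ends, which may be what looks like "only the first 4 pages".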

Rohan Kumar · Answer

For asynchronous operations, JPEG and PNG files have a limit of 10 MB in memory. PDF and TIFF files have a limit of 500 MB in memory and 3,000 pages.

Are you getting four files in response for the 60+ page document? It could very well be that the results for all 60+ pages are within those four output files. Please note that Textract asynchronous job responses are saved at 1,000 Blocks per file, not one page per file.
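One way to check this is to count the distinct page numbers across the output files rather than counting the files. The sketch below assumes each output file is a JSON document with a top-level "Blocks" list whose entries carry a "Page" field, as described in the Block reference; the function name `summarize_outputs` is hypothetical, and reading the files from S3 or disk is left out.

```python
import json


def summarize_outputs(json_documents):
    """Return (total_blocks, sorted_page_numbers) across several result files.

    `json_documents` is an iterable of JSON strings, one per output file.
    """
    total_blocks = 0
    pages = set()
    for text in json_documents:
        payload = json.loads(text)
        for block in payload.get("Blocks", []):
            total_blocks += 1
            page = block.get("Page")
            if page is not None:
                pages.add(page)
    return total_blocks, sorted(pages)
```

If the returned page list runs from 1 up to the document's page count, the whole PDF was processed even though there are only a handful of output files.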

For reference:

Block format: https://docs.aws.amazon.com/textract/latest/dg/API_Block.html
Documentation of Amazon Textract set quotas (limits which are non-configurable): https://docs.aws.amazon.com/textract/latest/dg/limits-document.html
