I am OCRing image-based PDFs using AWS Textract.
Each PDF has 60+ pages,
but when I try to OCR a file, I only get results for the first 4 pages.
Is there a limit on the number of pages in a PDF file for AWS Textract?
I found this: https://docs.aws.amazon.com/textract/latest/dg/limits.html
but it does not mention any limit on the number of pages.
Does anyone know whether there is a page limit for PDFs?
If so, how can I OCR the whole 60+ page file?
The hard limits for Textract are 1000 pages or 500 MB for PDFs.
I think your problem is related to how Textract batches its response. Check whether the key "NextToken" in the JSON output is populated; if it is, make another request with that token, and repeat until no token is returned.
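The token loop described above might look like the following minimal sketch. It assumes the job was started with `start_document_text_detection` and has already finished with status SUCCEEDED; the helper name `collect_all_blocks` is made up for illustration, and `client` is expected to be a boto3 Textract client.

```python
def collect_all_blocks(client, job_id):
    """Follow NextToken until Textract stops returning one.

    `client` is assumed to be boto3.client("textract"); `job_id` is the
    JobId returned when the asynchronous job was started.
    """
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            # Only pass NextToken on follow-up requests.
            kwargs["NextToken"] = next_token
        response = client.get_document_text_detection(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            break
    return blocks


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("textract")
#   blocks = collect_all_blocks(client, "your-job-id")
#   lines = [b["Text"] for b in blocks if b["BlockType"] == "LINE"]
```

If the job was started with `start_document_analysis` instead, the same loop should apply with `get_document_analysis` as the call.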
For asynchronous operations, JPEG and PNG files have a limit of 10 MB in memory. PDF and TIFF files have a limit of 500 MB in memory. PDF and TIFF files have a limit of 3,000 pages.
Are you getting four files in response for the 60+ page document? It may well be that the results for all 60+ pages are contained in those four output files. Note that Textract asynchronous job responses are saved with 1000 Blocks per file, not one page per file.
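One way to check this is to look at the `Page` field on the blocks rather than counting files. The sketch below assumes each output file has already been loaded as a JSON dict with a `Blocks` list; the helper name `pages_covered` is hypothetical.

```python
def pages_covered(result_chunks):
    """Return the sorted page numbers that appear in any block.

    `result_chunks` is assumed to be a list of dicts, one per output
    file (or per paginated response), each with a "Blocks" list.
    """
    pages = set()
    for chunk in result_chunks:
        for block in chunk.get("Blocks", []):
            if "Page" in block:
                pages.add(block["Page"])
    return sorted(pages)


# Two small stand-in chunks: the second file continues page 2 and adds page 3.
chunks = [
    {"Blocks": [{"BlockType": "PAGE", "Page": 1}, {"BlockType": "LINE", "Page": 2}]},
    {"Blocks": [{"BlockType": "LINE", "Page": 2}, {"BlockType": "LINE", "Page": 3}]},
]
print(pages_covered(chunks))
```

If the largest page number across the four files matches the document's page count, the whole PDF was processed.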