I have to convert a .pdf
file containing scanned images into .txt
files. The tesseract ocr
converts only images to .txt
, but I need to first extract the .tif
images and then convert it. Can anyone help me with this?
Instead of relying on PDF structure to extract the underlying text, we can convert PDF into Image(s), then use an OCR engine (e.g., Tesseract) to extract text from the image(s).
Open a PDF file containing a scanned image in Acrobat for Mac or PC. Click on the “Edit PDF” tool in the right pane. Acrobat automatically applies optical character recognition (OCR) to your document and converts it to a fully editable copy of your PDF. Click the text element you wish to edit and start typing.
There are many applications to what OCR can do in term of document intelligence. Using pytesseract, one can extract almost all the data irrespective of the format of the documents (whether its a scanned document or a pdf or a simple jpeg image).
Use Imagemagick:
convert -density 600 input.pdf output.tif
Density is in DPI, from my experience 600 DPI works the best.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With