
Convert scanned PDF to .txt files using Tesseract

Tags:

tesseract

I have to convert a .pdf file containing scanned images into .txt files. Tesseract OCR only converts images to text, so I first need to extract the pages as .tif images and then run OCR on them. Can anyone help me with this?

Ganesh Nannaware asked Jan 31 '14 05:01



1 Answer

Use ImageMagick to render the PDF pages to a TIFF image that Tesseract can read:

convert -density 600 input.pdf output.tif

Density is given in DPI; in my experience, 600 DPI works best.
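To get from the TIFF to a .txt file, the conversion can be followed by a Tesseract call. The script below is only a rough sketch of that two-step pipeline, not a tested recipe: it assumes ImageMagick's `convert` (which normally needs Ghostscript to read PDFs) and `tesseract` are both on the PATH, and the function names `output_base` and `pdf_to_txt` are made up for illustration. On newer ImageMagick installs the command may be `magick` rather than `convert`.

```shell
#!/bin/sh
# Sketch: OCR a scanned PDF with ImageMagick + Tesseract.
# Assumption: `convert` and `tesseract` are installed and on the PATH.

# Hypothetical helper: derive the output base name from the PDF path,
# e.g. /tmp/scan.pdf -> scan (files are written to the current directory).
output_base() {
    basename "$1" .pdf
}

# Hypothetical helper: rasterize the PDF, then run OCR on the result.
pdf_to_txt() {
    pdf="$1"
    base="$(output_base "$pdf")"
    # Render all pages at 600 DPI into one (possibly multi-page) TIFF.
    convert -density 600 "$pdf" "$base.tif"
    # Tesseract appends .txt to the output base, so this writes "$base.txt".
    tesseract "$base.tif" "$base"
}

# Run the pipeline only when a PDF path is passed on the command line.
if [ "$#" -gt 0 ]; then
    pdf_to_txt "$1"
fi
```

Saved as, say, pdf2txt.sh, it would be run as `sh pdf2txt.sh input.pdf` and should leave input.tif and input.txt in the current directory. If 600 DPI makes the TIFF too large or slow to process, a lower density can be tried.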

Karol S answered Oct 19 '22 05:10