Convert scanned pdf to .txt files using tesseract

Tags:

tesseract

I have to convert a .pdf file containing scanned images into .txt files. Tesseract OCR only converts images to .txt, so I first need to extract the pages as .tif images and then convert them. Can anyone help me with this?

asked Jan 31 '14 05:01

Ganesh Nannaware

1 Answer

Use ImageMagick to render the PDF pages to a TIFF first:

convert -density 600 input.pdf output.tif

The density is given in DPI; in my experience, 600 DPI works best.
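
To get from the TIFF to the .txt file the question asks for, the two steps can be chained. The following is only a minimal sketch: `ocr_pdf` is a made-up helper name, and it assumes that both `convert` and `tesseract` are on your PATH, that ImageMagick can read PDFs (it normally relies on Ghostscript for that), and that your Tesseract build accepts the multi-page TIFF that `convert` produces. On ImageMagick 7 you may need to call `magick` instead of `convert`.

```shell
#!/bin/sh
# Sketch: OCR a scanned PDF in two steps.
# ocr_pdf is a hypothetical helper name, not part of ImageMagick or Tesseract.
ocr_pdf() (
    pdf=$1
    base=${pdf%.pdf}    # input.pdf -> input

    # Step 1: render the PDF pages at 600 DPI into a TIFF (same command as above).
    # Step 2: run Tesseract on the TIFF; it appends ".txt" to the output base name.
    convert -density 600 "$pdf" "$base.tif" &&
        tesseract "$base.tif" "$base"
)

# Process every PDF given on the command line (does nothing if none are given).
for f in "$@"; do
    ocr_pdf "$f"
done
```

Saved as, say, pdf2txt.sh, running `sh pdf2txt.sh input.pdf` should leave input.tif and input.txt next to the PDF. If 600 DPI makes the TIFF too large or the OCR too slow, a lower density such as 300 may be worth trying.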

answered Oct 19 '22 05:10

Karol S
