I have an image that I can't get tesseract to recognise as text. All my input text will be URLs.

As you can see, the image is as clear as it can be.
Running tesseract test2.png stdout returns:
http:II11111111111111111111111111111111111
1111111111111111111.coml
which is close, but not correct.
When setting the tessedit_char_whitelist parameter to htp:/1.com it recognises the string correctly (but I want more general recognition of URLs as well).
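A broader whitelist can be sketched as follows. This is only an illustration: the character set and the helper names url_whitelist and tesseract_args are assumptions, not part of tesseract. It assumes the URLs contain no uppercase letters, so "I" is left out of the whitelist and cannot be picked instead of "/". The -c option is how tesseract accepts a config variable on the command line.

```python
import string
from typing import List

# Assumed punctuation that may appear in the URLs being scanned.
URL_PUNCTUATION = ":/.-_~?#&=%+"


def url_whitelist() -> str:
    """Lowercase letters, digits and URL punctuation (no uppercase 'I')."""
    return string.ascii_lowercase + string.digits + URL_PUNCTUATION


def tesseract_args(image_path: str) -> List[str]:
    """Build the argument list for a whitelisted tesseract run."""
    return [
        "tesseract",
        image_path,
        "stdout",
        "-c",
        "tessedit_char_whitelist=" + url_whitelist(),
    ]


# Usage (requires tesseract on PATH):
#   import subprocess
#   text = subprocess.check_output(tesseract_args("test2.png")).decode("utf-8")
```

Lowercase "l" stays in the whitelist, so this would not be expected to help with a stray trailing "l" like the one in the output above.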
Passing in a pattern file with the command line tesseract test2.png stdout --user-patterns ./patterns.txt, where patterns.txt contains
\n\*://\n\*
http://\n\*
\n\*.com
doesn't help with recognition. It still prefers I over /.
I have also tried setting the parameter ok_repeated_ch_non_alphanum_wds to include / (and chs_trailing_punct{1,2} for a trailing /), but it doesn't seem to work. Specifying --user-words doesn't help either (with the "words" file containing http://).
Is there a way of specifying char priority for tesseract?
Version info:
$ tesseract -v
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
You can achieve this by adding the following line to your unicharambigs file:
3 : I I 3 : / / 1
combine_tessdata -e eng.traineddata eng.unicharambigs
nano eng.unicharambigs (make sure to use tabs after both 3s and the second /).
combine_tessdata -o eng.traineddata eng.unicharambigs
Output using the amended traineddata file:
$ tesseract test2.png stdout
http://11111111111111111111111111111111111
1111111111111111111.coml
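Because the tabs in the unicharambigs line are easy to lose in an editor, the line can also be generated programmatically. The sketch below assumes the tab-separated "v1" layout (source length, space-separated source characters, target length, space-separated target characters, type flag). The helper names ambig_line and append_ambig are hypothetical.

```python
def ambig_line(source: str, target: str, mandatory: bool = True) -> str:
    """Format one ambiguity entry with tab-separated fields.

    Assumed layout: <len> TAB <chars> TAB <len> TAB <chars> TAB <type>,
    where the characters inside a field are separated by single spaces.
    """
    fields = [
        str(len(source)),
        " ".join(source),
        str(len(target)),
        " ".join(target),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


def append_ambig(path: str, line: str) -> None:
    """Append the entry to an extracted unicharambigs file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")


# Replace ":II" with "://", matching the line shown in the answer.
line = ambig_line(":II", "://")
```

After calling append_ambig("eng.unicharambigs", line) on the extracted file, the traineddata still has to be repacked with combine_tessdata -o as shown above. The append assumes the file already ends with a newline.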