
blacklist characters are not ignored by Tesseract OCR

Tags:

ios

ocr

tesseract

I am using Tesseract OCR to recognize the characters in an image, but I want it to ignore numeric characters. I tried setting a blacklist:

_tesseract->SetVariable("tessedit_char_blacklist", "0123456789");

With this setting the OCR no longer outputs digits, but it substitutes other characters in their place, which I don't want.

For example, given an image containing the text USD 12, the OCR returns USD fl.

As you can see, 12 was converted to fl. I want 12 to be ignored entirely.

Is there any way to get USD as the result instead of USD fl?

Any help would be appreciated.

Nishant Tyagi asked Nov 24 '25 14:11


2 Answers

See this comment on the SetVariable() method:

// For most variables, it is wise to set them before calling Init.

I had the same issue as you, and moving the SetVariable() call before Init fixed it:

tess = new TessBaseAPI();
// Set the variable before the engine is initialized.
tess->SetVariable("tessedit_char_whitelist",
   "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
tess->SimpleInit([dataPath cStringUsingEncoding:NSUTF8StringEncoding],
   "eng", false);
Arnaud answered Nov 27 '25 05:11


That's not what tessedit_char_blacklist is for. Setting tessedit_char_blacklist tells Tesseract that those characters will not appear in the image; it is not a filter applied to the output. If you give Tesseract incorrect information, you'll get bad results.

What you want instead is to post-process Tesseract's output. Let it output the correct OCR, and then just strip out the number characters.
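
The post-processing step could look something like the sketch below. StripDigits is a hypothetical helper name, not part of Tesseract; it assumes the OCR result has already been copied into a std::string and that only ASCII digits need to go. It also trims leading and trailing whitespace so that "USD 12" comes out as "USD", but it does not collapse whitespace in the middle of the text.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not a Tesseract function): removes ASCII digits from
// OCR output, then trims whitespace left at either end of the string.
std::string StripDigits(const std::string& text) {
    std::string out;
    out.reserve(text.size());

    // Copy every byte that is not an ASCII digit.
    for (unsigned char c : text) {
        if (!std::isdigit(c)) {
            out.push_back(static_cast<char>(c));
        }
    }

    // Trim leading and trailing whitespace.
    auto not_space = [](unsigned char c) { return !std::isspace(c); };
    out.erase(out.begin(), std::find_if(out.begin(), out.end(), not_space));
    out.erase(std::find_if(out.rbegin(), out.rend(), not_space).base(),
              out.end());
    return out;
}
```

In the question's setup this would be applied to the text returned by tess->GetUTF8Text() with the blacklist removed, remembering that the returned char* has to be freed with delete[] after it is copied into a std::string.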

Zachary Vance answered Nov 27 '25 06:11


