Blacklist characters are not ignored by Tesseract OCR

Question

I am using Tesseract OCR to recognize the characters in an image, but I want numeric characters to be ignored, so I set:

_tesseract->SetVariable("tessedit_char_blacklist", "0123456789");

With this setting the OCR no longer recognizes numeric characters, but it substitutes other characters in their place, which I don't want.

For example, given an image containing the text USD 12, the OCR returns USD fl.

As shown above, the OCR converted 12 to fl. I want 12 to be ignored entirely.

Is there any way to get USD as the result instead of USD fl?

Any help would be appreciated.

Arnaud · Accepted Answer

See this comment on the SetVariable() method:

// For most variables, it is wise to set them before calling Init.

I had the same issue as you, and moving the call before Init fixed it:

tess = new TessBaseAPI();
// Set the variable first...
tess->SetVariable("tessedit_char_whitelist",
   "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
// ...and only then initialize the engine.
tess->SimpleInit([dataPath cStringUsingEncoding:NSUTF8StringEncoding],
   "eng", false);

Zachary Vance · Answer

That's not what tessedit_char_blacklist is for. Setting tessedit_char_blacklist tells Tesseract that those characters will not appear in the image; it does not make Tesseract skip them. If you give Tesseract incorrect information, you'll get bad results.

What you want instead is to post-process Tesseract's output: let it produce the correct OCR result, then strip out the digit characters.
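
The post-processing step could look roughly like the sketch below. StripDigits is a hypothetical helper, not part of Tesseract. It assumes the OCR result is already in a std::string, that only ASCII digits need to be removed, and that trimming the whitespace left at either end is acceptable.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not a Tesseract function): removes ASCII digits
// from OCR output, then trims whitespace left at the start and end.
std::string StripDigits(const std::string& ocr_text) {
    std::string result;
    result.reserve(ocr_text.size());
    for (unsigned char c : ocr_text) {
        // Copy every byte that is not an ASCII digit.
        if (!std::isdigit(c)) {
            result.push_back(static_cast<char>(c));
        }
    }

    // Trim leading and trailing whitespace so "USD 12" becomes "USD".
    const char* whitespace = " \t\r\n";
    const std::string::size_type first = result.find_first_not_of(whitespace);
    if (first == std::string::npos) {
        return std::string();  // nothing but whitespace remained
    }
    const std::string::size_type last = result.find_last_not_of(whitespace);
    return result.substr(first, last - first + 1);
}
```

With the text from the question, StripDigits("USD 12") yields "USD". In a real program the input would probably be the string returned by GetUTF8Text(), obtained without any digit blacklist set. Whitespace between words is left untouched, so removing a number from the middle of a line leaves the spaces that surrounded it.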

Tags: ios, ocr, tesseract

Nishant Tyagi