Detecting Bangla character using pytesseract

Question

I am trying to detect Bangla characters in an image using Python, so I decided to use pytesseract. For this purpose I used the code below:

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

im = Image.open("input.png") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('temp2.png'),lang="ben")
print(text)

The problem is that if I give it an image of English characters, it detects them. But when I set lang="ben" and run it on an image of Bengali characters, my code seems to run forever.

P.S.: I have downloaded the Bengali trained data to the tessdata folder, and I am running the code in PyCharm.

Can anyone help me solve this problem?

sample of input.png

thewaywewere · Accepted Answer

I added the Bangla (India) language to Windows and downloaded ben.traineddata to TESSDATA_PREFIX, which is C:\Program Files\Tesseract 4.0.0\tessdata on my PC. Then I ran

> tesseract -l ben bangla.jpg bangla_out

in the command prompt and got the result below in 2 seconds. The result looks fine, even though I don't understand the language.

[image: OCR result]
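Before running OCR, it can help to confirm that ben.traineddata is where Tesseract will look for it. The following is a minimal Python 3 sketch, not part of the original test. The helper name find_traineddata is hypothetical, and checking both the TESSDATA_PREFIX folder and a tessdata subfolder inside it is an assumption meant to cover the two ways the variable is commonly set.

```python
import os


def find_traineddata(lang, prefix=None):
    """Return the path of <lang>.traineddata, or None if it is not found.

    `prefix` defaults to the TESSDATA_PREFIX environment variable.
    """
    if prefix is None:
        prefix = os.environ.get("TESSDATA_PREFIX", "")
    if not prefix:
        return None
    filename = lang + ".traineddata"
    # Assumption: the file is either directly in `prefix` or in a
    # "tessdata" subfolder of it.
    for folder in (prefix, os.path.join(prefix, "tessdata")):
        candidate = os.path.join(folder, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Calling find_traineddata("ben") returns the full path when the file is present and None otherwise, which separates a missing-language problem from a problem in the OCR call itself.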

Have you tried running tesseract in the command prompt to verify that it works with -l ben?
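The same command-line check can be scripted so that a hanging run is cut off instead of blocking forever. This is a rough Python 3 sketch under a few assumptions: the tesseract executable is on PATH (otherwise pass its full path as tesseract_cmd), and the helper names parse_langs, installed_langs and run_cli are made up for illustration.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines()]
    for i, line in enumerate(lines):
        # The header looks like "List of available languages (3):";
        # the language codes follow it, one per line.
        if line.startswith("List of available languages"):
            return [code for code in lines[i + 1:] if code]
    return []


def installed_langs(tesseract_cmd="tesseract"):
    """Ask the tesseract executable which languages it can load."""
    proc = subprocess.run([tesseract_cmd, "--list-langs"],
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          universal_newlines=True, timeout=30)
    return parse_langs(proc.stdout)


def run_cli(image, out_base, lang="ben", tesseract_cmd="tesseract", seconds=60):
    """Run `tesseract -l <lang> <image> <out_base>`.

    Returns False if the process is still running after `seconds`.
    A non-zero exit code raises subprocess.CalledProcessError.
    """
    cmd = [tesseract_cmd, "-l", lang, image, out_base]
    try:
        subprocess.run(cmd, check=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        return False
    return True
```

If "ben" is missing from installed_langs(), the trained data is not being found. If run_cli("bangla.jpg", "bangla_out") returns False, the hang is in Tesseract itself rather than in pytesseract or the IDE.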

EDIT:

I used Spyder, which comes with Anaconda and is similar to PyCharm, to test it. I modified your code to call Tesseract as below.

pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"

Test Code in Spyder:

import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
import os

im = Image.open("bangla.jpg") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save("bangla_pp.jpg")

pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
text = pytesseract.image_to_string(Image.open("bangla_pp.jpg"),lang="ben")
print(text)

It works and produced the result below on the processed image. Apparently, the OCR result for the processed image is not as good as for the original one.

Result from the processed bangla_pp.jpg:

   প্রত্যাবর্তনকারীরা
   তাঁদের দেশে গিয়ে

   -~~-<~~~~--

   প্রত্যাবর্তন-পরবর্তী
   আর্থিক সহায়তা
    = পাবেন তার

Result from the original image, fed directly to Tesseract:

Code:

from PIL import Image    
import pytesseract as tess

print(tess.image_to_string(Image.open('bangla.jpg'), lang='ben'))

Output:

প্রত্যাবর্তনকারীরা
তাঁদের দেশে গিয়ে

প্রত্যাবর্তন-পরবর্তী
আর্থিক সহায়তা
পাবেন তার
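If the call still appears to run forever from Python, a timeout keeps the script from blocking. Recent pytesseract releases accept a timeout argument on image_to_string and raise RuntimeError when it expires; whether the installed version supports it is an assumption worth checking. The sketch below is Python 3, and the wrapper name ocr_or_none and its ocr parameter are hypothetical, added only so the helper can be tried without Tesseract installed.

```python
def ocr_or_none(image_path, lang="ben", seconds=30, ocr=None):
    """Return the OCR text, or None if the call does not finish in `seconds`.

    `ocr` is an optional callable (path, lang, timeout) -> str that stands
    in for pytesseract, for example in a test.
    """
    if ocr is None:
        # Imported lazily so the helper loads even without pytesseract.
        import pytesseract
        from PIL import Image

        def _default_ocr(path, lang, timeout):
            return pytesseract.image_to_string(
                Image.open(path), lang=lang, timeout=timeout)

        ocr = _default_ocr
    try:
        return ocr(image_path, lang, seconds)
    except RuntimeError:
        # pytesseract signals an expired timeout with RuntimeError.
        return None
```

Calling ocr_or_none("bangla.jpg") and ocr_or_none("bangla_pp.jpg") in turn shows whether either image is the one that stalls.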

Tags:

python

python-tesseract

Pial Kanti
