
How to do only page segmentation / layout detection with Tesseract (mode --psm 2)?

I would like to use Tesseract's page segmentation without running the OCR, because I have my own custom OCR model and running page segmentation AND OCR takes too long. I tried the --psm 2 mode both on the Tesseract command line and in pytesseract, and it didn't work as documented.

I'm working in Linux, and am coding in Python 3.10.

I currently use the Tesseract OCR API from layoutparser (see its documentation). The code looks like this:

import layoutparser as lp
ocr_agent = lp.TesseractAgent()
res = ocr_agent.detect(img_path, return_response=True)
layout_info = res['data']

layout_info is then a pd.DataFrame containing layout information at the level of blocks, paragraphs, lines and words, plus the OCR output. The problem is that this is very slow: on my machine it takes 7 s per image, and I don't actually need the OCR. Hence, I want page segmentation (sometimes also called layout detection) only.

According to the Tesseract documentation, there is a --psm 2 mode, "Automatic page segmentation, but no OSD, or OCR". When I try it on the command line, it does not produce an output file (even if the output type is specified):

tesseract img.png outfile --psm 2
tesseract img.png outfile --psm 2 tsv

I also tried working with the python wrapper pytesseract, but it is quite slow and it again returns the pd.DataFrame with the layout AND OCR data, despite --psm 2 being specified:

import cv2
import pytesseract

img = cv2.imread(img_path)
layout_info = pytesseract.image_to_data(img, config='tsv --psm 2', output_type='data.frame')

I'm using pytesseract==0.3.10 and tesseract 5.3.3-30-gea0b.

Do you have any ideas on how I can achieve page segmentation without OCR with Tesseract (or at least speed up the processing time of page segmentation + OCR)?

Vera Bernhard asked Nov 29 '25 18:11


2 Answers

You can check whether --psm 2 is implemented in your Tesseract build with the command:

tesseract --help-psm 2

Output on my machine:

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

The relevant line is:

--psm 2 Automatic page segmentation, but no OSD, or OCR. (not implemented)

So if the mode is marked as not implemented, you can't use it.
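
The same check can be scripted. This is only a minimal sketch, assuming the help text has the layout shown above (mode number, description, optional "(not implemented)" marker). The function names are hypothetical and not part of Tesseract or pytesseract:

```python
import re
import subprocess

# Matches lines such as "  2    Automatic page segmentation, ... (not implemented)".
# Continuation lines (no leading mode number) do not match and are skipped.
_MODE_LINE = re.compile(r"^\s*(\d+)\s+(\S.*)$")


def unimplemented_psm_modes(help_text):
    """Return the PSM numbers whose description contains '(not implemented)'."""
    modes = []
    for line in help_text.splitlines():
        match = _MODE_LINE.match(line)
        if match and "(not implemented)" in match.group(2):
            modes.append(int(match.group(1)))
    return modes


def psm_marked_unimplemented(mode, tesseract_cmd="tesseract"):
    """Run `tesseract --help-psm` and report whether `mode` carries the marker.

    Assumption: the binary is on PATH under the name given in `tesseract_cmd`.
    """
    result = subprocess.run(
        [tesseract_cmd, "--help-psm"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some builds may print help to stderr, so both streams are scanned.
    return mode in unimplemented_psm_modes(result.stdout + result.stderr)
```

Calling psm_marked_unimplemented(2) on a build that prints the listing above should return True. Note that the check only looks for the marker text, so it says nothing about modes that are missing from the listing.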

Processing time depends on the image quality and the amount of text. Do you have an example image where OCR time is a problem?

Hermann12 answered Dec 01 '25 06:12


Have a look at tesserocr, a Python wrapper around the Tesseract API. It gives you access to functionality that is not available through the tesseract executable (pytesseract just wraps the executable and has no direct access to the API).

I have not tested it, but with tesserocr you can call AnalyseLayout without running Recognize; see the function documentation in the Tesseract source code.
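
An untested sketch of that idea, assuming tesserocr is installed and that img_path points to a readable image. segment_page and boxes_to_records are hypothetical helper names; the tesserocr calls used are PyTessBaseAPI, SetImageFile, AnalyseLayout, iterate_level and BoundingBox:

```python
def boxes_to_records(boxes):
    """Convert (left, top, right, bottom) tuples into left/top/width/height dicts,
    mirroring the geometry columns of the pytesseract TSV output."""
    return [
        {"left": left, "top": top, "width": right - left, "height": bottom - top}
        for (left, top, right, bottom) in boxes
    ]


def segment_page(image_path, level_name="TEXTLINE"):
    """Return layout boxes for one image without calling Recognize().

    level_name is the name of a tesserocr.RIL member:
    BLOCK, PARA, TEXTLINE, WORD or SYMBOL.
    """
    # Imported here so that boxes_to_records stays usable without tesserocr.
    from tesserocr import PSM, RIL, PyTessBaseAPI, iterate_level

    level = getattr(RIL, level_name)
    boxes = []
    with PyTessBaseAPI(psm=PSM.AUTO) as api:
        api.SetImageFile(image_path)
        # Layout analysis only; the recognition step is never triggered.
        page_it = api.AnalyseLayout()
        if page_it is None:  # assumption: None is returned for an empty page
            return []
        for element in iterate_level(page_it, level):
            bbox = element.BoundingBox(level)
            if bbox is not None:
                boxes.append(bbox)
    return boxes_to_records(boxes)
```

Usage would be something like rows = segment_page(img_path, "BLOCK"), and the resulting list of dicts can be passed to pd.DataFrame(rows) if a DataFrame is needed. Whether this is actually faster than the 7 s per image has to be measured on your data.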

Tesseract's processing time also depends on your hardware (e.g. SSD vs. HDD, availability of SSE/AVX or NEON instructions).

user898678 answered Dec 01 '25 06:12