Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding DetectedBreak in google OCR full text annotations

I am trying to convert the full-text annotations of google vision OCR result to line level and word level which is in Block,Paragraph,Word and Symbol hierarchy.

However, when converting symbols to word text and word to line text, I need to understand the DetectedBreak property.

I went through This documentation.But I did not understand few of the them.

Can somebody explain what do the following Breaks mean? I only understood LINE_BREAK and SPACE.

  1. EOL_SURE_SPACE
  2. HYPHEN
  3. LINE_BREAK
  4. SPACE
  5. SURE_SPACE
  6. UNKNOWN

Can they be replaced by either a newline char or space ?

like image 624
Arun Gowda Avatar asked Oct 21 '25 11:10

Arun Gowda


1 Answers

The link you provided has the most detailed explanation available of what each of these stands for. I suppose the best way to get a better understanding is to run ocr on different images and compare the response with what you see on the corresponding image. The following python script runs DOCUMENT_TEXT_DETECTION on an image saved in GCS and prints all detected breaks except from the ones you have no trouble understanding (LINE_BREAK and SPACE), along with the word immediately preceding them to enable comparison.

import sys
import os
from google.cloud import storage
from google.cloud import vision

def detect_breaks(gcs_image):

    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/json'
    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    image_source = vision.types.ImageSource(
        image_uri=gcs_image)

    image = vision.types.Image(
        source=image_source)

    request = vision.types.AnnotateImageRequest(
        features=[feature], image=image)

    annotation = client.annotate_image(request).full_text_annotation

    breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
    word_text = ""
    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        word_text += symbol.text
                        if symbol.property.detected_break.type:
                            if symbol.property.detected_break.type == breaks.SPACE or symbol.property.detected_break.type == breaks.LINE_BREAK:
                                word_text = ""
                            else:
                                print word_text,symbol.property.detected_break
                                word_text = ""

if __name__ == '__main__':
    detect_breaks(sys.argv[1])
like image 85
Lefteris S Avatar answered Oct 23 '25 08:10

Lefteris S



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!