
How to return a string if a re.findall finds no match

I am writing a script that takes scanned PDF files and converts them into lines of text to enter into a database. I use re.findall with a list of regular expressions to pull certain values out of the strings extracted by tesseract. My problem is that when a regular expression can't find a match, I want it to return "Error" so I can see that something went wrong.

I have tried a handful of if/else statements, but I can't seem to get any of them to notice the None value.

from wand.image import Image as Img
import ghostscript
from PIL import Image
import pytesseract
import re
import os

def get_text_from_pdf(pendingpdf,pendingimg):
    with Img(filename=pendingpdf, resolution=300) as img:
        img.compression_quality = 99
        img.save(filename=pendingimg)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
    extractedtext = pytesseract.image_to_string(Image.open(pendingimg))
    os.unlink(pendingimg)
    return extractedtext

def get_results(vendor,extracted_string,results):
    for v in vendor:
        pattern = re.compile(v)
        for match in re.findall(pattern,extracted_string):
            if type(match) is str:
                results.append(match)
            else:
                results.append("Error")
    return results

pendingpdf = r'J:\TBHscan07022019090315001.pdf'
pendingimg = 'Test1.jpg'
aggind = ["^(\w+)(?:.+)\n+3600",
          "Ticket: (nonsensewordstothrowerror)",
          "Ticket: \d+\s([0-9|/]+)",
          "Product: (\w+.+)\n",
          "Quantity: ([\d\.]+)",
          "Truck (\w+)"]
vendor = aggind
extracted_string = get_text_from_pdf(pendingpdf,pendingimg)
results = []

print(get_results(vendor,get_text_from_pdf(pendingpdf,pendingimg),results))
Matthew Keith asked Jan 23 '26 22:01


2 Answers

You could do this in a single line:

results += re.findall(pattern, extracted_string) or ["Error"]
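
To see why this works: re.findall returns an empty list when nothing matches (never None), and an empty list is falsy, so the or expression falls through to the fallback. A minimal sketch, assuming a made-up sample_text in place of the real OCR output:

```python
import re

# Hypothetical sample text standing in for the tesseract output.
sample_text = "Ticket: 12345 07/02/2019\nQuantity: 21.5\n"

results = []
for pattern in [r"Quantity: ([\d.]+)", r"Truck (\w+)"]:
    # findall returns [] (falsy) when there is no match,
    # so `or` substitutes the one-element fallback list.
    results += re.findall(pattern, sample_text) or ["Error"]

print(results)  # ['21.5', 'Error']
```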

BTW, you get no benefit from compiling the pattern inside the vendor loop because you're only using it once.

Your function could also return the whole search result using a single list comprehension:

return [m for v in vendor for m in re.findall(v, extracted_string) or ["Error"]]

It is a bit odd to both modify AND return the results list that is passed in as a parameter. This may produce unexpected side effects when you use the function.

Your "Error" flag may appear several times in the result list, and given that each pattern may return multiple matches, it will be hard to determine which pattern failed to find a value.
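
One possible way around that, sketched here under assumptions (the function name find_by_pattern and the sample text are invented for illustration), is to keep the matches grouped per pattern so a failing pattern can be identified directly:

```python
import re

def find_by_pattern(patterns, text):
    """Return a new dict mapping each pattern to its list of matches."""
    # An empty list means that pattern found nothing.
    return {p: re.findall(p, text) for p in patterns}

# Hypothetical sample text standing in for the tesseract output.
sample_text = "Ticket: 12345 07/02/2019\nQuantity: 21.5\n"
patterns = [r"Ticket: \d+\s([0-9/]+)", r"Quantity: ([\d.]+)", r"Truck (\w+)"]

by_pattern = find_by_pattern(patterns, sample_text)
failed = [p for p, matches in by_pattern.items() if not matches]

print(by_pattern[patterns[0]])  # ['07/02/2019']
print(failed)                   # ['Truck (\\w+)']
```

Because the function builds and returns a fresh dict, it also avoids mutating a list supplied by the caller.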

If you only want to signal an error when none of the vendor patterns match, you could use the or ["Error"] trick on the whole result:

return [m for v in vendor for m in re.findall(v, extracted_string)] or ["Error"]
Alain T. answered Jan 25 '26 10:01


With an approach like for match in re.findall(pattern,extracted_string): the body of the for loop won't run at all if re.findall(...) finds no matches, so the else branch inside it is never reached.

Save the result of the match in a variable first, then test it with a condition:

...
matches = re.findall(pattern, extracted_string)
if not matches:
    results.append("Error")
else:
    for match in matches:
        results.append(match)
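
For comparison, here is a rough, self-contained sketch of both shapes side by side. The function names and the sample text are made up for illustration, and the second function builds its own list rather than taking one as a parameter:

```python
import re

def get_results_loop_only(patterns, text):
    # Shape of the question's code: the else branch sits inside the loop,
    # so it never runs when findall returns an empty list.
    results = []
    for p in patterns:
        for match in re.findall(p, text):
            if type(match) is str:
                results.append(match)
            else:
                results.append("Error")
    return results

def get_results_checked(patterns, text):
    # Check the list of matches before iterating over it.
    results = []
    for p in patterns:
        matches = re.findall(p, text)
        if not matches:
            results.append("Error")
        else:
            results.extend(matches)
    return results

# Hypothetical sample text standing in for the tesseract output.
sample_text = "Ticket: 12345 07/02/2019\nQuantity: 21.5\n"
patterns = [r"Quantity: ([\d.]+)", r"Truck (\w+)"]

print(get_results_loop_only(patterns, sample_text))  # ['21.5']
print(get_results_checked(patterns, sample_text))    # ['21.5', 'Error']
```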

Note that when iterating over the results of re.findall(...), the check if type(match) is str: doesn't make sense: with patterns like these, which each contain a single capturing group, every matched item is already a string (a check would only be meaningful if it did a more sophisticated analysis of the string's content).
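
As a quick illustration of what re.findall hands back, the item type depends on how many capturing groups the pattern has. The input string below is invented for the example:

```python
import re

text = "a 1/2 b 3/4"  # hypothetical input

no_groups = re.findall(r"\d+/\d+", text)       # whole matches, as strings
one_group = re.findall(r"(\d+)/\d+", text)     # the group's text, as strings
two_groups = re.findall(r"(\d+)/(\d+)", text)  # one tuple of strings per match
no_match = re.findall(r"x", text)              # an empty list, never None

print(no_groups)   # ['1/2', '3/4']
print(one_group)   # ['1', '3']
print(two_groups)  # [('1', '2'), ('3', '4')]
print(no_match)    # []
```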

RomanPerekhrest answered Jan 25 '26 12:01