 

OCR for detecting bills [closed]

I am planning to create a mobile app that can scan a bill/invoice generated by a shop and extract key details from it, such as the shop name, address, items purchased and bill value. I understand I can use OCR to extract text from the bill (a scanned bill or a photo of the bill), but how would I then extract all these details? What approach should I use?

user3807940 asked Dec 07 '25 17:12



2 Answers

Well, the app you are trying to build will have 4 stages:

Data Extraction – The system should be able to extract text stored in file formats like DOC, PPT and PDF. It should also be able to extract text from images.

Data Identification – The next step after extraction is identifying data on the basis of user-defined patterns.

Data Classification – Classify the identified data into user-defined categories.

Data Handling – Perform different actions based on the category of the data identified in the previous steps.
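As a rough illustration of the identification and classification stages, here is a minimal Python sketch that applies user-defined patterns to text that has already come out of OCR. The patterns, the `parse_bill` name, the sample receipt and the assumption that the shop name is the first line are all hypothetical; real bills vary a lot and would need their own rules.

```python
import re

# Hypothetical user-defined patterns; they would need tuning per shop or region.
PATTERNS = {
    # A line that starts with "Total" or "Grand total" and ends with an amount.
    "total": re.compile(r"(?im)^\s*(?:grand\s+)?total\b[^\d\n]*(\d+(?:\.\d{2})?)\s*$"),
    # A date written as DD/MM/YYYY or DD-MM-YYYY.
    "date": re.compile(r"\b(\d{2}[/-]\d{2}[/-]\d{4})\b"),
}

# Assumption: an item line looks like "<name> <quantity> <price>".
ITEM_LINE = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z ]*?)\s+(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})$"
)


def parse_bill(ocr_text):
    """Classify pieces of OCR text into shop name, date, total and items."""
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    result = {
        # Assumption: the shop name is the first non-empty line of the bill.
        "shop_name": lines[0] if lines else None,
        "items": [],
    }
    for key, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        result[key] = match.group(1) if match else None
    for line in lines:
        m = ITEM_LINE.match(line)
        if m:
            result["items"].append(
                {"name": m["name"], "qty": int(m["qty"]), "price": float(m["price"])}
            )
    return result


SAMPLE = """\
Fresh Mart
12 High Street
Date: 07/12/2025
Milk 2 3.50
Bread 1 2.25
Subtotal 5.75
Total 5.75
"""

if __name__ == "__main__":
    print(parse_bill(SAMPLE))
```

Rules like these break easily when the layout changes, so in practice you would keep a set of patterns per bill format or fall back to a trained model.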

You are correct: you need to work with OCR, i.e. Optical Character Recognition.

OCR is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document or a photo of a document.

There are also a lot of solutions available on the market for this, both commercial products and libraries.

Commercial Products:

Google Docs (Free)

ABBYY FineReader Pro (Paid)

OmniPage Standard (Paid)

Readiris Pro (Paid)

But if you still want to build your own product, you can use TESSERACT-OCR and build your app using Java/Python. Tesseract is one of the most accurate open-source OCR engines available.

Combined with the Leptonica Image Processing Library, it can read a wide variety of image formats and convert them to text in over 60 languages.
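If you go the Tesseract route from Python, one simple way is to call the `tesseract` command-line tool. This is only a sketch: it assumes the `tesseract` binary (version 4 or later, for the `--psm` flag) is installed and on your PATH, and the function names and the choice of page segmentation mode 6 are my own guesses for receipts rather than anything prescribed.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=6):
    """Build the argument list for the tesseract CLI."""
    # "stdout" as the output base makes tesseract print the recognized text
    # instead of writing a .txt file. --psm 6 treats the image as a single
    # uniform block of text; other modes may suit your bills better.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="eng", psm=6):
    """Run tesseract on an image and return the recognized text."""
    completed = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return completed.stdout
```

Calling `ocr_image("bill.png")` (a hypothetical file name) would then give you the raw text to feed into your own extraction rules.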

You will also need APACHE TIKA, a library used for document type detection and content extraction from various file formats.

Internally, Tika uses various existing document parsers and document type detection techniques to detect and extract data.

Using Tika, one can develop a universal type detector and content extractor to extract both structured text as well as metadata from different types of documents such as spreadsheets, text documents, images, PDFs and even multimedia input formats to a certain extent.

Tika provides a single generic API for parsing different file formats. It uses 83 existing specialized parser libraries for each document type.

All these parser libraries are encapsulated under a single interface called the Parser interface.

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

It is a good idea to use Tika Server and Tesseract OCR together.

Used together, they give you integration with Google's TensorFlow image recognition via the Inception API, improved PDF parsing using OCR, message parsing and MIME detection.
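To give an idea of what talking to Tika Server looks like, here is a small Python sketch using only the standard library. It assumes a Tika Server instance is already running locally on its default port 9998 and that its `/tika` endpoint is reachable; the function names are made up for the example.

```python
import urllib.request

# Assumption: Tika Server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, url=TIKA_URL):
    """Build a PUT request that asks Tika Server for plain text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Generic type so Tika detects the real format from the content.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path, url=TIKA_URL):
    """Send a file to Tika Server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Whether images get OCRed this way depends on Tesseract being installed where the server runs, so treat that part as something to verify in your own setup.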

Google Vision API: an option if you are building your solution on Google Cloud Platform.

Google Vision API supports most of the image formats used on the Web, including GIF, BMP, WebP, RAW, ICO, etc.

Tests have not revealed any performance or quality issues based on the image format, although lossy formats such as JPEG might show worse results at very low resolutions (i.e. below 1MP).

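Google Cloud Vision accepts images sent inline (base64-encoded) in the request or referenced by a Google Cloud Storage URI. Files submitted for asynchronous batch processing must be stored on Google Cloud Storage.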
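For a file that is already on Google Cloud Storage, the JSON body for the Vision REST endpoint (`POST https://vision.googleapis.com/v1/images:annotate`) could be built roughly like this. It is a sketch only: authentication and the HTTP call are left out, the helper names are hypothetical, and choosing `DOCUMENT_TEXT_DETECTION` over `TEXT_DETECTION` is my assumption that bills count as dense text.

```python
import json


def build_vision_request(gcs_uri):
    """Build the images:annotate request body for one image on Cloud Storage."""
    return {
        "requests": [
            {
                "image": {"source": {"gcsImageUri": gcs_uri}},
                # Assumption: dense-text OCR suits receipts and invoices.
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }


def full_text(response_body):
    """Pull the recognized text out of a parsed images:annotate response."""
    responses = response_body.get("responses", [])
    if not responses:
        return ""
    return responses[0].get("fullTextAnnotation", {}).get("text", "")


if __name__ == "__main__":
    # Hypothetical bucket and object name.
    print(json.dumps(build_vision_request("gs://my-bucket/bill.jpg"), indent=2))
```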

Vision’s batch processing support is limited to 8MB per request. Therefore, a relatively large dataset of 1,000 modern images might easily require more than 200 batch requests. 
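The arithmetic behind that estimate can be sketched as follows, assuming every image has roughly the same size and ignoring any encoding overhead. The 8 MB limit comes from the paragraph above; the average image sizes are made-up illustration values.

```python
import math


def requests_needed(num_images, avg_image_mb, limit_mb=8.0):
    """Estimate how many batch requests a dataset needs under a size limit."""
    # How many whole images fit in one request.
    per_request = int(limit_mb // avg_image_mb)
    if per_request < 1:
        raise ValueError("a single image exceeds the per-request limit")
    return math.ceil(num_images / per_request)


if __name__ == "__main__":
    # 1.7 MB images: 4 fit in 8 MB, so 1,000 images need 250 requests.
    print(requests_needed(1000, 1.7))
```

So any average size above 1.6 MB per image pushes 1,000 images past 200 requests.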

Conclusion

For the best result with an open-source solution, use Apache TIKA together with TESSERACT OCR; the cost would be zero.

But if OCR is the key feature and you are looking for something reliable, go with the Google Vision API, which has more features and is more accurate and faster than the others.

It does come at a cost, though, so it counts as a paid solution.

Mayank Sethi answered Dec 11 '25 21:12



The best option that I have found, which is also free and works with most programming languages (C#, Java, Objective-C, Ruby, PHP, etc.), is Cloudmersive OCR.

It can automatically identify a document, receipt or bill within a photo, and then automatically extract the text very reliably.

I’m using it right now in a production business application, and it has been working pretty well so far.

Johnny answered Dec 11 '25 22:12



