I am using the tika-app jar in my project. Is there a way to disable Tesseract OCR in Tika? Two constraints must be respected:
1. Tesseract cannot be uninstalled.
2. tika.xml can't be edited, because tika-app.jar is used off the shelf.
Is there a way to disable OCR from the Java code, by setting a property on the context or the parser?
I tried the code below, but OCR still extracts text from image files during parsing.
```java
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setOcrStrategy(OCR_STRATEGY.NO_OCR);
context.set(PDFParserConfig.class, pdfConfig);
```
```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
        </parser>
    </parsers>
</properties>
```
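The config above excludes `TesseractOCRParser` from the `DefaultParser`, so image files are no longer routed to OCR. Since the bundled tika.xml can't be edited, one possible approach is to ship this config with your own application and write it to a separate file at runtime. The sketch below does only that, using the Java standard library; the class name `NoOcrTikaConfig` and its helper methods are hypothetical names chosen for illustration, and the Tika wiring mentioned in the final comment is an assumption that depends on your Tika version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class NoOcrTikaConfig {

    // Same exclusion as the config above, kept as a string so it can live inside the application.
    public static final String CONFIG_XML = String.join("\n",
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
        "<properties>",
        "    <parsers>",
        "        <parser class=\"org.apache.tika.parser.DefaultParser\">",
        "            <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>",
        "        </parser>",
        "    </parsers>",
        "</properties>",
        "");

    /** Writes the config to a temporary file and returns its path. */
    public static Path writeConfig() throws IOException {
        Path path = Files.createTempFile("tika-config-no-ocr", ".xml");
        Files.write(path, CONFIG_XML.getBytes(StandardCharsets.UTF_8));
        return path;
    }

    /** Returns the class names listed in the file's parser-exclude elements. */
    public static List<String> excludedParsers(Path config) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(config.toFile());
        NodeList nodes = doc.getElementsByTagName("parser-exclude");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(((Element) nodes.item(i)).getAttribute("class"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Path config = writeConfig();
        System.out.println("Wrote " + config + ", excluding " + excludedParsers(config));
        // Assumption: with Tika on the classpath, this file could then be loaded via
        // new AutoDetectParser(new TikaConfig(config.toFile())), or handed to
        // tika-app on the command line as --config=<path>.
    }
}
```

The bundled tika.xml stays untouched; the exclusion lives in a file your code controls. Reading the file back with a DOM parser is only a sanity check that the XML is well-formed and names the parser you meant to exclude.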