I would like to ask a question about iText. I am facing a problem searching for text in a PDF file.
I can create a plain text file using the getTextFromPage() method, as shown in the following code sample:
/** The original PDF that will be parsed. */
public static final String PREFACE = "D:/B.pdf";
/** The resulting text file. */
public static final String RESULT = "D:/Result.txt";
public void ParsePDF(String From, String Destination) throws IOException {
    // Read from and write to the paths passed in, rather than ignoring the parameters.
    PdfReader reader = new PdfReader(From);
    PrintWriter out = new PrintWriter(new FileOutputStream(Destination));
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        out.println(PdfTextExtractor.getTextFromPage(reader, i));
    }
    out.flush();
    out.close();
    reader.close();
}
I'm trying to find a specific String in the resulting text like this:
public void FindWords(String From) {
    try {
        String ligneLue;
        LineNumberReader lnr = new LineNumberReader(new FileReader(RESULT));
        try {
            while ((ligneLue = lnr.readLine()) != null) {
                SearchForSVHC(ligneLue, SvhcList);
            }
        }
        finally {
            lnr.close();
        }
    }
    catch (IOException e) {
        System.out.println(e);
    }
}
public void SearchForSVHC(String Ligne, List<String> List) {
    for (String CAS : List) {
        if (Ligne.contains(CAS)) {
            System.out.print("Yes " + CAS);
            break;
        }
    }
}
My problem is that some PDFs I'm parsing consist of scanned images, which means that there is no real text, just pixels.
Does iText support Optical Character Recognition (OCR) and as a follow-up question: is there a way to determine if a PDF consists of scanned images?
I've done a very thorough edit of your question before answering it.
When a PDF consists of scanned images, there is no real text to parse; there are just images with pixels that look like text. You'd need to do OCR to know what is actually written on such a scanned page, and iText doesn't support OCR.
Regarding the follow-up question: it's very hard to find out if a PDF contains scanned images. A first give-away would be: there's only an image in the page, and there's no text.
However: as you don't know anything about the nature of the images (maybe you have a PDF containing nothing but holiday photos), it's very hard to find out if the PDF is a document full of scanned pages of text (that is: rasterized text).
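The "no text" half of that give-away can be sketched in plain Java. This is only a rough heuristic, not an iText feature: it assumes you have already collected one string per page from PdfTextExtractor.getTextFromPage(reader, i), and the class and method names (ScanHeuristic, pagesWithoutText) are made up for this illustration. A page flagged here may be a scan, but it may just as well be a photo or a blank page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of iText.
final class ScanHeuristic {

    /**
     * Returns the 1-based numbers of the pages whose extracted text is
     * null, empty, or whitespace only. Such pages are candidates for
     * being scanned images, nothing more.
     */
    static List<Integer> pagesWithoutText(List<String> pageTexts) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            if (text == null || text.trim().isEmpty()) {
                result.add(i + 1); // PDF page numbers start at 1
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from a three-page PDF.
        List<String> pageTexts = Arrays.asList("CAS 50-00-0 Formaldehyde", "  \n", "");
        System.out.println(pagesWithoutText(pageTexts)); // prints [2, 3]
    }
}
```

If every page of a document ends up in that list, it is a reasonable candidate for OCR, but you would still have to check that the pages actually contain images before drawing any conclusion.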
As of today, iText does have an OCR product, which uses Tesseract 4.x. You can get all of its documentation on their Knowledge Base.
Here's a quick example from there, showing how to OCR an image into a PDF/A-3u file.
import com.itextpdf.kernel.pdf.PdfOutputIntent;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.pdfocr.OcrPdfCreator;
import com.itextpdf.pdfocr.OcrPdfCreatorProperties;
import com.itextpdf.pdfocr.tesseract4.Tesseract4LibOcrEngine;
import com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
public class JDoodle {
    private static final List<File> LIST_IMAGES_OCR = Arrays.asList(new File("invoice_front.jpg"));
    private static final String OUTPUT_PDF = "/myfiles/hello.pdf";
    private static final String DEFAULT_RGB_COLOR_PROFILE_PATH = "profiles/sRGB_CS_profile.icm";
    public static void main(String[] args) throws IOException {
        OcrPdfCreatorProperties properties = new OcrPdfCreatorProperties();
        properties.setPdfLang("en"); // we need to define a language to make it PDF/A compliant
        OcrPdfCreator ocrPdfCreator = new OcrPdfCreator(new Tesseract4LibOcrEngine(new Tesseract4OcrEngineProperties()), properties);
        try (PdfWriter writer = new PdfWriter(OUTPUT_PDF)) {
            ocrPdfCreator.createPdfA(LIST_IMAGES_OCR, writer, getRGBPdfOutputIntent()).close();
        }
    }
    public static PdfOutputIntent getRGBPdfOutputIntent() throws FileNotFoundException {
        InputStream is = new FileInputStream(DEFAULT_RGB_COLOR_PROFILE_PATH);
        return new PdfOutputIntent("", "",
                "", "sRGB IEC61966-2.1", is);
    }
}
It's coming late, but I hope it helps.