
How to compare two pdf documents using Apache Tika

I want to compare two PDF documents: not only their contents but also other information such as headers, footers, and styles.

I learned that Apache Tika can be used for this kind of comparison. So far I can parse a PDF document and extract some metadata such as the title and author.

This is what I have right now:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class CompareDocs {

    private void parseResource(String resourceName) {
        System.out.println("Parsing resource : " + resourceName);

        // try-with-resources closes the stream even if parsing fails, and a
        // missing file is reported once instead of passing a null stream on.
        try (InputStream inputStream =
                 new BufferedInputStream(new FileInputStream(resourceName))) {

            Parser parser = new AutoDetectParser();
            // -1 disables the default 100,000-character limit on extracted text
            ContentHandler contentHandler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();

            parser.parse(inputStream, contentHandler, metadata, new ParseContext());

            // Print every metadata entry Tika found
            for (String name : metadata.names()) {
                String value = metadata.get(name);
                System.out.println("Metadata Name: " + name);
                System.out.println("Metadata Value: " + value);
            }

            // Key names depend on the Tika version; check the list printed above
            System.out.println("Title: " + metadata.get("title"));
            System.out.println("Author: " + metadata.get("Author"));
            System.out.println("content: " + contentHandler.toString());

        } catch (IOException | TikaException | SAXException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        CompareDocs apacheTikaParser = new CompareDocs();
        apacheTikaParser.parseResource("C:\\Users\\prakhar\\Desktop\\beautiful_code.pdf");
    }
}

How can I extract more information, such as the header distance of the first section or the height and width of images, and compare it with another PDF using Apache Tika?

unknown_boundaries asked Dec 27 '25 15:12



2 Answers

Tika detects and extracts metadata and structured text content. It does not support finding the header distance of the first section, image height and width, and similar layout details.

You can try PDFBox or iText instead.
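
For the part Tika does cover, the metadata, a comparison between two documents could be sketched roughly as follows. This is only an illustration: the `MetadataDiff` class and its `diff` method are hypothetical names, not part of Tika, and the sketch assumes each document's metadata has already been copied into a `Map<String, String>`, for example inside the `metadata.names()` loop from the question.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not a Tika class: compares two metadata maps.
public class MetadataDiff {

    /**
     * Returns one entry per metadata key whose value differs between the
     * two documents, formatted as "leftValue -> rightValue". A key that is
     * missing on one side shows up with the text "null" on that side.
     */
    public static Map<String, String> diff(Map<String, String> left,
                                           Map<String, String> right) {
        // Union of the keys from both documents, sorted for stable output
        Set<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());

        Map<String, String> differences = new TreeMap<>();
        for (String key : keys) {
            String leftValue = left.get(key);
            String rightValue = right.get(key);
            if (!Objects.equals(leftValue, rightValue)) {
                differences.put(key, leftValue + " -> " + rightValue);
            }
        }
        return differences;
    }
}
```

An empty result means the two documents report identical metadata. It says nothing about layout, which is why a library such as PDFBox or iText is still needed for header positions and image dimensions.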

SANN3 answered Dec 31 '25 05:12



If you want access to more information, it may be more convenient to use another API such as PDFTextStream. Tika extracts raw text from a PDF, while PDFTextStream gives you structured text with associated information such as character encoding, height, and the region of the text.

yeaaaahhhh..hamf hamf answered Dec 31 '25 07:12



