How to compare two pdf documents using Apache Tika

Question

I want to compare two PDF documents: not only their contents, but also other information such as headers, footers and styles.

I learned that Apache Tika can be used for this kind of comparison. So far I have managed to parse a PDF document and extract some metadata, such as title and author.

This is what I have right now:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class CompareDocs {

    private void parseResource(String resourceName) {
        System.out.println("Parsing resource : " + resourceName);

        // try-with-resources closes the stream even if parsing fails,
        // and parsing is never attempted on a stream that failed to open
        try (InputStream inputStream = new BufferedInputStream(new FileInputStream(resourceName))) {
            Parser parser = new AutoDetectParser();
            // -1 disables BodyContentHandler's default write limit (100,000 characters)
            ContentHandler contentHandler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();

            parser.parse(inputStream, contentHandler, metadata, new ParseContext());

            // Print every metadata field Tika found
            for (String name : metadata.names()) {
                String value = metadata.get(name);
                System.out.println("Metadata Name: " + name);
                System.out.println("Metadata Value: " + value);
            }

            // Key names may differ between Tika versions; check the names printed above
            System.out.println("Title: " + metadata.get("title"));
            System.out.println("Author: " + metadata.get("Author"));
            System.out.println("content: " + contentHandler.toString());

        } catch (IOException | TikaException | SAXException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        CompareDocs apacheTikaParser = new CompareDocs();
        // Backslashes in a Java string literal must be escaped
        apacheTikaParser.parseResource("C:\\Users\\prakhar\\Desktop\\beautiful_code.pdf");
    }
}

How can I extract further information, such as the header distance of the first section or image height and width, and compare it with another PDF using Apache Tika?

SANN3 · Accepted Answer

Tika detects and extracts metadata and structured text content. It does not expose layout details such as the header distance of the first section or image height and width.

You can try PDFBox or iText.
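For the part Tika does cover (metadata), a comparison of two documents could look roughly like the sketch below. It assumes you have already copied each document's fields into a `Map<String, String>` by looping over `metadata.names()` and calling `metadata.get(name)`, as in the question. The class name `MetadataDiff` and the method `diff` are made up for this illustration and are not part of Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

// Hypothetical helper, not part of Tika: compares two metadata maps field by field.
final class MetadataDiff {

    /**
     * Returns the fields whose values differ between the two documents.
     * Key: field name. Value: "left -> right"; a field missing on one side shows as null.
     */
    static Map<String, String> diff(Map<String, String> left, Map<String, String> right) {
        // Union of the field names from both documents, sorted for stable output
        TreeSet<String> names = new TreeSet<>(left.keySet());
        names.addAll(right.keySet());

        Map<String, String> differences = new LinkedHashMap<>();
        for (String name : names) {
            String a = left.get(name);
            String b = right.get(name);
            // Objects.equals treats two nulls as equal and handles a null on either side
            if (!Objects.equals(a, b)) {
                differences.put(name, a + " -> " + b);
            }
        }
        return differences;
    }
}
```

An empty result means both documents reported the same metadata fields with the same values. It says nothing about layout, which is where PDFBox or iText would come in.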

yeaaaahhhh..hamf hamf · Answer

If you want access to more information, it may be more convenient to use another API such as PDFTextStream. Tika extracts raw textual information from a PDF, while PDFTextStream gives you structured text with correlated information such as character encoding, height and the region of the text.
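If raw text is enough for your comparison, the strings returned by `contentHandler.toString()` for the two documents can be compared directly. The sketch below is a naive line-by-line positional comparison, not a real diff algorithm: a single inserted line makes every following line count as different. `TextDiff`, `normalize` and `differingLines` are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: positional comparison of two extracted texts.
final class TextDiff {

    /** Trims the line and collapses whitespace runs, so spacing-only changes are ignored. */
    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    /** Returns the 1-based line numbers at which the two texts differ after normalization. */
    static List<Integer> differingLines(String left, String right) {
        // \R matches any line break; limit -1 keeps trailing empty lines
        String[] a = left.split("\\R", -1);
        String[] b = right.split("\\R", -1);

        List<Integer> result = new ArrayList<>();
        int max = Math.max(a.length, b.length);
        for (int i = 0; i < max; i++) {
            // A line that exists in only one text is represented as null
            String x = i < a.length ? normalize(a[i]) : null;
            String y = i < b.length ? normalize(b[i]) : null;
            if (x == null || !x.equals(y)) {
                result.add(i + 1);
            }
        }
        return result;
    }
}
```

Calling `TextDiff.differingLines(textOfFirstPdf, textOfSecondPdf)` gives a rough idea of where the contents diverge. Styles, fonts and positions are not visible at this level.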

Tags:

java

pdf

apache

apache-tika

unknown_boundaries
