I want to compare two PDF documents: not only their contents but also other information such as headers, footers, and styles.
I learned that Apache Tika can be used for this kind of comparison. So far I can parse a PDF document and extract some metadata such as the title and author.
This is what I have right now:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class CompareDocs {

    private void parseResource(String resourceName) {
        System.out.println("Parsing resource : " + resourceName);
        // try-with-resources closes the stream even if parsing fails, and a
        // missing file no longer leads to parsing a null stream.
        try (InputStream inputStream = new BufferedInputStream(new FileInputStream(resourceName))) {
            Parser parser = new AutoDetectParser();
            // -1 disables the default 100,000-character write limit, which a
            // book-length PDF would otherwise exceed.
            ContentHandler contentHandler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            parser.parse(inputStream, contentHandler, metadata, new ParseContext());

            // Dump every metadata key/value pair that Tika found.
            for (String name : metadata.names()) {
                String value = metadata.get(name);
                System.out.println("Metadata Name: " + name);
                System.out.println("Metadata Value: " + value);
            }
            // Key names can differ between Tika versions; check the dump above.
            System.out.println("Title: " + metadata.get("title"));
            System.out.println("Author: " + metadata.get("Author"));
            System.out.println("content: " + contentHandler.toString());
        } catch (IOException | TikaException | SAXException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        CompareDocs apacheTikaParser = new CompareDocs();
        apacheTikaParser.parseResource("C:\\Users\\prakhar\\Desktop\\beautiful_code.pdf");
    }
}
How can I extract more information, such as the header distance of the first section or image height and width, and compare it with another PDF using Apache Tika?
Tika detects and extracts metadata and structured text content. It does not expose layout details such as the header distance of the first section or image height and width.
For those, you can try PDFBox or iText.
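For the part Tika does cover (metadata), the comparison step itself could look roughly like the sketch below. It assumes you have already copied each document's metadata into a plain Map<String, String>, for example by looping over metadata.names() as in the question. The class name MetadataDiff, the method diff, and the sample values in main are made up for illustration and are not part of Tika.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Tika: reports metadata keys whose values differ.
public class MetadataDiff {

    // Returns key -> "left | right" for every key whose values differ.
    // A key that is absent on one side is shown as "<missing>".
    public static Map<String, String> diff(Map<String, String> left, Map<String, String> right) {
        Set<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());

        Map<String, String> differences = new TreeMap<>();
        for (String key : keys) {
            String a = left.get(key);
            String b = right.get(key);
            if (!Objects.equals(a, b)) {
                differences.put(key, show(a) + " | " + show(b));
            }
        }
        return differences;
    }

    private static String show(String value) {
        return value == null ? "<missing>" : value;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for metadata extracted from two PDFs.
        Map<String, String> first = new TreeMap<>();
        first.put("title", "Document A");
        first.put("Author", "Someone");

        Map<String, String> second = new TreeMap<>();
        second.put("title", "Document B");
        second.put("Author", "Someone");
        second.put("producer", "Some PDF tool");

        // Print one line per differing key.
        diff(first, second).forEach((key, values) -> System.out.println(key + ": " + values));
    }
}
```

This only tells you which metadata fields differ. Layout details such as image sizes or header positions still have to come from a lower-level library like the ones mentioned above.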
If you want access to more information, it may be more convenient to use another API such as PDFTextStream. Tika extracts raw textual information from a PDF, while PDFTextStream gives you structured text with correlated information such as character encoding, height, and the region of the text.