I have a bunch of PDF files and my Perl program needs to do a full-text search of them to return which ones contain a specific string. To date I have been using this:
my @search_results = `grep -i -l \"$string\" *.pdf`;
where $string is the text to look for. However this fails for most pdf's because the file format is obviously not ASCII.
What can I do that's easiest?
Clarification: There are about 300 pdf's whose name I do not know in advance. PDF::Core is probably overkill. I am trying to get pdftotext and grep to play nice with each other given I don't know the names of the pdf's, I can't find the right syntax yet.
Final solution using Adam Bellaire's suggestion below:
@search_results = `for i in \$( ls ); do pdftotext \$i - | grep --label="\$i" -i -l "$search_string"; done`;
This can be done in multiple ways as per the user's requirement. Searching in Perl follows the standard format of first opening the file in the read mode and further reading the file line by line and then look for the required string or group of strings in each line.
Simple word matching In this statement, World is a regex and the // enclosing /World/ tells Perl to search a string for a match. The operator =~ associates the string with the regex match and produces a true value if the regex matched, or false if the regex did not match.
The PerlMonks thread here talks about this problem.
It seems that for your situation, it might be simplest to get pdftotext (the command line tool), then you can do something like:
my @search_results = `pdftotext myfile.pdf - | grep -i -l \"$string\"`;
My library, CAM::PDF, has support for extracting text, but it's an inherently hard problem given the graphical orientation of PDF syntax. So, the output is sometimes gibberish. CAM::PDF bundles a getpdftext.pl program, or you can invoke the functionality like so:
my $doc = CAM::PDF->new($filename) || die "$CAM::PDF::errstr\n";
for my $pagenum (1 .. $doc->numPages()) {
   my $text = $doc->getPageText($pagenum);
   print $text;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With