
How to delete all images from a PDF without corrupting it using CAM::PDF?

Tags: pdf, perl, cam-pdf

The script below removes all images from a PDF file using CAM::PDF. The output, however, is corrupt: PDF readers can still open it, but they report errors. For instance, mupdf says:

error: no XObject subtype specified
error: cannot draw xobject/image
warning: Ignoring errors during rendering
mupdf: warning: Errors found on page

Now, the CAM::PDF page on CPAN lists the deleteObject() method under "Deeper utilities", presumably meaning that it is not intended for public use. Moreover, it warns:

This function does NOT take care of dependencies on this object.

My question is: what is the right way to remove objects from a PDF file using CAM::PDF? If the issue has to do with dependencies, how can I remove an object while taking care of its dependencies?

For how to remove images from a PDF using other tools, see a related question here.

use CAM::PDF;
my $pdf = CAM::PDF->new ( shift ) or die $CAM::PDF::errstr;

foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
  my $xobj = $pdf->dereference ( $objnum );

  if ( $xobj->{value}->{type} eq 'dictionary' ) {
    my $im = $xobj->{value}->{value};
    if
    (
      defined $im->{Type} and defined $im->{Subtype}
      and $pdf->getValue ( $im->{Type}    ) eq 'XObject'
      and $pdf->getValue ( $im->{Subtype} ) eq 'Image'
    )
    {
      $pdf->deleteObject ( $objnum );
    }
  }
}

$pdf->cleanoutput ( '-' );

n.r. asked Dec 08 '25 04:12


1 Answer

This uses CAM::PDF, but takes a slightly different approach. Rather than attempting to delete the images, which is pretty hard, it replaces each image with a transparent image.
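
One reason deletion is hard is that other objects (for example, a page's /Resources /XObject dictionary) can keep pointing at the object that was removed. The following is only a rough sketch for listing such referrers before deleting anything. It assumes that CAM::PDF represents an indirect reference as a node with type 'reference' whose value is the target object number, and that dictionaries and arrays hold child nodes; refers_to and referrers_of are hypothetical helper names, not part of CAM::PDF.

```perl
use strict;
use warnings;

# Return 1 if the node tree rooted at $node contains a reference to $target.
# ASSUMPTION: nodes are hashes of the form { type => ..., value => ... }, and
# an indirect reference is { type => 'reference', value => <object number> }.
sub refers_to {
    my ( $node, $target ) = @_;
    return 0 unless ref $node eq 'HASH' and defined $node->{type};
    my ( $type, $value ) = @{$node}{qw(type value)};
    return ( $value == $target ? 1 : 0 ) if $type eq 'reference';
    my @children = $type eq 'dictionary' ? values %{$value}
                 : $type eq 'array'      ? @{$value}
                 :                         ();
    for my $child (@children) {
        return 1 if refers_to( $child, $target );
    }
    return 0;
}

# List the numbers of all objects that refer to object $target.
sub referrers_of {
    my ( $pdf, $target ) = @_;
    my @referrers;
    for my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
        next if $objnum == $target;
        my $obj = $pdf->dereference($objnum) or next;
        push @referrers, $objnum if refers_to( $obj->{value}, $target );
    }
    return @referrers;
}

# Only runs when a PDF path and an object number are given on the command line.
if ( @ARGV == 2 ) {
    require CAM::PDF;
    my ( $path, $target ) = @ARGV;
    my $pdf = CAM::PDF->new($path) or die "cannot open $path\n";
    print "$_\n" for referrers_of( $pdf, $target );
}
```

Note that page content streams refer to images by resource name rather than by object number, so a walk like this only finds dictionary-level references. Cleaning up every use of an image would also mean editing the content streams, which is why replacing the image in place is the simpler route.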

First, note that we can use ImageMagick to generate a PDF that contains nothing but a transparent image:

% convert -size 200x100 xc:none transparent.pdf

If we view the generated PDF in a text editor, we can find the main image object:

8 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Im0
...

The important thing to note here is that we have generated a transparent image as object number 8.
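
The object number may differ depending on the tool and version that wrote transparent.pdf, so it can be safer to look it up than to read it off in a text editor. The sketch below reuses the same image test as the question's script to list every image XObject in a PDF; find_image_objnums is a hypothetical helper name, and the script name in the usage note is likewise made up.

```perl
use strict;
use warnings;

# Return the object numbers of all image XObjects in a CAM::PDF document,
# i.e. dictionaries with /Type /XObject and /Subtype /Image.
sub find_image_objnums {
    my ($pdf) = @_;
    my @found;
    for my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
        my $obj  = $pdf->dereference($objnum) or next;
        my $node = $obj->{value};
        next unless ref $node eq 'HASH' and ( $node->{type} // '' ) eq 'dictionary';
        my $dict = $node->{value};
        next unless defined $dict->{Type} and defined $dict->{Subtype};
        push @found, $objnum
            if  $pdf->getValue( $dict->{Type} )    eq 'XObject'
            and $pdf->getValue( $dict->{Subtype} ) eq 'Image';
    }
    return @found;
}

# Only runs when a PDF path is given on the command line.
if ( my $path = shift @ARGV ) {
    require CAM::PDF;
    my $pdf = CAM::PDF->new($path) or die "cannot open $path\n";
    print "$_\n" for find_image_objnums($pdf);
}
```

Saved as, say, list_images.pl and run as "perl list_images.pl transparent.pdf", this should print a single object number for a file that contains exactly one image; that number is the one to use when importing the transparent image.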

It then becomes a matter of importing this object and using it to replace each of the real images in the PDF, effectively blanking them.

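use warnings;
use strict;
use CAM::PDF;

# PDF to process, given as the first command-line argument
my $pdf = CAM::PDF->new ( shift ) or die "$CAM::PDF::errstr\n";

# PDF generated by ImageMagick, holding the transparent replacement image
my $trans_pdf = CAM::PDF->new ( "transparent.pdf" ) or die "$CAM::PDF::errstr\n";
my $trans_objnum = 8; # object number of transparent image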

foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
  my $xobj = $pdf->dereference ( $objnum );

  if ( $xobj->{value}->{type} eq 'dictionary' ) {
    my $im = $xobj->{value}->{value};
    if
    (
      defined $im->{Type} and defined $im->{Subtype}
      and $pdf->getValue ( $im->{Type}    ) eq 'XObject'
      and $pdf->getValue ( $im->{Subtype} ) eq 'Image'
    )
    {
      # The final argument (1) asks CAM::PDF to also copy objects the image refers to
      $pdf->replaceObject ( $objnum, $trans_pdf, $trans_objnum, 1 );
    }
  }
}

$pdf->cleanoutput ( '-' );

The script now replaces each image in the PDF with the imported transparent image object (object number 8 from transparent.pdf).

dwarring answered Dec 10 '25 21:12