How can I losslessly delete all images from a PDF using CAM::PDF?

Question

3

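The following script uses CAM::PDF to delete every image in a PDF file. The output file, however, is corrupt. PDF readers can still open it, but they report errors. For example, mupdf prints: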

error: no XObject subtype specified
error: cannot draw xobject/image
warning: Ignoring errors during rendering
mupdf: warning: Errors found on page

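Now, the CAM::PDF page on CPAN lists the deleteObject() method under "Deeper utilities", which may mean it is not intended for public use. It also warns:

This function does NOT take care of dependencies on this object.

My question is: what is the correct way to delete objects from a PDF file using CAM::PDF? If the problem is related to dependencies, how do I delete an object while also taking care of its dependencies?

For ways to remove images from a PDF with other tools, see the related question on that topic.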

use CAM::PDF;    
my $pdf = new CAM::PDF ( shift ) or die $CAM::PDF::errstr;

foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
  my $xobj = $pdf->dereference ( $objnum );

  if ( $xobj->{value}->{type} eq 'dictionary' ) {
    my $im = $xobj->{value}->{value};
    if
    (
      defined $im->{Type} and defined $im->{Subtype}
      and $pdf->getValue ( $im->{Type}    ) eq 'XObject'
      and $pdf->getValue ( $im->{Subtype} ) eq 'Image'
    )
    {
      $pdf->deleteObject ( $objnum );
    }
  }
}

$pdf->cleanoutput ( '-' );

- n.r.

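Do you have one of the corrupted PDFs that triggers the mupdf errors, and could you share it? I'm debugging a similar issue and it would be really helpful :) - Darajan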

2 Answers

2

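Another approach, one that really deletes the images, is to:

find the image XObjects in the resource dictionaries and delete them,
keep an array of the names of the deleted resources,
replace the corresponding Do operators in each page's content with spaces of the same length,
cleanse and print.

Note that dwarring's approach is safer, because it does not have to call $doc->cleanse at the end. According to the CAM::PDF documentation, the cleanse method will

Remove unused objects. WARNING: this function breaks some PDF documents because it removes objects that are strictly part of the page model hierarchy, but which are required anyway (like some font definition objects).

I don't know how much of a problem using cleanse is in practice.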

use strict;
use warnings;
use CAM::PDF;

my $doc = CAM::PDF->new( shift ) or die $CAM::PDF::errstr;

# delete image XObjects among resources
# but keep their names

my @names;

foreach my $objnum ( sort { $a <=> $b } keys %{ $doc->{xref} } ) {
  my $obj = $doc->dereference( $objnum );
  next unless $obj->{value}->{type} eq 'dictionary';

  my $n = $obj->{value}->{value};

  my $resources = $doc->getValue ( $n->{Resources}       ) or next;
  my $resource  = $doc->getValue ( $resources->{XObject} ) or next;

  # dereference explicitly: "keys $hashref" no longer works on modern Perls
  foreach my $name ( sort keys %{ $resource } ) {
    my $im = $doc->getValue ( $resource->{$name} ) or next;

    next unless defined $im->{Type}
            and defined $im->{Subtype}
            and $doc->getValue ( $im->{Type}    ) eq 'XObject'
            and $doc->getValue ( $im->{Subtype} ) eq 'Image';

    delete $resource->{$name};
    push @names, $name;
  }
}

# delete the corresponding Do operators

if ( @names ) {
  foreach my $p ( 1 .. $doc->numPages ) {
    my $content = $doc->getPageContent ( $p );
    my $s = 0;   # counts the names that were blanked on this page
    foreach my $name ( @names ) {
      ++$s if $content =~ s{( / \Q$name\E \s+ Do \b )} { ' ' x length $1 }xeg;
    }
    $doc->setPageContent ( $p, $content ) if $s;
  }
}

$doc->cleanse;
$doc->cleanoutput;
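
To see what the Do substitution does in isolation, here is a minimal sketch that runs the same regular expression on a made-up content stream, with no PDF file involved. The helper name blank_do_operators and the sample stream are assumptions for illustration only; they are not part of CAM::PDF.

```perl
use strict;
use warnings;

# Blank out every "/Name Do" for the given XObject names,
# replacing each match with the same number of spaces.
sub blank_do_operators {
    my ($content, @names) = @_;
    for my $name (@names) {
        $content =~ s{( / \Q$name\E \s+ Do \b )}{ ' ' x length $1 }xeg;
    }
    return $content;
}

# Made-up content stream: one image invocation (Im0) and one line of text.
my $before = "q 200 0 0 100 50 600 cm /Im0 Do Q\nBT /F1 12 Tf (Hello) Tj ET\n";
my $after  = blank_do_operators($before, 'Im0');

print $after;
```

The image invocation disappears while the surrounding q ... Q pair and the text operators are left alone, and the stream keeps its original length.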

- n.r.

- dwarring · Accepted Answer

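This uses CAM::PDF, but takes a slightly different approach. Rather than trying to delete the images (which is fairly hard), it replaces each image with a transparent one.

First, note that we can use ImageMagick to generate a blank PDF that contains nothing but a transparent image: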

% convert  -size 200x100 xc:none transparent.pdf

If we look at the generated PDF in a text editor, we can find the main image object:

8 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Im0
...

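The important point is that we have generated a transparent image as object number 8 (the number may differ in a file produced by another ImageMagick version, so check your own transparent.pdf). We then import that object and use it to replace every real image in the PDF, effectively blanking them out.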

use warnings;
use strict;
use CAM::PDF;

my $pdf = CAM::PDF->new( shift ) or die $CAM::PDF::errstr;

my $trans_pdf = CAM::PDF->new( "transparent.pdf" ) or die "$CAM::PDF::errstr\n";
my $trans_objnum = 8; # object number of transparent image

foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
  my $xobj = $pdf->dereference ( $objnum );

  if ( $xobj->{value}->{type} eq 'dictionary' ) {
    my $im = $xobj->{value}->{value};
    if
    (
      defined $im->{Type} and defined $im->{Subtype}
      and $pdf->getValue ( $im->{Type}    ) eq 'XObject'
      and $pdf->getValue ( $im->{Subtype} ) eq 'Image'
    ) {
        $pdf->replaceObject ( $objnum, $trans_pdf, $trans_objnum, 1 );
    }
  }
}

$pdf->cleanoutput ( '-' );

The script now replaces every image in the PDF with the imported transparent image object (object number 8 from transparent.pdf).
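
The hard-coded object number 8 is specific to one particular transparent.pdf. As a hedged sketch, the number could instead be looked up with the same Type/Subtype test the scripts in this thread already use. The helper name image_object_numbers is hypothetical; it relies only on $pdf->{xref}, dereference() and getValue() as they are used above, and assumes they behave as they do in those scripts.

```perl
use strict;
use warnings;

# Return the object numbers of all image XObjects in a CAM::PDF document.
# (Hypothetical helper, not part of CAM::PDF itself.)
sub image_object_numbers {
    my ($pdf) = @_;
    my @found;
    for my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
        my $obj = $pdf->dereference($objnum) or next;
        next unless $obj->{value}{type} eq 'dictionary';
        my $dict = $obj->{value}{value};
        next unless defined $dict->{Type} and defined $dict->{Subtype};
        push @found, $objnum
            if $pdf->getValue( $dict->{Type} )    eq 'XObject'
           and $pdf->getValue( $dict->{Subtype} ) eq 'Image';
    }
    return @found;
}

# When run with a file name, list the image object numbers in that file.
if (@ARGV) {
    no warnings 'once';   # $CAM::PDF::errstr is only defined after require
    require CAM::PDF;
    my $doc = CAM::PDF->new( $ARGV[0] ) or die "$CAM::PDF::errstr\n";
    print "$_\n" for image_object_numbers($doc);
}
```

In the replacement script, my ($trans_objnum) = image_object_numbers($trans_pdf); could then stand in for the literal 8, assuming transparent.pdf contains exactly one image.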