Comparison of two PDF files

Question

pdf comparison pdfbox

10

I need to compare the contents of two almost identical files and highlight the differing parts in the corresponding PDF files. I am using pdfbox. Please help me at least with the logic.

- Aisharjya Sarkar

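5 Answers

5

You can achieve this on Linux with a shell script. The script wraps three components:

ImageMagick's compare command
the pdftk utility
Ghostscript

Converting it to a .bat batch file for DOS/Windows is also fairly easy...

Here are the building blocks:

pdftk

Use this command to split a multi-page PDF file into multiple single-page PDFs: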

pdftk  first.pdf  burst  output  somewhere/firstpdf_page_%03d.pdf
pdftk  2nd.pdf    burst  output  somewhere/2ndpdf_page_%03d.pdf

compare

Use this command to create a 'diff' PDF page for each page:

compare \
       -verbose \
       -debug coder -log "%u %m:%l %e" \
        somewhere/firstpdf_page_001.pdf \
        somewhere/2ndpdf_page_001.pdf \
       -compose src \
        somewhereelse/diff_page_001.pdf

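Note that compare is part of ImageMagick. For PDF processing, however, it needs Ghostscript as a 'delegate', because it cannot handle PDF natively.

pdftk again

Now you can concatenate the 'diff' PDF pages with pdftk: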

pdftk \
      somewhereelse/diff_page_*.pdf \
      cat \
      output somewhereelse/diff_allpages.pdf
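
The burst, compare and cat steps above can be tied together in a small driver. What follows is a minimal Python sketch, not part of the original answer: it only assembles the same pdftk and compare invocations and runs them with subprocess. The function names, the a/b/diff directory layout and the page_%03d.pdf naming are illustrative assumptions, and the -verbose and -debug flags are left out for brevity.

```python
import subprocess
from pathlib import Path


def burst_cmd(pdf, out_dir):
    """pdftk command that splits `pdf` into single-page PDFs in `out_dir`."""
    return ["pdftk", str(pdf), "burst", "output", str(Path(out_dir) / "page_%03d.pdf")]


def compare_cmd(page_a, page_b, diff_page):
    """ImageMagick compare command that writes a diff page for one page pair."""
    return ["compare", str(page_a), str(page_b), "-compose", "src", str(diff_page)]


def cat_cmd(diff_pages, output):
    """pdftk command that concatenates the diff pages into one PDF."""
    return ["pdftk", *map(str, diff_pages), "cat", "output", str(output)]


def page_pairs(dir_a, dir_b):
    """Pair up pages that exist under the same file name in both directories."""
    names = sorted(p.name for p in Path(dir_a).glob("page_*.pdf"))
    return [(Path(dir_a) / n, Path(dir_b) / n) for n in names if (Path(dir_b) / n).exists()]


def diff_pdfs(first, second, work_dir):
    """Run burst -> compare -> cat; needs pdftk, ImageMagick and Ghostscript."""
    work = Path(work_dir)
    dirs = {name: work / name for name in ("a", "b", "diff")}
    for directory in dirs.values():
        directory.mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_cmd(first, dirs["a"]), check=True)
    subprocess.run(burst_cmd(second, dirs["b"]), check=True)
    diff_pages = []
    for page_a, page_b in page_pairs(dirs["a"], dirs["b"]):
        diff_page = dirs["diff"] / page_a.name
        # compare is documented to exit with 1 when the images differ,
        # so only exit codes above 1 are treated as failures here.
        result = subprocess.run(compare_cmd(page_a, page_b, diff_page))
        if result.returncode > 1:
            raise RuntimeError(f"compare failed on {page_a.name}")
        diff_pages.append(diff_page)
    output = work / "diff_allpages.pdf"
    subprocess.run(cat_cmd(diff_pages, output), check=True)
    return output
```

Calling diff_pdfs("first.pdf", "2nd.pdf", "work") would leave the combined result in work/diff_allpages.pdf. Pages that exist in only one of the two documents are skipped in this sketch, so a difference in page count has to be checked separately.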

Ghostscript

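Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output. As a result, its PDF output is not suitable for file comparison based on MD5 hashes.

If you want to automatically detect all cases where the diff consists purely of white pages (that is, there is no visible difference between your input pages), you can convert to a metadata-free bitmap format using the bmp256 output device. You can do this for the original PDFs (first.pdf and 2nd.pdf) or for the diff PDF pages: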

 gs \
   -o diff_page_001.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
    diff_page_001.pdf

 md5sum diff_page_001.bmp

Just create an all-white BMP page and note its MD5 sum (for reference), like this:

 gs \
   -o reference-white-page.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
   -c "showpage quit"

 md5sum reference-white-page.bmp
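
Checking each diff bitmap against the reference hash can be automated. The following Python sketch is an illustration rather than part of the original answer: it hashes the files with the standard hashlib module and returns the diff pages whose MD5 equals that of the all-white reference page. The function names are assumptions, and the bitmaps are expected to have been produced with the same gs settings as above.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def blank_diff_pages(diff_bitmaps, reference_bitmap):
    """Return the diff bitmaps whose MD5 matches the all-white reference page."""
    reference = md5_of_file(reference_bitmap)
    return [path for path in diff_bitmaps if md5_of_file(path) == reference]
```

Any diff page that is not in the returned list differs from the white reference and is worth inspecting by eye.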

- Kurt Pfeifle

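Here is a script that visually compares two PDFs page by page using ImageMagick and the Poppler tools (for speed): https://gist.github.com/brechtm/891de9f72516c1b2cbc1. It writes one JPG per PDF page into a "pdfdiff" directory and additionally prints the numbers of the pages that differ between the two PDFs. - Brecht Machiels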

4

I ran into this problem myself, and the quickest solution was to use PHP and its ImageMagick bindings (Imagick).

<?php
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");

$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);

if($result[1] > 0.0){
    // Files are DIFFERENT
}
else{
    // Files are IDENTICAL
}

$im1->destroy();
$im2->destroy();

Of course, you need to install the ImageMagick bindings first:

sudo apt-get install php5-imagick # Ubuntu/Debian

- paul.ago

1

I needed to install Ghostscript. - snapshot

This is the correct solution. - Luciano Fantuzzi

0

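I developed a JAR using Apache PDFBox that can compare PDF files; it compares them pixel by pixel and highlights the differences.

See my blog for examples and a download link: http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/

Get the page count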

import com.taguru.utility.PDFUtil;

PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count

Get the page content as plain text

//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");

// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);

// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);

Extract embedded images from the PDF

//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.extractImages("c:/sample.pdf");

// extracts & saves the pdf content from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);

// extracts & saves the pdf content from page 2
pdfUtil.extractImages("c:/sample.pdf", 2, 2);

Store PDF pages as images

//set the path where we need to store the images
 pdfUtil.setImageDestinationPath("c:/imgpath");
 pdfUtil.savePdfAsImage("c:/sample.pdf");

Compare PDF files in text mode (faster, but does not compare formatting, images, etc. in the PDF)

String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesTextMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);
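
To make the idea behind a text-mode comparison concrete, here is a minimal Python sketch. It is not how PDFUtil is implemented; it only assumes that the text of each page has already been extracted into a list of strings (for example with a call such as getText above) and uses the standard difflib module to report which pages differ. The function name and the doc1/doc2 labels are illustrative.

```python
import difflib


def diff_page_texts(pages_a, pages_b):
    """Map 1-based page numbers to unified-diff lines for pages whose text differs."""
    differences = {}
    for number in range(1, max(len(pages_a), len(pages_b)) + 1):
        # A page missing from one document is treated as empty text.
        text_a = pages_a[number - 1] if number <= len(pages_a) else ""
        text_b = pages_b[number - 1] if number <= len(pages_b) else ""
        if text_a != text_b:
            differences[number] = list(difflib.unified_diff(
                text_a.splitlines(), text_b.splitlines(),
                fromfile=f"doc1 page {number}", tofile=f"doc2 page {number}",
                lineterm=""))
    return differences
```

An empty result means the extracted text is identical on every page; formatting and images are not looked at, which is the same limitation the text mode has.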

Compare PDF files in binary mode (slower; compares the PDF documents pixel by pixel, highlights the differences and stores the result as images)

String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";

// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

// compare the 3rd page alone
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);

// compare the pages from 1 to 5
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);

//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.comparePdfFilesBinaryMode(file1, file2);
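
The logic of a pixel-by-pixel comparison with highlighting can be sketched independently of any library. The Python example below is an assumption-laden illustration, not PDFUtil's actual code: each page is assumed to have been rendered elsewhere into a grid of RGB tuples (a list of rows), and the function name and the red highlight colour are made up for the example.

```python
def highlight_pixel_diff(image_a, image_b, highlight=(255, 0, 0)):
    """Compare two equally sized RGB pixel grids.

    Returns (identical, marked), where `marked` is a copy of image_b with
    every differing pixel replaced by the `highlight` colour.
    """
    if len(image_a) != len(image_b) or any(
            len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)):
        raise ValueError("pages must be rendered at the same size")
    identical = True
    marked = []
    for row_a, row_b in zip(image_a, image_b):
        marked_row = []
        for pixel_a, pixel_b in zip(row_a, row_b):
            if pixel_a == pixel_b:
                marked_row.append(pixel_b)
            else:
                identical = False
                marked_row.append(highlight)
        marked.append(marked_row)
    return identical, marked
```

Both pages have to be rendered at the same resolution for this to be meaningful; the marked grid can then be written back out as the highlighted result image.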

- vins

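When I tried to download the file, I got an error: "The transferred file contained a virus and was therefore blocked. URL: http://www.testautomationguru.com/download/304/ Media Type: application/java-vm Virus Name: McAfeeGW: BehavesLike.Java.Suspicious.xm" - scc

I tried running the jar file from the site above, but got an error like "no main manifest attribute, in taguru-pdf-util.jar". Can you help me solve this? - Nachiappan R

0

To compare PDF files on macOS Monterey (i.e. version 12), I installed diff-pdf with Homebrew and ran it. The --view option did not work for me, but --output-diff did.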
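
For scripted use, the --output-diff invocation can be wrapped as follows. This is a small Python sketch under two assumptions: that diff-pdf is on the PATH, and that it signals the result through its exit code (0 for no differences, 1 for differences), as its documentation describes. The function names and the default diff.pdf output name are made up for the example.

```python
import subprocess


def diff_pdf_cmd(first, second, output_diff):
    """Build the diff-pdf command line that writes the differences to a PDF."""
    return ["diff-pdf", f"--output-diff={output_diff}", first, second]


def pdfs_differ(first, second, output_diff="diff.pdf"):
    """Run diff-pdf and report whether the two PDFs differ."""
    result = subprocess.run(diff_pdf_cmd(first, second, output_diff))
    if result.returncode not in (0, 1):
        raise RuntimeError(f"diff-pdf failed with exit code {result.returncode}")
    return result.returncode == 1
```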

- Greg Sadetsky

- Kurt Pfeifle · Accepted Answer

7

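If you prefer a tool with a graphical user interface, you could try this one: diffpdf. It was developed by Mark Summerfield and, being written with Qt, should be available (or should be buildable) on all platforms supported by Qt. Here is a screenshot: [screenshot]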

- Kurt Pfeifle

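Can you use it from the CLI, skipping the GUI and redirecting the output straight to a file? - caw

@caw: (1) Did you see my other answer? -- (2) As far as I know, newer versions of DiffPDF can redirect output to a CSV file, but I don't know whether that bypasses the GUI completely. -- (3) There is a 'pure CLI' version called DiffPDFc available at www.qtrac.eu, but it is Windows-only. - Kurt Pfeifle

I had not tried that combination before, but I had previously tried ImageMagick, pdftk and Ghostscript. Since diffpdf's results are very good, in fact excellent, I wish all of that existing functionality could be driven from the CLI with the output redirected to a PDF. What a pity! Thanks for the information about the other versions of the tool. Unfortunately, the new versions are no longer open source, and being Windows-only is not ideal either. - caw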