Bash script to check whether PDF files have been OCR'ed

Question

linux bash pdf xpdf

3

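I'm not sure where to start with this question.

I have a Linux server with more than 8000 PDF files on it, and I need to know which of them have been OCR'ed and which have not.

I was thinking of some kind of script that calls XPDF to check the PDFs, but to be honest I'm not sure whether that is possible.

Thanks in advance for any help.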

- Grimlockz

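How can you tell whether a file has been OCR'ed? Does the process create an output file such as file1.pdf.ocr? Good luck. - shellter

This might help you: https://dev59.com/0m025IYBdhLWcg3wZlLe - potong

So you want to distinguish between real text and images that contain text? In that case, you could try pdftotext and see whether it produces any output. - ninjalj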

2 Answers

4

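Make sure you have the command-line tool pdffonts installed. (There are two versions: one ships as part of xpdf-utils, the other as part of poppler-utils.)

PDF files that consist only of scanned pages use no fonts at all (neither embedded nor unembedded).

Command line: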

pdffonts /path/to/scanned.pdf

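If the file contains no font information, this prints no font entries.

That may already be enough to sort your files into two groups.

If your PDF files mix scanned pages with "normal" pages (or with pages that were scanned and then OCR'ed), you will need to extend and refine the simple approach above. See man pdffonts or pdffonts --help for more information.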
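
One possible refinement for mixed documents, as a rough sketch rather than a tested recipe: ask pdffonts about each page separately and list the pages that use no fonts. It assumes pdfinfo is installed alongside pdffonts and that pdffonts accepts -f and -l to select the first and last page; the function names are made up for illustration, and any font row at all is counted as "has text".

```bash
#!/bin/bash
# Sketch: list the pages of a PDF that reference no fonts (likely bare scans).
# Assumes pdfinfo and pdffonts (xpdf-utils or poppler-utils) are on PATH.

# Print the page count reported by pdfinfo.
page_count() {
    pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}'
}

# Print the number of font rows pdffonts reports for one page.
# Usage: fonts_on_page FILE PAGE
fonts_on_page() {
    # -f and -l restrict the scan to a single page;
    # tail drops the two header lines, grep -c counts what is left.
    pdffonts -f "$2" -l "$2" "$1" 2>/dev/null | tail -n +3 | grep -c .
}

# Print the number of every page that uses no fonts, one per line.
# The body runs in a subshell so its variables stay local.
pages_without_fonts() (
    pdf=$1
    pages=$(page_count "$pdf")
    page=1
    while [ "$page" -le "${pages:-0}" ]; do
        if [ "$(fonts_on_page "$pdf" "$page")" -eq 0 ]; then
            echo "$page"
        fi
        page=$((page + 1))
    done
)

# Example:
# pages_without_fonts /path/to/scanned.pdf
```

This starts pdffonts once per page, so it is much slower than a single call per file and is probably best reserved for files that the simple check flags as ambiguous.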

- Kurt Pfeifle

- Nathaniel M. Beaver · Accepted Answer

The problem with pdffonts is that it sometimes lists no fonts at all, like this:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------

and sometimes it returns this:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
[none]                               Type 3            yes no  no     266  0
[none]                               Type 3            yes no  no       9  0
[none]                               Type 3            yes no  no     297  0
[none]                               Type 3            yes no  no     341  0
[none]                               Type 3            yes no  no     381  0
[none]                               Type 3            yes no  no     394  0
[none]                               Type 3            yes no  no     428  0
[none]                               Type 3            yes no  no     441  0
[none]                               Type 3            yes no  no     451  0
[none]                               Type 3            yes no  no     480  0
[none]                               Type 3            yes no  no     492  0
[none]                               Type 3            yes no  no     510  0
[none]                               Type 3            yes no  no     524  0
[none]                               Type 3            yes no  no     560  0
[none]                               Type 3            yes no  no     573  0
[none]                               Type 3            yes no  no     584  0
[none]                               Type 3            yes no  no     593  0
[none]                               Type 3            yes no  no     601  0
[none]                               Type 3            yes no  no     644  0

With that in mind, let's write a small tool that extracts all font names from a PDF:

pdffonts my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq

If your PDF has not been OCR'ed, this command prints either nothing or [none].
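
To see why, the filtering part of the pipeline can be run on captured pdffonts output instead of a real file. The sketch below wraps it in a helper called font_names (a name chosen here for illustration) and feeds it the header plus two of the nameless Type 3 rows from the listing above.

```bash
# Reduce pdffonts output (read from stdin) to a sorted list of unique font names.
font_names() {
    # tail drops the two header lines; cut keeps the first space-separated column.
    tail -n +3 | cut -d' ' -f1 | sort | uniq
}

# Sample input: the table header followed by two nameless Type 3 fonts.
font_names <<'EOF'
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
[none]                               Type 3            yes no  no     266  0
[none]                               Type 3            yes no  no       9  0
EOF
```

This prints a single line, [none], because every row collapses to the same name. With only the two header lines as input it prints nothing.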

To make it run faster, use the -l option to analyze only the first 5 pages:

pdffonts -l 5 my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq

Now put it into a bash script, for example is-pdf-ocred.sh:

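#!/bin/bash
# Collect the unique font names used in the first 5 pages of the PDF passed as $1.
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
# No fonts at all, or only nameless Type 3 fonts ([none]), is treated as "not OCR'ed".
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "NOT OCR'ed: $1"
else
    echo "$1 is OCR'ed."
fi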

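Finally, we want to be able to search for PDF files. The find command does not know about aliases or functions defined in your .bashrc, so we have to give it the path to the script. Run the following from the directory of your choice: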

find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \;
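
With thousands of files it may be worth checking several at once. The following is a sketch only: it assumes find and xargs support -print0, -0 and -P (the GNU and BSD versions do), and scan_pdfs and its default of 4 jobs are illustrative choices.

```bash
# Run a checker script on every *.pdf under a directory, several files at a time.
# Usage: scan_pdfs DIR CHECKER [JOBS]
scan_pdfs() {
    # -print0 / -0 keep file names containing spaces or newlines intact;
    # -n 1 passes one file per invocation, -P sets how many run in parallel.
    find "$1" -type f -name '*.pdf' -print0 |
        xargs -0 -n 1 -P "${3:-4}" "$2"
}

# Example, using the same script path as the find command above:
# scan_pdfs . /path/to/my/script/is-pdf-ocred.sh 4
```

Because the checks run concurrently, the order of the output lines is not predictable; pipe the result through sort if a stable order matters.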

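I am assuming that the PDF files have a .pdf extension, although that is not an assumption you can always make. You will probably want to pipe the output to less or write it to a text file: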

find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; | less
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; > pdfs.txt
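
If the .pdf extension cannot be relied on, one option is to look at the file contents instead: a PDF file begins with the signature %PDF-. The sketch below is an assumption-laden alternative to the -name filter, not part of the original answer. The helper names is_pdf and list_pdfs are made up, it ignores PDFs with junk before the header, and it does not handle file names containing newlines.

```bash
# Succeed if the file starts with the PDF signature "%PDF-".
is_pdf() {
    # tr strips NUL bytes so that binary files do not upset the shell.
    [ "$(head -c 5 "$1" 2>/dev/null | tr -d '\0')" = "%PDF-" ]
}

# Print every regular file under a directory that looks like a PDF,
# whatever its name is.
list_pdfs() {
    find "$1" -type f | while IFS= read -r file; do
        if is_pdf "$file"; then
            printf '%s\n' "$file"
        fi
    done
}

# Example:
# list_pdfs . | while IFS= read -r f; do /path/to/my/script/is-pdf-ocred.sh "$f"; done
```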

With the -l 5 flag, I can get through about 200 PDF files in a little over 10 seconds.
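
As a cross-check on the font heuristic, the pdftotext suggestion from the comments can be turned into a similar test. This is a different technique from the script above, sketched under the assumption that pdftotext (xpdf or poppler) is installed, that "-" as the output file means stdout, and that -l limits the last page read. The function names are hypothetical.

```bash
# Succeed if pdftotext extracts any non-whitespace text from the first pages.
# Usage: has_text_layer FILE [PAGES]
has_text_layer() {
    # "-" sends the extracted text to stdout; -l sets the last page to examine.
    # Page breaks (form feeds) and blank lines do not count as text.
    pdftotext -l "${2:-5}" "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Report one file in the same style as is-pdf-ocred.sh.
report_text_layer() {
    if has_text_layer "$1"; then
        echo "$1 is OCR'ed."
    else
        echo "NOT OCR'ed: $1"
    fi
}

# Example:
# report_text_layer my-doc.pdf
```

Files on which the two checks disagree are probably the ones worth inspecting by hand.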