Is it possible to get the bounding boxes for each word with Python?

Question


I know that

pdftotext -bbox foobar.pdf

creates an HTML file which contains content like

<word xMin="301.703800" yMin="104.483700" xMax="309.697000" yMax="115.283700">is</word>
<word xMin="313.046200" yMin="104.483700" xMax="318.374200" yMax="115.283700">a</word>
<word xMin="321.603400" yMin="104.483700" xMax="365.509000" yMax="115.283700">universal</word>
<word xMin="368.858200" yMin="104.483700" xMax="384.821800" yMax="115.283700">file</word>
<word xMin="388.291000" yMin="104.483700" xMax="420.229000" yMax="115.283700">format</word>

Hence every word has a bounding box.
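
For illustration, the word elements shown above could be read back into Python roughly as follows. This is only a minimal sketch: the `Word` tuple, `WORD_RE`, and `parse_words` are hypothetical names, and the regular expression assumes the attributes always appear in the order shown in the sample. The input here is two lines copied from the sample; in practice it would be the contents of the HTML file that pdftotext writes.

```python
import html
import re
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Matches one <word> element, assuming the attribute order of the sample.
WORD_RE = re.compile(
    r'<word xMin="(-?[\d.]+)" yMin="(-?[\d.]+)" '
    r'xMax="(-?[\d.]+)" yMax="(-?[\d.]+)">(.*?)</word>'
)


def parse_words(bbox_html: str) -> List[Word]:
    """Extract every <word> element together with its bounding box."""
    return [
        Word(html.unescape(text), float(x0), float(y0), float(x1), float(y1))
        for x0, y0, x1, y1, text in WORD_RE.findall(bbox_html)
    ]


# Two lines copied from the sample output above.
sample = (
    '<word xMin="301.703800" yMin="104.483700" '
    'xMax="309.697000" yMax="115.283700">is</word>\n'
    '<word xMin="313.046200" yMin="104.483700" '
    'xMax="318.374200" yMax="115.283700">a</word>\n'
)

words = parse_words(sample)
for word in words:
    print(word)
```

This works, but it means shelling out to an external tool and parsing its output, which is what I would like to avoid.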

In contrast, the Python package PDFminer seems to provide only the position of blocks of text (see example).

How can I get the bounding boxes of each word in Python?

- Martin Thoma

@KJ What do you mean? - Martin Thoma

PyPDF2 can now do this using visitor functions: https://pypdf2.readthedocs.io/en/latest/user/extract-text.html - Martin Thoma

1 Answer


- Joris Schellekens · Accepted Answer

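Disclaimer: I am the author of borb, the package used in this answer.

To get bounding boxes for words, you need to do some kind of processing. The problem is that a PDF (in the worst case) contains only rendering instructions and no structural information.

Simply put, your PDF might contain the following (in pseudocode):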

Move to 90, 700
Set the active font to Helvetica, size 12
Set the active color to black
Render "Hello World" in the active font

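The problem is that the text passed to the render instruction (the fourth one above) could be anything from

a single letter
multiple letters
a single word
up to multiple words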

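To retrieve the bounding boxes of words, you need to do some processing (as mentioned earlier). You need to render these instructions and split the text (preferably while rendering) into words.

Then it is just a matter of keeping track of the coordinates of the turtle (the current drawing position), and you're good to go.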
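
To make the idea concrete, here is a toy sketch of that bookkeeping. It is not how borb is implemented: `WordBox` and `split_into_word_boxes` are hypothetical names, every character is assumed to advance the cursor by the same fixed width (`glyph_width * font_size`), and the box height is simply taken to be the font size above the baseline. A real renderer would use the glyph widths from the font and account for the current transformation matrix.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordBox:
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def split_into_word_boxes(
    text: str, x: float, y: float, font_size: float, glyph_width: float = 0.5
) -> List[WordBox]:
    """Walk through `text`, moving a cursor (the "turtle") per character."""
    # Assumption: every glyph, including the space, has the same width.
    advance = glyph_width * font_size
    boxes: List[WordBox] = []
    cursor = x          # current horizontal position of the turtle
    word_start = x      # where the word being collected started
    current = ""        # characters of the word being collected
    for ch in text:
        if ch == " ":
            if current:
                # A space closes the current word; its box ends at the cursor.
                boxes.append(WordBox(current, word_start, y, cursor, y + font_size))
                current = ""
        else:
            if not current:
                word_start = cursor
            current += ch
        cursor += advance
    if current:
        boxes.append(WordBox(current, word_start, y, cursor, y + font_size))
    return boxes


# The render instruction from the pseudocode above.
boxes = split_into_word_boxes("Hello World", x=90, y=700, font_size=12)
for box in boxes:
    print(box)
```

Because the cursor position is carried along explicitly, the same loop could keep running across several render instructions, which matters when a single word is split over more than one of them.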

borb does this for you behind the scenes.

from borb.pdf import PDF
from borb.toolkit import RegularExpressionTextExtraction

# this line builds a RegularExpressionTextExtraction
# this class listens to rendering instructions
# and performs the logic I mentioned in the text part of this answer
extractor: RegularExpressionTextExtraction = RegularExpressionTextExtraction("[^ ]+")

# now we can load the file and perform our processing
with open("input.pdf", "rb") as fh:
    PDF.loads(fh, [extractor])

# now we just need to get the boxes out of it
# RegularExpressionTextExtraction returns a list of type PDFMatch
# this class can return a list of bounding boxes (should your
# regular expression ever need to be matched over separate lines of text)
for m in extractor.get_matches_for_page(0):
    # here we just print the Rectangle
    # but feel free to do something useful with it
    print(m.get_bounding_boxes()[0])
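
If you want numbers in the xMin/yMin/xMax/yMax form from the question, a rectangle can be converted along these lines. This is a sketch under stated assumptions: the rectangle is passed as plain `x`, `y`, `width`, `height` values with the origin in the lower-left corner (the default PDF user space), `to_corners` is a hypothetical helper, and `page_height` is only needed if the target format measures y from the top of the page, which I believe the pdftotext output does, so check that against your own files.

```python
from typing import Optional, Tuple


def to_corners(
    x: float,
    y: float,
    width: float,
    height: float,
    page_height: Optional[float] = None,
) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) for a lower-left-origin rectangle."""
    x_min, x_max = x, x + width
    if page_height is None:
        # Keep the PDF convention: y grows upwards from the bottom of the page.
        return (x_min, y, x_max, y + height)
    # Flip the y axis so that y grows downwards from the top of the page.
    return (x_min, page_height - (y + height), x_max, page_height - y)


# Hypothetical rectangle on a page that is 792 points high.
print(to_corners(90, 700, 66, 12))
print(to_corners(90, 700, 66, 12, page_height=792))
```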

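borb is an open-source, pure Python PDF library for creating, modifying, and reading PDF documents. You can install it with:

pip install borb

Alternatively, you can build from source by forking or downloading the GitHub repository.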