How can I get the position of specific text using xpdf or mupdf?

Question

pdf, text-extract, mupdf, xpdf

3

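I want to extract specific text, together with its position, from a PDF file.

I know that xpdf and mupdf can parse PDF files, so I think they can help with this task.

But how do I use these two libraries to get the text position?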

- PDF1001

What do you mean by text position? - Dan D.

@DanD. By text position I mean the position of the first character within the page. - PDF1001

2 Answers

1

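MuPDF ships with several tools, one of which is pdfdraw.

If you run pdfdraw with the -tt option, it generates XML containing every character together with its exact position.
From there, you should be able to find what you need.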

- Robert

In newer versions it is called mudraw.c, and following its trail leads to structured-text.h and stext-output.c. Very helpful, thanks. - Andrew

- Jorj McKie · Accepted Answer

If you don't mind using a Python binding for MuPDF, here is a Python solution using PyMuPDF (I am one of its developers):

import json                     # needed for json.loads() below
import fitz                     # the PyMuPDF module

doc = fitz.open("input.pdf")    # PDF input file
n = 0                           # page number (0-based); adjust as needed
page = doc[n]
# List of all words on the page, each with its position info
# (a rectangle containing the word).
# Note: recent PyMuPDF releases spell these calls page.get_text("words"),
# page.get_text("blocks") and page.get_text("json").
wordlist = page.getTextWords()

# Or, if you are only interested in blocks of lines belonging together:
blocklist = page.getTextBlocks()

# If you need yet more details, use the JSON-based output, which also gives
# images and their positions, as well as font information for the text.
tdict = json.loads(page.getText("json"))

If you are interested, we are on GitHub.
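
Since the question asks for the position of one specific piece of text, here is a minimal sketch of how a word list of this kind could be searched. It assumes each entry is a tuple of the form (x0, y0, x1, y1, text, block_no, line_no, word_no), which is the layout PyMuPDF uses for its word lists. The helper name find_phrase and the sample data are hypothetical and only serve as illustration.

```python
from typing import List, Sequence, Tuple

# One entry of the word list: (x0, y0, x1, y1, text, block_no, line_no, word_no).
Word = Tuple[float, float, float, float, str, int, int, int]
Rect = Tuple[float, float, float, float]


def find_phrase(words: Sequence[Word], phrase: str) -> List[Rect]:
    """Return one bounding box per occurrence of `phrase` in `words`.

    Matching is case-sensitive and compares whole whitespace-separated
    words in list order. A phrase that wraps across lines yields a box
    spanning both lines.
    """
    tokens = phrase.split()
    if not tokens:
        return []
    hits: List[Rect] = []
    for i in range(len(words) - len(tokens) + 1):
        window = words[i:i + len(tokens)]
        if [w[4] for w in window] == tokens:
            # The union of the word rectangles covers the whole phrase.
            hits.append((
                min(w[0] for w in window),
                min(w[1] for w in window),
                max(w[2] for w in window),
                max(w[3] for w in window),
            ))
    return hits


# Hypothetical sample data in the same tuple layout, so the sketch
# runs without a PDF file.
sample_words: List[Word] = [
    (72.0, 100.0, 110.0, 112.0, "Total", 0, 0, 0),
    (114.0, 100.0, 160.0, 112.0, "amount:", 0, 0, 1),
    (164.0, 100.0, 190.0, 112.0, "42", 0, 0, 2),
]
print(find_phrase(sample_words, "Total amount:"))
# prints [(72.0, 100.0, 160.0, 112.0)]
```

With a real document, the word list would come from the page object (page.getTextWords() in older releases, page.get_text("words") in recent ones). The top-left corner (x0, y0) of each returned box is then the position of the phrase's first character. Recent PyMuPDF releases also offer page.search_for(text), which returns matching rectangles directly and may be simpler if no custom matching is needed.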