Extracting keywords from PDF metadata in Python

Question

I have a PDF file and want to get some information from its metadata. To do so, I do the following:

from PyPDF2 import PdfFileReader    
mypath = "your_pdf_file.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()

For the document at hand, the output is:

Out[230]: 
{'/CrossmarkDomainExclusive': 'true',
 '/CreationDate': "D:20181029074117+05'30'",
 '/CrossMarkDomains#5B2#5D': 'elsevier.com',
 '/Author': 'Nicola Gennaioli',
 '/Creator': 'Elsevier',
 '/ElsevierWebPDFSpecifications': '6.5',
 '/Subject': 'Journal of Monetary Economics, 98 (2018) 98-113. doi:10.1016/j.jmoneco.2018.04.011',
 '/CrossmarkMajorVersionDate': '2010-04-23',
 '/CrossMarkDomains#5B1#5D': 'sciencedirect.com',
 '/robots': 'noindex',
 '/ModDate': "D:20181029074135+05'30'",
 '/AuthoritativeDomain#5B1#5D': 'sciencedirect.com',
 '/Keywords': 'Sovereign Risk; Sovereign Default; Government Bonds',
 '/doi': '10.1016/j.jmoneco.2018.04.011',
 '/Title': 'Banks, government Bonds, and Default: What do the data Say?',
 '/AuthoritativeDomain#5B2#5D': 'elsevier.com',
 '/Producer': 'Acrobat Distiller 10.1.10 (Windows)'}

However, I found that the PyPDF2 library has no attribute for "accessing" the information in the /Keywords entry. That is, this part of the output:

'/Keywords': 'Sovereign Risk; Sovereign Default; Government Bonds',

So I would like help on how to get this piece of the metadata output [in this example: Sovereign Risk; Sovereign Default; Government Bonds].

To reproduce the output, I am sharing a link to the document.

Update:

For example, doing:

print(pdf_info.title)
Banks, government Bonds, and Default: What do the data Say?

print(pdf_info.subject)
Journal of Monetary Economics, 98 (2018) 98-113. doi:10.1016/j.jmoneco.2018.04.011

But when I try the same for the /Keywords entry, I get the following attribute error:

pdf_info.keywords
Traceback (most recent call last):

  File "<ipython-input-295-3852401ef983>", line 1, in <module>
    pdf_info.keywords

AttributeError: 'DocumentInformation' object has no attribute 'keywords'

- msh855

Your link requires access permission. - ashutosh singh

What do you mean by "access" (in quotes)? I can see the /Keywords entry right there. - Jongware

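If I do pdf_toread.title I can get the title, but when I do pdf_toread.keywords I get an error saying the attribute does not exist. I checked, and the PyPDF2 authors indeed have not written code to get the keywords the way you can get the title or the subject. - msh855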

Can you explain what you mean by keywords? Do you want the title or keywords from the whole text? Please give some example keywords. - ashutosh singh

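Please check my update. I think my request is clear. Basically, I am reading the PDF's metadata as shown in my question; it comes as a PDF document object, and the output is as shown in the question. There you can see the title, the author, and an entry called "Keywords"; in the output I show '/Keywords': 'Sovereign Risk; Sovereign Default; Government Bonds'. While I can get the title via pdf_info.title, I cannot get the keywords, even though the output above (the metadata) clearly shows they are there. - msh855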

1 Answer

- Jongware · Accepted Answer

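The /Keywords key is actually present in the dictionary returned by getDocumentInfo, so you don't need to do anything special (other than first testing that it exists, or wrapping the lookup in a try, in case another file does not have it):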

from PyPDF2 import PdfFileReader    
mypath = "../Downloads/banks_gov_bonds_default.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()
if '/Keywords' in pdf_info:
    print (pdf_info['/Keywords'])

prints

Sovereign Risk; Sovereign Default; Government Bonds

which are indeed the keywords stored in the internal metadata field of your sample PDF file.
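The same lookup can also be written with dict.get, and the semicolon-separated string split into a list. This is only a minimal sketch: it assumes the metadata behaves like a plain dict (a hand-written dict stands in for pdf_info here) and that the keywords are separated by semicolons, as in this file; other PDFs may use commas or something else. split_keywords is a hypothetical helper name, not part of PyPDF2.

```python
def split_keywords(info, key="/Keywords", sep=";"):
    """Return the keywords stored under `key` as a list of stripped strings.

    `info` is any dict-like object (for example the result of
    getDocumentInfo()); a missing or empty entry gives an empty list.
    """
    raw = info.get(key)  # None when the entry is absent
    if not raw:
        return []
    return [part.strip() for part in str(raw).split(sep) if part.strip()]


# Stand-in for pdf_info, copied from the output shown in the question.
sample_info = {
    "/Title": "Banks, government Bonds, and Default: What do the data Say?",
    "/Keywords": "Sovereign Risk; Sovereign Default; Government Bonds",
}

print(split_keywords(sample_info))
print(split_keywords({"/Title": "no keywords here"}))
```

With a real file, pdf_info from the code above would be passed in place of sample_info.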

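Another option is to add keywords to the PDF properties by editing pdf.py (located in the PyPDF2 folder, wherever your pip put it). In my version, the code that creates title, creator, author, and a few other properties is in the class DocumentInformation, around line 2781. They are all created following the same simple scheme, so adding one is no problem: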

keywords = property(lambda self: self.getText("/Keywords"))
"""Read-only property accessing the document's **keywords**.
Returns a unicode string (``TextStringObject``)
or ``None`` if the keywords are not specified."""
keywords_raw = property(lambda self: self.get("/Keywords"))
"""The "raw" version of keywords; can return a ``ByteStringObject``."""

I added keywords_raw because the other properties have a raw version as well, though I can't immediately say what it would be used for.

After this, your code actually works:

from PyPDF2 import PdfFileReader    
mypath = "../Downloads/banks_gov_bonds_default.pdf"
pdf_toread = PdfFileReader(open(mypath, 'rb'))
pdf_info = pdf_toread.getDocumentInfo()
print (pdf_info.keywords)

giving, again, the result:

Sovereign Risk; Sovereign Default; Government Bonds
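An edit to the installed pdf.py is lost when the package is reinstalled or upgraded, so it may be preferable to attach the property at runtime instead. The sketch below only illustrates the idea on a hypothetical stand-in class, FakeDocumentInformation, with a simplified getText; it is not PyPDF2 code, and whether the same assignment works on the real DocumentInformation class depends on the installed version.

```python
class FakeDocumentInformation(dict):
    """Hypothetical stand-in for a dict-based metadata class."""

    def getText(self, key):
        # Simplified: return the value as text, or None if the key is absent.
        value = self.get(key)
        return None if value is None else str(value)

    # A property the class already ships with, following the same scheme.
    title = property(lambda self: self.getText("/Title"))


# Attach the missing property after the class has been defined,
# instead of editing the library source file.
FakeDocumentInformation.keywords = property(
    lambda self: self.getText("/Keywords")
)

info = FakeDocumentInformation({
    "/Title": "Banks, government Bonds, and Default: What do the data Say?",
    "/Keywords": "Sovereign Risk; Sovereign Default; Government Bonds",
})

print(info.title)
print(info.keywords)
```

Because a property assigned to a class is looked up through the class, instances created before or after the assignment both gain the new attribute.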