How to extract text from an existing docx file using python-docx

Question

86

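I'm trying to use the python-docx module (pip install python-docx), but I'm confused: the test sample in the GitHub repository uses the opendocx function, while the readthedocs documentation uses the Document class. And both only show how to add text to a docx file, not how to read an existing one.

The first option, opendocx, doesn't work and may be deprecated. For the second, I tried: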

from docx import Document

document = Document('test_doc.docx')
print(document.paragraphs)

It returns a list of <docx.text.Paragraph object at 0x... > objects.

Then I did:

for p in document.paragraphs:
    print(p.text)

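This returns all the text, but something is missing: none of the URLs (the Ctrl+click hyperlinks) appear in the text printed to the console.

What is going wrong? Why are the URLs missing?

How can I get the complete text without iterating in a loop (something like open().read())?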

- Nancy

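Note that the old GitHub repository https://github.com/mikemaccana/python-docx says "This project has moved!" in its top-level heading. - mikemaccana

Also, none of the numbered lists are exported to the text... - robob

6 Answers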

26

You can use python-docx2txt, which is adapted from python-docx but can also extract text from links, headers, and footers. It can extract images as well.

- Ankush Shah

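This is useful code, but it doesn't export numbered lists. - robob

Thanks. Here is the tracking issue for this bug. - Ankush Shah

A more up-to-date version is available in this package: https://github.com/ShayHill/docx2python - Roland Pihlakas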

17

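No need to install python-docx

A docx file is actually a zip archive containing several files and folders. At the link below you can find a simple function that extracts the text from a docx file without depending on python-docx or lxml, the latter of which can sometimes be hard to install: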

http://etienned.github.io/posts/extract-text-from-word-docx-simply/
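
As a rough illustration of that idea (this is not the code from the linked post), the body text lives in word/document.xml inside the archive, so the standard library alone can read it. The function name docx_to_text is made up for this sketch, and it ignores tabs, line breaks, numbering, headers, and footers.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}tag" form
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(W_NS + "p"):
        # Gather every w:t below the paragraph, including text nested
        # inside hyperlinks or smart tags.
        lines.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(lines)
```

Calling docx_to_text('test_doc.docx') returns a single string, which is close to the open().read() style the question asks about.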

- imanzabet

I ran your code and got the error "zipfile.BadZipFile: File is not a zip file". Why is that? - John Smith

This code has worked for me before. Could you upload your docx file and share a link so I can test it? - imanzabet
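
The commenter's file isn't available, so the cause can only be guessed at. One common trigger for this error is a file that isn't a zip-based package at all, for example a legacy binary .doc saved with a .docx extension. A quick sanity check might look like this (looks_like_docx is a hypothetical helper name):

```python
import zipfile


def looks_like_docx(path_or_file):
    """Return True if the input is a zip archive containing word/document.xml."""
    # is_zipfile() returns False instead of raising BadZipFile.
    if not zipfile.is_zipfile(path_or_file):
        return False
    with zipfile.ZipFile(path_or_file) as archive:
        return "word/document.xml" in archive.namelist()
```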

2

This still works, but .getiterator() was deprecated and then removed in Python 3.9, so it must now be replaced with .iter() https://docs.python.org/3.9/whatsnew/3.9.html#removed - Ping Lu
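
The linked function isn't reproduced here, but the change amounts to a single call. A minimal sketch, with iter_text_nodes as an illustrative name:

```python
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def iter_text_nodes(xml_source):
    """Yield the text of every w:t element in an XML string or bytes object."""
    root = ET.fromstring(xml_source)
    # Element.iter() is the replacement for the removed Element.getiterator().
    for node in root.iter(W_NS + "t"):
        yield node.text or ""
```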

8

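There are two "generations" of python-docx. The original one ended with the 0.2.x releases, and the "new" one started at v0.3.0. The new version is a complete, object-oriented rewrite of the old one. It has its own separate repository, located here.

The opendocx() function is part of the legacy API, while the documentation covers the new version. The legacy version has no documentation to speak of.

The current version doesn't support reading or writing hyperlinks. That capability is on the roadmap, and the project is under active development. Word has a great many features, so the API ends up being quite broad. We'll get to it, but probably not within the next month unless someone decides to focus on that area and contribute it. UPDATE: Hyperlink support has been added since this answer was written.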
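
For anyone stuck on a release that predates hyperlink support, the links can be read straight from the package. This is a sketch under stated assumptions, not part of python-docx: in a typical file each w:hyperlink element carries an r:id, and the target URL is stored in word/_rels/document.xml.rels. The function name list_hyperlinks is hypothetical, and links stored as HYPERLINK field codes are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def list_hyperlinks(path_or_file):
    """Return (link text, target) pairs; target is None for internal anchors."""
    with zipfile.ZipFile(path_or_file) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))

    # Map relationship ids (rId1, rId2, ...) to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter("{%s}Relationship" % PKG_REL)
    }

    links = []
    for link in body.iter("{%s}hyperlink" % W):
        text = "".join(t.text or "" for t in link.iter("{%s}t" % W))
        rel_id = link.get("{%s}id" % R)  # absent on bookmark-only links
        links.append((text, targets.get(rel_id)))
    return links
```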

- scanny

Has this been fixed in the latest version? It's hard to tell from GitHub. - acutesoftware

7

With python-docx, as @Chinmoy Panda's answer shows:

for para in doc.paragraphs:
    fullText.append(para.text)

However, para.text loses the text inside w:smartTag elements (the corresponding GitHub issue is https://github.com/python-openxml/python-docx/issues/328), so you should use the following function instead:

def para2text(p):
    rs = p._element.xpath('.//w:t')
    return u" ".join([r.text for r in rs])
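
To make the difference concrete, here is a standard-library sketch that compares the two strategies on a raw paragraph element. It assumes, as the linked issue suggests, that the lost text sits in runs that are not direct children of the paragraph. Both function names are invented for this example.

```python
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def direct_run_text(p):
    """Text from w:r elements that are direct children of the paragraph only."""
    return "".join(
        t.text or "" for r in p.findall(W_NS + "r") for t in r.iter(W_NS + "t")
    )


def all_descendant_text(p):
    """Text from every w:t below the paragraph, however deeply it is nested."""
    return "".join(t.text or "" for t in p.iter(W_NS + "t"))
```

Note that joining with a space, as para2text does, can put a space in the middle of a word when Word splits that word across several runs. Joining with an empty string avoids this.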

- Xing Shi

0

There doesn't appear to be an official solution, but a workaround has been posted here: https://github.com/savoirfairelinux/python-docx/commit/afd9fef6b2636c196761e5ed34eb05908e582649

Just update this file: "...\site-packages\docx\oxml\__init__.py"

# add
import re
import sys

# add
def remove_hyperlink_tags(xml):
    if (sys.version_info > (3, 0)):
        xml = xml.decode('utf-8')
    xml = xml.replace('</w:hyperlink>', '')
    xml = re.sub('<w:hyperlink[^>]*>', '', xml)
    if (sys.version_info > (3, 0)):
        xml = xml.encode('utf-8')
    return xml
    
# update
def parse_xml(xml):
    """
    Return root lxml element obtained by parsing XML character string in
    *xml*, which can be either a Python 2.x string or unicode. The custom
    parser is used, so custom element classes are produced for elements in
    *xml* that have them.
    """
    root_element = etree.fromstring(remove_hyperlink_tags(xml), oxml_parser)
    return root_element

Of course, don't forget to note in your documentation that you are modifying the official library.
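
To see what the patch does without touching site-packages, the same two substitutions can be tried on a fragment of document XML. This sketch works on str rather than bytes, and strip_hyperlink_wrappers is a hypothetical name:

```python
import re


def strip_hyperlink_wrappers(xml_text):
    """Drop <w:hyperlink> start and end tags but keep the runs they enclose."""
    xml_text = xml_text.replace("</w:hyperlink>", "")
    return re.sub(r"<w:hyperlink[^>]*>", "", xml_text)


sample = '<w:p><w:hyperlink r:id="rId1"><w:r><w:t>site</w:t></w:r></w:hyperlink></w:p>'
print(strip_hyperlink_wrappers(sample))
```

Once the wrapper is gone, the run becomes a direct child of the paragraph. The r:id attribute is discarded along with the tag, so the link target can no longer be looked up afterwards.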

- Andrey Mazur

- Chinmoy Panda · Accepted Answer

74

You can try this:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

- Chinmoy Panda

16

This is a good start, but it doesn't capture the text in tables, headers, footers, and footnotes. - guerda
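
One possible way to reach some of that text without extra packages is to read the other XML parts of the archive. The sketch below assumes the conventional part names (word/header1.xml, word/footer1.xml, word/footnotes.xml, and so on), which a given file is not guaranteed to use. The name extra_text_by_part is made up for this example.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
# Conventional names of parts that hold text outside the main body (assumed).
EXTRA_PARTS = re.compile(r"word/(header|footer|footnotes|endnotes)\d*\.xml$")


def extra_text_by_part(path_or_file):
    """Map each header, footer, or notes part to its text, one line per paragraph."""
    result = {}
    with zipfile.ZipFile(path_or_file) as archive:
        for name in sorted(archive.namelist()):
            if not EXTRA_PARTS.match(name):
                continue
            root = ET.fromstring(archive.read(name))
            paragraphs = [
                "".join(t.text or "" for t in p.iter(W_NS + "t"))
                for p in root.iter(W_NS + "p")
            ]
            result[name] = "\n".join(paragraphs)
    return result
```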

6

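Consider the simplify-docx tool, which is built on python-docx and greatly reduces the complexity of the XML while preserving the document structure (paragraphs, tables, headers, footers, etc.). - Jthorpe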

7

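How is this different from what the asker is already doing? In fact it's even worse, because it builds a silly, useless list instead of a piece of text! And I see that 59 people upvoted this answer!! They should actually have downvoted it! (I didn't downvote, because I never do. I prefer to explain why a reply like this is really bad!) - Apostolos

Indeed, this just confirms that the problem is hard to solve. - Jean-François Fabre

A fun one-liner: '\n'.join([p.text for p in doc.paragraphs]) - Matt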