Extracting text from MS Word files in Python

Question

python linux ms-word

31

For working with MS Word files in Python there are the Python Win32 extensions, which can be used on Windows. How do I do the same on Linux? Is there a library for it?

- Badri

Can you define "working with"? Is it read-only, or does it include writing as well? - Mawg says reinstate Monica

15 Answers

22

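You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping the text out of a Word document. It works pretty well for simple documents (obviously it loses the formatting). It is available through apt, probably as an RPM too, or you could compile it yourself.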

- John Fouhy

1

antiword can also convert Word documents to DocBook XML, which preserves (at least some of) the formatting. - Marius Gedminas

20

benjamin's answer is a pretty good one. I have just consolidated it...

import zipfile, re

docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)

- Chad

3

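I should mention that this only works for docx (Word 2007 or newer). For .doc files, wvware is the way to go. Depending on your environment it can be a bit of a pain to set up, but it does a very good job. - Chad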

3

To remove XML entities such as &amp; from the text:

from xml.sax.saxutils import unescape
text = unescape(cleaned)

- Jesvin Jose

1

The .decode('utf-8') in content = docx.read('word/document.xml').decode('utf-8') is required; without it the cleanup step fails with: TypeError: cannot use a string pattern on a bytes-like object - me_astr

11

OpenOffice.org can be scripted with Python: see here.

Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.

- Dan

10

My experience (OO 2.0 to 3.0) is that it is close to perfect, but not quite flawless. - SpliFF

6

Let me translate that: it opens MS Word N files about as flawlessly as MS Word N+1 does, and in my opinion far better than MS Word N-1 does. - Esteban Küber

7

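I know this is an old question, but I was recently trying to find a way to extract text from MS Word files, and the best solution I have found so far is wvLib: http://wvware.sourceforge.net/ After installing the library, using it from Python is pretty easy: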

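import subprocess

# wvText writes the extracted text into output_txt_file
subprocess.run(['wvText', word_file, output_txt_file], check=True)
# Read the text file that wvText produced
with open(output_txt_file) as f:
    out = f.read()

And that's it. Basically, we call wvText through subprocess.run to extract the text from the Word document into a file, and then read that file back. After that, all the text of the Word document is in the out variable, ready to use.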

Hopefully this will help anyone who runs into a similar problem in the future.

- Dave

4

To read Word 2007 and later files, including .docx files, you can use the python-docx package:

from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')

To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:

sudo apt-get install antiword

Then just call it from your Python script:

import os
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
os.system('antiword %s > %s' % (input_word_file, output_text_file))
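
A variant of the call above that avoids the shell (and so copes with spaces in file names) and captures the text directly instead of writing an intermediate file might look like the sketch below. The helper name doc_to_text and its tool parameter are hypothetical, introduced only for illustration; the sketch assumes antiword prints the extracted text to standard output when given a file name, as in the command above.

```python
import shutil
import subprocess


def doc_to_text(path, tool="antiword"):
    """Return the text that `tool` prints for the given .doc file.

    `doc_to_text` and `tool` are illustrative names, not part of any library.
    """
    # Fail early with a clear message if the converter is not installed.
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not installed or not on PATH")
    # Passing a list (no shell) keeps odd file names from being misparsed.
    result = subprocess.run([tool, path], capture_output=True, check=True)
    # The output encoding depends on the tool's settings; replace bad bytes.
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be along the lines of text = doc_to_text("input_file.doc"); a non-zero exit status from the converter surfaces as subprocess.CalledProcessError.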

- Antoine Dusséaux

4

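(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously, to use this in a Qt program you would have to spawn a process for it, but the command line I have hacked together is: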

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

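So:

unzip -p file.docx: -p == "extract files to standard output"

grep '<w:t': grab just the lines containing '<w:t' (as far as I can tell, <w:t> is the Word 2007 XML element for "text")

sed 's/<[^<]*>//g': strip the tags, keeping only the text between them

grep -v '^[[:space:]]*$': remove blank lines

There is likely a more efficient way to do this, but it seems to work on the few documents I have tested it with.

As far as I know, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despite being a bit of an ugly hack ;)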

- Ben Williams

4

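Take a look at "how the doc format works" and "create Word documents using PHP in Linux". The former is especially useful. Abiword is my recommended tool. It has some limitations, though:

If the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.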

- Swati

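It's not just that! Even the most basic text saved in Word 97 format is nearly impossible to get at easily without relying on Word (COM). Most Word documents are not HTML! - William Keller

Abiword does not assume the document is HTML, and given how broad the tool is... I don't think implementing it was "easy". Abiword is a tool that helps you read MS Word files... since the asker is concerned with text retrieval, that is sufficient. - Swati

Ah, I always thought abiword was just another word processor! Gosh, knowing this earlier would have saved me some trouble. - William Keller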

4

Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv

- fccoelho

4

If your intention is to use purely Python modules without calling a subprocess, you can use the zipfile Python module.

import zipfile

content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# List the entries in the archive
unpacked = docx.infolist()
# Find the word/document.xml file in the package and assign it to variable
for item in unpacked:
    if item.orig_filename == 'word/document.xml':
        # read() returns the raw XML as bytes
        content = docx.read(item.orig_filename)

Your content string needs to be cleaned up, however. One way of doing this is:

# Clean the content string from xml tags for better search
fullyclean = []
# content is bytes in Python 3, so decode it before splitting on a str
halfclean = content.decode('utf-8').split('<')
for item in halfclean:
    if '>' in item:
        bad_good = item.split('>')
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])
        else:
            pass
    else:
        pass

# Assemble a new string with all pure content
content = " ".join(fullyclean)

However, there is surely a more elegant way to clean up the string, probably using the re module. Hope this helps.
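
One possible alternative to splitting on angle brackets is to let an XML parser do the work. The sketch below uses only the standard library (zipfile and xml.etree.ElementTree); the function name docx_text is hypothetical, and the code assumes the usual WordprocessingML layout in which each paragraph is a w:p element whose visible text sits in w:t elements. It is a minimal sketch rather than a full extractor: headers, footers and footnotes live in other parts of the package, and paragraphs nested inside text boxes may be reported twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the main document part of a .docx file
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    `docx_text` is an illustrative name, not part of any library.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join all text runs of the paragraph; the parser has already
        # turned entities such as &amp; back into plain characters.
        runs = (node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Because the parser decodes the XML itself, no separate unescape step is needed with this approach.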

- benjamin

1

To remove XML entities such as &amp; from the text:

from xml.sax.saxutils import unescape
text = unescape(content)
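
The order of the two cleanup steps matters: if entities are unescaped first, an escaped &lt;b&gt; in the document text turns into something that looks like a tag and is then stripped along with the real markup. A small sketch of the safer order, with a hypothetical helper name and a simple tag pattern chosen only for illustration:

```python
import re
from xml.sax.saxutils import unescape


def strip_tags_then_unescape(xml_text):
    """Remove markup first, then decode &amp;, &lt; and &gt;.

    `strip_tags_then_unescape` is an illustrative name.
    """
    # Drop anything that looks like <...>; escaped text is left untouched.
    without_tags = re.sub(r"<[^>]+>", "", xml_text)
    # Only now turn the entities back into literal characters.
    return unescape(without_tags)
```

For example, strip_tags_then_unescape("<w:t>a &lt;b&gt; &amp; c</w:t>") keeps the literal <b> in the result.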

- Jesvin Jose

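With the re module the cleanup can be much easier: stripped_content = re.compile(b'<.*?>').sub(b' ', content) # strip tags. One thing I don't understand in your code: why didn't you use a break inside the if block in the earlier snippet? - Vikas Prasad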


- mikemaccana · Accepted Answer

Use the native Python docx module. Here's how to extract all the text from a document:

import docx

document = docx.Document(filename)
docText = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)
print(docText)

See the Python DocX site.

Also check out Textract, which can pull out tables and the like.

Parsing XML with regular expressions summons Cthulhu. Don't do it!