Extracting images from a Word document using Python

Question

4

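How can I use Python to extract images/logos from a Word document and store them in a folder? The code below converts the docx to HTML, but it does not extract the images from the HTML. Any pointers or suggestions would be a great help.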

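Original answer:

The snippet below uses mammoth to convert the .docx file to HTML. By default, mammoth embeds each image inline as a base64 data URI in the src attribute of an img tag, so the images can be pulled out of the HTML with a regular expression, base64-decoded back into image bytes, and saved to a folder. The conversion code: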

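    import mammoth

    profile_path = "<file path>"    # path to the .docx file
    profile_html = "<output path>"  # HTML file to write (undefined in the original snippet)

    # mammoth expects a binary file object, not a path string
    with open(profile_path, 'rb') as f:
        document = mammoth.convert_to_html(f)

    with open(profile_html, 'wb') as b:
        b.write(document.value.encode('utf8'))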
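
The decoding step described above can be sketched as follows. This is a minimal illustration, not part of the original post: it assumes the HTML contains images as `src="data:image/...;base64,..."` attributes, and the function name `save_inline_images` and the `imageN` file naming are made up for the example.

```python
import base64
import re
from pathlib import Path

# Matches src="data:image/<type>;base64,<payload>" attributes of inline images.
DATA_URI_RE = re.compile(
    r'src="data:image/(?P<ext>[A-Za-z0-9.+-]+);base64,(?P<data>[A-Za-z0-9+/=]+)"'
)


def save_inline_images(html, out_dir):
    """Decode every base64 data-URI image in `html` and write it to `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, match in enumerate(DATA_URI_RE.finditer(html), start=1):
        # Use the MIME subtype as the extension, e.g. image/png -> .png
        ext = match.group("ext").split("+")[0]
        target = out / f"image{index}.{ext}"
        target.write_bytes(base64.b64decode(match.group("data")))
        saved.append(target)
    return saved
```

With the conversion code above, this would be called as `save_inline_images(document.value, "images")`, where `"images"` is a hypothetical output folder.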

- Softchamp

1

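This might help you: click here - sahasrara62

If you are allowed to convert the Word file, you could try converting it to PDF and then extracting the images with one of the methods described here: https://dev59.com/L3E85IYBdhLWcg3wnU0d I don't know whether it will fully meet your needs, but I think it is worth a try. - Daweo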

4 Answers

2

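You can use the docx2txt library, which reads your .docx document and exports the images to a directory you specify (the directory must already exist).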

    # install first: pip install docx2txt
    import docx2txt
    text = docx2txt.process("/path/your_word_doc.docx", '/home/example/img/')

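After it runs, the images will be in /home/example/img/ and the variable text will hold the document text. The images are named image1.png ... imageN.png in order of appearance.

Note: the Word document must be in .docx format.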

- gabriel capparelli

1

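Extract all the images in a docx file using Python

1. Using the docx2txt library

To extract all the images in a docx file, you can use the docx2txt library. It converts a .docx file to plain text and, when given an image directory, also writes out every image it finds along the way.

Example code that extracts all the images from a docx file with docx2txt: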

    import docx2txt
    # extract text only
    text = docx2txt.process(r"filepath_of_docx")
    # extract text and write the images into a temporary image directory
    text = docx2txt.process(r"filepath_of_docx", r"Temporary_Image_Directory")

2. Using aspose

    import aspose.words as aw
    # load the Word document
    doc = aw.Document(r"filepath")
    # retrieve all shapes
    shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)
    imageIndex = 0
    # loop through shapes
    for shape in shapes:
        shape = shape.as_shape()
        if shape.has_image:
            # set image file's name
            imageFileName = f"Image.ExportImages.{imageIndex}_{aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)}"
            # save image
            shape.image_data.save(imageFileName)
            imageIndex += 1

- dataninsight

1

See Alderven's answer to "Extract all the images in a docx file using python".

zipfile works with more image formats than docx2txt. For example, docx2txt cannot extract EMF images, but zipfile can.
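
A minimal sketch of the zipfile approach, using only the standard library. It is not Alderven's code: it relies on the fact that a .docx is a zip archive with its embedded images stored under word/media/, and the function name `extract_docx_media` is made up for the example.

```python
import zipfile
from pathlib import Path


def extract_docx_media(docx_path, out_dir):
    """Copy every file under word/media/ in the .docx archive into `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Embedded images live under word/media/; skip directory entries.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            # Keep only the base name so nothing is written outside out_dir.
            target = out / Path(name).name
            target.write_bytes(archive.read(name))
            saved.append(target)
    return saved
```

Because the files are copied byte for byte, any format Word embedded (PNG, JPEG, EMF, WMF and so on) comes out unchanged, with the file names used inside the archive.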

- FilipeTavares

- K J · Accepted Answer

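Natively, without any library

Extract the source images from the docx file (which is a variant of a zip file) without any distortion or conversion.

Shell out to the operating system and run the following command (this relies on a tar that can read zip archives, such as the bsdtar shipped with Windows 10 and macOS):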

tar -m -xf DocxWithImages.docx word/media

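You will find the source images, as JPEG, PNG, WMF or other files, in the document's word/media folder, extracted into a folder of that name. These are the original embedded files, with no scaling or cropping applied.

You may be surprised to find that the visible area is larger than any cropped version used in the docx itself, so be aware that Word does not always crop images as expected (a source of embarrassing redaction failures).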