使用Python提取docx文件中的所有图片

Question

使用Python提取docx文件中的所有图片

4

我有一个包含6-7张图片的docx文件，我需要自动提取其中的图片。是否有任何win32com ms word API或可以准确提取其中所有图像的库？

这是我尝试过的方法，但问题在于它首先没有给我所有的图像，其次它给了我许多假阳性图像，比如空白图像、极小图像、线条等等... 它也使用MS Word来完成相同的任务。

from pathlib import Path
from win32com.client import Dispatch

xls = Dispatch("Excel.Application")
doc = Dispatch("Word.Application")


def export_images(fp, prefix="img_", suffix="png"):
    """ export all of images(inlineShapes) in the word file.
    :param fp: path of word file.
    :param prefix: prefix of exported images.
    :param suffix: suffix of exported images.
    """

    fp = Path(fp)
    word = doc.Documents.Open(str(fp.resolve()))
    sh = xls.Workbooks.Add()
    for idx, s in enumerate(word.inlineShapes, 1):
        s.Range.CopyAsPicture()
        d = sh.ActiveSheet.ChartObjects().add(0, 0, s.width, s.height)
        d.Chart.Paste()
        d.Chart.Export(fp.parent / ("%s_%s.%s" % (prefix, idx, suffix))
    sh.Close(False)
    word.Close(False)
export_images(r"C:\Users\HPO2KOR\Desktop\Work\venv\us2017010202.docx")

你可以在这里下载docx文件：https://drive.google.com/open?id=1xdw2MieI1n3ulXlkr_iJSKb3cbozdvWq。该文件与it技术相关。

- Himanshu Poddar

4个回答

0

使用 Python 提取 docx 文件中的所有图片

1. 使用 docxtxt

import docx2txt
#extract text 
text = docx2txt.process(r"filepath_of_docx")
#extract text and write images in Temporary Image directory
text = docx2txt.process(r"filepath_of_docx",r"Temporary_Image_Directory")

2. 使用Aspose

import aspose.words as aw
# load the Word document
doc = aw.Document(r"filepath")
# retrieve all shapes
shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)
imageIndex = 0
# loop through shapes
for shape in shapes :
    shape = shape.as_shape()
    if (shape.has_image) :
        # set image file's name
        imageFileName = f"Image.ExportImages.{imageIndex}_{aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)}"
        # save image
        shape.image_data.save(imageFileName)
        imageIndex += 1

- dataninsight

1

请不要在多个问题中发布相同的答案。请根据具体问题进行定制化回答。 - tdy

0

在你的枚举循环中，你应该检查一下形状类型是否为图片。

for idx, s in enumerate(word.inlineShapes, 1):
    if s.Type != 3: # wdInlineShapePicture
        continue
    # ...

- Torben Klein

我仍然没有得到所有的6-7张图片。 - Himanshu Poddar

作为一条通用提示，您可以使用Word宏编辑器（Alt+F11），更准确地说是“直接窗口”（Ctrl+G），来操作对象模型并检查实际的数据结构。 - Torben Klein

嗨 @Torben，我是个新手，如果你能稍微解释一下就太好了。 - Himanshu Poddar

很难在评论中完成。而且很多相关资料可以轻松找到：https://www.google.com/search?q=vba+editor+tutorial - Torben Klein

0

为了做同样的事情，再添加一种方法。我们可以使用doc2txt库来获取所有的图片。

import docx2txt
text = docx2txt.process("docx_file", r"directory where you want to store the images")

请注意，它还会将文件中找到的所有文本都存储在文本变量中。

- Himanshu Poddar

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Alderven · Accepted Answer

您可以通过大小对docx文件进行初步筛选，然后解压其中所有图片：

import zipfile

archive = zipfile.ZipFile('file.docx')
for file in archive.filelist:
    if file.filename.startswith('word/media/') and file.file_size > 300000:
        archive.extract(file)

在您的示例中找到了5张图片：