How to extract text from text shapes inside group shapes in PowerPoint using python-pptx

Question

5

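My PowerPoint slides contain many group shapes, which in turn contain child text shapes.

This is the code I used before, but it does not handle group shapes.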

for eachfile in files:
    prs = Presentation(eachfile)

    textrun=[]
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
                textrun.append(shape.text)
    new_list=" ".join(textrun)
    text_list.append(new_list)

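I am trying to extract the text from these child text boxes. I managed to reach the child elements through GroupShape.shapes, but I get an error saying they are of type "property", so I cannot access the text or iterate over them (TypeError: 'property' object is not iterable).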

from pptx.shapes.group import GroupShape
from pptx import Presentation
for eachfile in files:
    prs = Presentation(eachfile)

    textrun=[]
    for slide in prs.slides:
        for shape in slide.shapes:
            for text in GroupShape.shapes:
                print(text)
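The error message suggests that `shapes` was looked up on the `GroupShape` class rather than on a shape instance. The sketch below reproduces that behaviour in plain Python. `Group` is a hypothetical stand-in class, not python-pptx code, and it assumes the attribute is implemented as an ordinary `property`, as the message in the question implies.

```python
class Group:
    """Hypothetical stand-in for a group shape; not part of python-pptx."""

    def __init__(self, children):
        self._children = children

    @property
    def shapes(self):
        return self._children


# Looked up on the class, the attribute is the property descriptor itself.
print(type(Group.shapes).__name__)

try:
    for child in Group.shapes:
        pass
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Looked up on an instance, the getter runs and returns the children.
group = Group(["text box A", "text box B"])
children = list(group.shapes)
print(children)
```

In the question's loop, the equivalent change would be to iterate over `shape.shapes` on a shape that is actually a group, instead of `GroupShape.shapes`.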

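I would like to get the text and append it to a string for further processing.

My question is: how do I access the child text elements and extract the text from them?

I have spent a lot of time reading the documentation and the source code, but I have not been able to figure it out. Any help would be appreciated.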

- sjm20066

3 Answers

4

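The previous answer misses some deeper "group within a group" cases. A group shape can contain many levels of shapes, including other group shapes, so in many real-world cases the group shapes have to be searched recursively.

The previous answer only parses some of them (down to the second level of group shapes). But group shapes at that level can themselves contain further groups, so we need a recursive search strategy. The easiest way is to reuse the code above, keeping the first half: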

from pptx.shapes.group import GroupShape
from pptx import Presentation
for eachfile in files:
    prs = Presentation(eachfile)

    textrun=[]
    for slide in prs.slides:
        for shape in slide.shapes:

Next, we replace the "for text in GroupShape.shapes:" test with a call to the recursive part:

            textrun=checkrecursivelyfortext(slide.shapes,textrun)

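Then insert a new recursive function definition after the import statements (similar to the code above). To make comparison easier, the inserted function uses the same code as above and only adds the recursive part: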

def checkrecursivelyfortext(shpthissetofshapes,textrun):
    for shape in shpthissetofshapes:
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            textrun=checkrecursivelyfortext(shape.shapes,textrun)
        else:
            if hasattr(shape, "text"):
                print(shape.text)
                textrun.append(shape.text)
    return textrun

- Mats Bengtsson

I posted a solution below that fixes this answer. - garchompstomp

0

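Mats Bengtsson's answer is spot on, but it has a small logic error that causes the shapes to be looped over repeatedly, some non-Pythonic naming, and a missing import.

The error is here: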

for slide in prs.slides:
    for shape in slide.shapes:
        textrun = checkrecursivelyfortext(slide.shapes,textrun)

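Because the function he wrote already loops over all the shapes on the slide, the end result is one full recursive traversal for every shape on the slide!

The fix is simple: remove the second loop, "for shape in slide.shapes", and call the recursive function directly.

For readability, I am posting the whole fixed snippet.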

from pptx.shapes.group import GroupShape
from pptx.enum.shapes import MSO_SHAPE_TYPE
from pptx import Presentation

def check_recursively_for_text(this_set_of_shapes, text_run):
    for shape in this_set_of_shapes:
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            check_recursively_for_text(shape.shapes, text_run)
        else:
            if hasattr(shape, "text"):
                print(shape.text)
                text_run.append(shape.text)
    return text_run


for eachfile in files:
    prs = Presentation(eachfile)
    text_run=[]
    for slide in prs.slides:
        text_run = check_recursively_for_text(slide.shapes, text_run)
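To make the effect of the extra loop concrete, here is a small sketch that runs the same recursion on stand-in objects instead of a real presentation. The `GROUP` and `TEXT` constants and the `text_shape` and `group_shape` helpers are hypothetical and only mimic the attributes the function reads. They are not python-pptx objects.

```python
from types import SimpleNamespace

GROUP = "group"  # stand-in for MSO_SHAPE_TYPE.GROUP
TEXT = "text"    # stand-in for any non-group shape type


def collect_text(shapes, found):
    """Same recursion as check_recursively_for_text, without the print."""
    for shape in shapes:
        if shape.shape_type == GROUP:
            collect_text(shape.shapes, found)
        elif hasattr(shape, "text"):
            found.append(shape.text)
    return found


def text_shape(text):
    return SimpleNamespace(shape_type=TEXT, text=text)


def group_shape(*children):
    return SimpleNamespace(shape_type=GROUP, shapes=list(children))


slide_shapes = [
    text_shape("title"),
    group_shape(text_shape("a"), group_shape(text_shape("b"))),
]

# Buggy pattern: the recursive call sits inside a loop over the same shapes.
buggy = []
for _ in slide_shapes:
    collect_text(slide_shapes, buggy)

# Fixed pattern: one call per slide.
fixed = collect_text(slide_shapes, [])

print(buggy)  # ['title', 'a', 'b', 'title', 'a', 'b']
print(fixed)  # ['title', 'a', 'b']
```

With two top-level shapes the buggy pattern collects every text twice. On a slide with N top-level shapes it would collect each text N times.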

- garchompstomp

- scanny · Accepted Answer

I think you need something like this:

from pptx.enum.shapes import MSO_SHAPE_TYPE

for slide in prs.slides:
    # ---only operate on group shapes---
    group_shapes = [
        shp for shp in slide.shapes
        if shp.shape_type == MSO_SHAPE_TYPE.GROUP
    ]
    for group_shape in group_shapes:
        for shape in group_shape.shapes:
            if shape.has_text_frame:
                print(shape.text)

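A group shape contains other shapes, which are accessible through its .shapes property. It does not have a .text property of its own, so you need to iterate over the shapes in the group and get the text from each of them.

Note that this solution only goes one level deep. If you have groups that contain groups, you can use a recursive approach to walk the tree depth-first and collect the text from it.

Also note that not all shapes have text, so you must check the .has_text_frame property to avoid raising an exception on, say, a picture shape.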
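The recursive, depth-first walk mentioned above could be sketched roughly as follows. This is a minimal sketch under some assumptions, not a definitive implementation: `iter_text` is a hypothetical helper name, and it detects groups by the presence of a `.shapes` attribute rather than by comparing `shape_type` with `MSO_SHAPE_TYPE.GROUP`, which assumes that only group shapes expose `.shapes`. The demo uses stand-in objects so it runs without a .pptx file.

```python
from types import SimpleNamespace


def iter_text(shapes):
    """Yield the text of every text-bearing shape, depth first."""
    for shape in shapes:
        # Assumption: only group shapes expose a `.shapes` collection.
        children = getattr(shape, "shapes", None)
        if children is not None:
            yield from iter_text(children)
        elif getattr(shape, "has_text_frame", False):
            yield shape.text


# Stand-in objects that mimic the attributes read above (not python-pptx).
def text_box(text):
    return SimpleNamespace(has_text_frame=True, text=text)


def group(*children):
    return SimpleNamespace(shapes=list(children))


picture = SimpleNamespace(has_text_frame=False)

slide_shapes = [
    text_box("top level"),
    group(text_box("level 1"), picture, group(text_box("level 2"))),
]

joined = " ".join(iter_text(slide_shapes))
print(joined)  # top level level 1 level 2
```

With a real presentation, the same generator would be called as `iter_text(slide.shapes)` for each slide in `prs.slides`, and the `shape_type` check from the code above could be substituted for the `.shapes` test.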