Parsing docx files with Python

Question

python regex python-docx

3

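I am trying to read headings from multiple docx files. Annoyingly, these headings have no distinguishing paragraph style. Every paragraph has the "Normal" paragraph style, so I am using regex. The headings are formatted in bold and are structured as follows:

A. Cat

B. Dog

C. Pig

D. Fox

If a file has more than 26 headings, the headings continue with "AA.", "BB.", and so on.

I have the following code, which mostly works, but any heading beginning with "D." is printed twice, e.g. [Cat, Dog, Pig, Fox, Fox].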

import os
from docx import Document
import re

directory = input("Copy and paste the location of the files.\n").lower()

for file in os.listdir(directory):

    document = Document(directory+file)

    head1s = []

    for paragraph in document.paragraphs:

        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

        for run in paragraph.runs:

            if run.bold:

                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)

    print(head1s)

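Can anyone tell me whether something in the code is causing this? As far as I can tell, there is nothing special about the formatting or structure of these particular headings in the Word files.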

- Kat hughes

2 Answers

0

You can also use style.name from the same library:

import docx

def find_headings(doc_path):
    # find headings in doc
    doc = docx.Document(doc_path)
    headings = []
    # doc.paragraphs yields Paragraph objects, not (index, paragraph) pairs
    for para in doc.paragraphs:
        if para.style.name == 'Heading 1':
            headings.append(para.text)
    return headings
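
Since the question says every paragraph uses the "Normal" style, it may be worth checking which style names a document actually contains before filtering on style.name. A minimal sketch, assuming only that each paragraph exposes style.name the way python-docx paragraphs do; count_style_names is a hypothetical helper, not part of the library:

```python
from collections import Counter

def count_style_names(paragraphs):
    """Count how many paragraphs use each style name.

    `paragraphs` can be any iterable of objects exposing `.style.name`,
    for example `docx.Document(path).paragraphs`.
    """
    return Counter(p.style.name for p in paragraphs)
```

If the only key that comes back is 'Normal', a filter on 'Heading 1' will return an empty list and the bold-plus-regex approach is still needed.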

- user1890239

- glycoaddict · Accepted Answer

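What is happening is that the loop continues past D. Fox, so in this new loop, even without a match, it prints the last value of head1, which is D. Fox.

I think for run in paragraph.runs: is somehow running twice; maybe there is a second run that exists but is not visible?

Maybe adding a break when the first match is found would be enough to stop the second run from triggering?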

# continues from the question's imports and `directory` variable
for file in os.listdir(directory):

    # os.path.join supplies the path separator if `directory` lacks a trailing one
    document = Document(os.path.join(directory, file))

    head1s = []

    for paragraph in document.paragraphs:

        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

        for run in paragraph.runs:

            if run.bold:

                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)
                    # this break stops the run loop if a match was found.
                    break

    print(head1s)
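
An alternative to break is to decide once per paragraph whether it contains any bold run, so a heading whose text is split across several bold runs can only be appended once. A minimal sketch under that assumption; extract_headings and HEADING_RE are hypothetical names, the function accepts any objects with .text and .runs (such as document.paragraphs), and it captures the title with a regex group instead of split('.'):

```python
import re

# One or more capital letters, a period, whitespace, then the title.
HEADING_RE = re.compile(r'^[A-Z]+\.\s+(.*)')

def extract_headings(paragraphs):
    """Return the titles of bold paragraphs shaped like 'A. Title'."""
    headings = []
    for paragraph in paragraphs:
        match = HEADING_RE.match(paragraph.text)
        # any() gives a single yes/no answer per paragraph, so two bold
        # runs in the same paragraph cannot add the heading twice.
        if match and any(run.bold for run in paragraph.runs):
            headings.append(match.group(1).strip())
    return headings
```

With python-docx this could be called as extract_headings(document.paragraphs). Note that run.bold is None when boldness is inherited from a style, which this sketch treats as not bold, the same as the original if run.bold check.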