Reading the content of a Word document through its XML

Question


excel xml vba ms-word

3

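Background

I am trying to build a Word document browser in Excel to sift through a large number of documents (about 1,000).

Opening a Word document turns out to be rather slow (about 4 seconds per document, so in this case it takes 2 hours to go through all the items, which is far too slow for a single query), even after disabling everything that might slow the opening down. So far, I open each document:

in read-only mode
without the open-and-repair mode (which can kick in on some documents)
with the display of the document disabled

My research so far

These documents are hard to search because some keywords appear in every one of them, only in different contexts (this is not the core issue once the text is loaded into an array). As a result, the commonly suggested Windows Explorer solution (as in this link) does not work in my case.

For now, I have managed to write a working macro that analyzes the content of the Word documents by opening them.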

Code

Here is a sample of the code. Note that it uses a reference to the Microsoft Word 14.0 Object Library.

' Analyzing all the word document within the same folder '
Sub extractFile()

Dim i As Long, j As Long
Dim sAnalyzedDoc As String, sLibName As String, sKeyword As String
Dim aOut()
Dim oWordApp As Word.Application
Dim oDoc As Word.Document

Set oWordApp = CreateObject("Word.Application")

sLibName = ThisWorkbook.Path & "\"
sAnalyzedDoc = Dir(sLibName)
sKeyword = "example of a word"

With Application
    .DisplayAlerts = False
    .ScreenUpdating = False
End With

ReDim aOut(2, 2)
aOut(1, 1) = "Document name"
aOut(2, 1) = "Text"


While (sAnalyzedDoc <> "")
    ' Analyzing documents only with the .doc and .docx extension '
    If Not InStr(sAnalyzedDoc, ".doc") = 0 Then
        ' Opening the document as mentioned above: read-only, without repair, and invisible '
        ' sLibName already ends with a backslash, and the document is opened in the oWordApp instance created above '
        Set oDoc = oWordApp.Documents.Open(sLibName & sAnalyzedDoc, ReadOnly:=True, OpenAndRepair:=False, Visible:=False)
        With oDoc
            For i = 1 To .Sentences.Count
                ' Searching for the keyword within the document '
                If Not InStr(LCase(.Sentences.Item(i)), LCase(sKeyword)) = 0 Then
                    If Not IsEmpty(aOut(1, 2)) Then
                        ReDim Preserve aOut(2, UBound(aOut, 2) + 1)
                    End If
                    aOut(1, UBound(aOut, 2)) = sAnalyzedDoc
                    aOut(2, UBound(aOut, 2)) = .Sentences.Item(i)
                    GoTo closingDoc ' A dubious programming choice but that works for the moment '
                End If
            Next i
closingDoc:
            ' Intending to make the closing faster by not saving the document '
            .Close SaveChanges:=False
        End With
    End If
    'Moving on to the next document '
    sAnalyzedDoc = Dir
Wend

exitSub:
With Output
    .Range(.Cells(1, 1), .Cells(UBound(aOut, 1), UBound(aOut, 2))) = aOut
End With

With Application
    .DisplayAlerts = True
    .ScreenUpdating = True
End With

' Closing the hidden Word instance so it does not keep running in the background '
oWordApp.Quit

End Sub

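My question

The idea I came up with is to access the content directly through the XML inside the document (in recent versions of Word you can reach it by renaming any document with the .zip extension and going to nameOfDocument.zip\word\document.xml).

This would be much faster than loading all the images, charts, and tables in the document, which are useless for a text search.

So I would like to ask whether there is a way in VBA to open a Word document as a zip file, access that XML document, and then process it in VBA like an ordinary character string, given that I already have the path and file name from the code above.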
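
For reference, a .docx file is already a ZIP package, so the renaming step is not needed to reach word/document.xml. The sketch below illustrates this in Python rather than VBA (a swapped-in language, used only because its standard library reads ZIP archives directly). The helper names read_document_xml and make_sample_docx, and the tiny sample package, are assumptions made for illustration and are not part of the original question.

```python
import io
import zipfile

# Minimal stand-in for the main document part (illustration only).
SAMPLE_DOCUMENT_XML = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>example of a word</w:t></w:r></w:p></w:body>"
    "</w:document>"
)


def make_sample_docx():
    """Build an in-memory ZIP holding only word/document.xml (hypothetical helper).

    A real .docx contains more parts; this is just enough to exercise the reader.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", SAMPLE_DOCUMENT_XML)
    buffer.seek(0)
    return buffer


def read_document_xml(docx_source):
    """Return word/document.xml as text; docx_source is a path or file-like object."""
    with zipfile.ZipFile(docx_source) as package:
        # Assumption: the part is UTF-8 encoded, which is what Word normally writes.
        return package.read("word/document.xml").decode("utf-8")


if __name__ == "__main__":
    xml_text = read_document_xml(make_sample_docx())
    print("example of a word" in xml_text.lower())
```

With a real file, read_document_xml would simply be given the path to the .docx. The returned string still contains all the XML markup, so a plain substring search may miss a phrase that Word has split across several runs.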

- Pierre Chevallier

1

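You can access the zipped file directly through the Shell object (http://www.rondebruin.nl/win/s7/win002.htm), but then you are stuck in the quagmire of parsing XML (https://dev59.com/M3VD5IYBdhLWcg3wWKLc), and Word has a horrible way of handling its underlying XML. Good luck. - Mikegrann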

1

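Have a look at VBA macro to search a folder for a keyword. By using the "FindFiles" function described there (use the second version), you take advantage of the Windows index of all the words in all the documents. - PeterT

Thank you both, I will look at the links and try to put something together. - Pierre Chevallier

So far I have concluded that what I want to do (i.e. editing the .docx without changing its extension) cannot be done in VBA. I am currently writing a DLL in C# that might solve the problem, along the lines of the code found on MSDN. I hope to post something about it soon. - Pierre Chevallier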

1 Answer

Content provided by Stack Overflow.

- Pierre Chevallier · Accepted Answer

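Note that this is not a simple answer to the question above. The VBA code in my initial question does the job perfectly well as long as you do not need to browse a large number of documents; otherwise, pick another tool (there is a Python Dynamic Link Library (DLL) that is well suited to this task).

That said, I will try to explain my answer in as much detail as possible.

First of all, this question took me on an endless journey through C# and XML with XPath, which I chose to stop pursuing at some point.

The result cut the time needed to analyze the files from about 2 hours to 10 seconds.

Background

The backbone for reading XML documents, including Word's internal XML documents, is Microsoft's OpenXML library. Keep in mind what I said above: the approach I was trying to implement cannot be done in VBA alone, so it has to be done another way. This may be because VBA is implemented for Office and is therefore limited in how it can access the core structure of Office documents, but I have no information about such a limitation (any information is welcome).

The answer I give here is to write a DLL in C# for VBA. For writing a DLL in C# and referencing it from VBA, I recommend the following link, which gives a better overview of that particular process: C# DLL creation tutorial.

Let's get started

First, you need to reference the WindowsBase library and DocumentFormat.OpenXML in your project to make the solution work, as described in the MSDN articles Manipulate Office Open XML Formats Documents and Open and add text to a word processing document (Open XML SDK). These articles explain at length how to use the OpenXML library to manipulate Word documents.

C# code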

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.IO.Packaging;

namespace BrowserClass
{

    public class SpecificDirectory
    {

        public string[,] LookUpWord(string nameKeyword, string nameStopword, string nameDirectory)
        {
            string sKeyWord = nameKeyword;
            string sStopWord = nameStopword;
            string sDirectory = nameDirectory;

            sStopWord = sStopWord.ToLower();
            sKeyWord = sKeyWord.ToLower();

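            // GetDirectoryName strips the last path segment, so nameDirectory must end with a
            // directory separator (as in the VBA caller below); otherwise the parent folder is scanned
            string sDocPath = Path.GetDirectoryName(sDirectory);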
            // Looking for all the documents with the .docx extension
            string[] sDocName = Directory.GetFiles(sDocPath, "*.docx", SearchOption.AllDirectories);
            string[] sDocumentList = new string[1];
            string[] sDocumentText = new string[1];

            // Cycling the documents retrieved in the folder
            for (int i = 0; i < sDocName.Count(); i++)
            {
                string docWord = sDocName[i];

                // Opening the documents as read only, no need to edit them
                Package officePackage = Package.Open(docWord, FileMode.Open, FileAccess.Read);

                const String officeDocRelType = @"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

                PackagePart corePart = null;
                Uri documentUri = null;

                // We are extracting the part with the document content within the files
                foreach (PackageRelationship relationship in officePackage.GetRelationshipsByType(officeDocRelType))
                {
                    documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
                    corePart = officePackage.GetPart(documentUri);
                    break;
                }

                // Here enter the proper code
                if (corePart != null)
                {
                    string cpPropertiesSchema = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
                    string dcPropertiesSchema = "http://purl.org/dc/elements/1.1/";
                    string dcTermsPropertiesSchema = "http://purl.org/dc/terms/";

                    // Construction of a namespace manager to handle the different parts of the xml files
                    NameTable nt = new NameTable();
                    XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
                    nsmgr.AddNamespace("dc", dcPropertiesSchema);
                    nsmgr.AddNamespace("cp", cpPropertiesSchema);
                    nsmgr.AddNamespace("dcterms", dcTermsPropertiesSchema);

                    // Loading the xml document's text
                    XmlDocument doc = new XmlDocument(nt);
                    doc.Load(corePart.GetStream());

                    // I chose to directly load the inner text because I could not parse the way I wanted the document, but it works so far
                    string docInnerText = doc.DocumentElement.InnerText;
                    docInnerText = docInnerText.Replace("\\* MERGEFORMAT", ".");
                    docInnerText = docInnerText.Replace("DOCPROPERTY ", "");
                    docInnerText = docInnerText.Replace("Glossary.", "");

                    try
                    {
                        Int32 iPosKeyword = docInnerText.ToLower().IndexOf(sKeyWord);
                        Int32 iPosStopWord = docInnerText.ToLower().IndexOf(sStopWord);

                        if (iPosStopWord == -1)
                        {
                            iPosStopWord = docInnerText.Length;
                        }

                        if (iPosKeyword != -1 && iPosKeyword <= iPosStopWord)
                        {
                            // Redimensions the array if there was already a document loaded
                            if (sDocumentList[0] != null)
                            {
                                Array.Resize(ref sDocumentList, sDocumentList.Length + 1);
                                Array.Resize(ref sDocumentText, sDocumentText.Length + 1);
                            }
                            sDocumentList[sDocumentList.Length - 1] = docWord.Substring(sDocPath.Length, docWord.Length - sDocPath.Length);
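                            // Taking a small context around the keyword, clamped to the end of the text
                            // so that a match near the end does not throw ArgumentOutOfRangeException
                            int contextLength = Math.Min(sKeyWord.Length + 60, docInnerText.Length - iPosKeyword);
                            sDocumentText[sDocumentText.Length - 1] = ("(...) " + docInnerText.Substring(iPosKeyword, contextLength) + " (...)");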
                        }

                    }
                    catch (ArgumentOutOfRangeException)
                    {
                        Console.WriteLine("Error reading inner text.");
                    }
                }
                // Closing the package to enable opening a document right after
                officePackage.Close();
            }

            if (sDocumentList[0] != null)
            {
                // Preparing the array for output
                string[,] sFinalArray = new string[sDocumentList.Length, 2];

                for (int i = 0; i < sDocumentList.Length; i++)
                {
                    sFinalArray[i, 0] = sDocumentList[i].Replace("\\", "");
                    sFinalArray[i, 1] = sDocumentText[i];
                }
                return sFinalArray;
            }
            else 
            {
                // Preparing the array for output
                string[,] sFinalArray = new string[1, 1];
                sFinalArray[0, 0] = "NO MATCH";
                return sFinalArray;
            }
        }
    }

}
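
The matching rule used in LookUpWord (the keyword only counts if it occurs before the stop word, and a short context is then cut out after it) can be sketched in isolation. The version below is written in Python rather than C#, purely as an illustration. The function name find_context and the window parameter are hypothetical, and treating an empty stop word as "no stop word" is an assumption that differs from the C# code above.

```python
def find_context(text, keyword, stop_word, window=60):
    """Return a snippet starting at the keyword, or None when there is no usable match.

    A match is usable only if the keyword occurs at or before the first stop word.
    """
    lowered = text.lower()
    pos_keyword = lowered.find(keyword.lower())
    # Assumption: an empty stop word means "search the whole text".
    pos_stop = lowered.find(stop_word.lower()) if stop_word else -1
    if pos_stop == -1:
        pos_stop = len(text)

    if pos_keyword == -1 or pos_keyword > pos_stop:
        return None

    # Clamp the end of the window to the end of the text. Indices found in the
    # lowered copy are applied to the original text, which assumes that
    # lowercasing does not change the length of the string.
    end = min(pos_keyword + len(keyword) + window, len(text))
    return "(...) " + text[pos_keyword:end] + " (...)"


if __name__ == "__main__":
    sample = "Intro. The Example Of A Word appears here. Glossary: example of a word."
    print(find_context(sample, "example of a word", "glossary"))
    print(find_context("glossary first, then keyword", "keyword", "glossary"))
```

Clamping the window means a keyword found within the last 60 characters of a document still produces a result, instead of relying on an exception handler to skip it.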

The corresponding VBA code:

Option Explicit

Const sLibname As String = "C:\pathToYourDocuments\"

Sub tester()

Dim aFiles As Variant
Dim LookUpDir As BrowserClass.SpecificDirectory
Set LookUpDir = New BrowserClass.SpecificDirectory

' The array will contain all the files which contain the "searchedPhrase" '
aFiles = LookUpDir.LookUpWord("searchedPhrase", "stopWord", sLibname)

' Add here any necessary processing if needed '

End Sub

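In the end, you get a tool that scans .docx documents much faster than the classic open-read-close approach in VBA, at the cost of writing more code.

Above all, you give your users a simple solution, especially when there is a large number of Word documents to run simple searches on.

Note:

Parsing Word .XML files in VBA can be very difficult, as @Mikegrann pointed out. Fortunately, OpenXML comes with an XML parser (see C# , xml parsing. get data between tags) that does the work for you in C# and retrieves the <w:t></w:t> tags that hold the document text. Although I found the following answers, I could not get them to work: Parsing a MS Word generated XML file in C#, Reading specific XML elements from XML file.

So I went with the .InnerText solution shown above to get the inner text, at the cost of some field-code text leaking into the output (for example \* MERGEFORMAT).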
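
For comparison, here is a minimal sketch of reading only the <w:t> elements, so that field instructions (which WordprocessingML stores in <w:instrText>, not in <w:t>) never reach the output. It is written in Python with the standard library rather than in C# with OpenXML, so it is an illustration of the idea and not the method used in this answer. The helper names extract_paragraphs and make_sample_docx and the sample XML are assumptions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML vocabulary (the "w:" prefix).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Minimal stand-in for word/document.xml, including one field instruction.
SAMPLE_DOCUMENT_XML = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>An example </w:t></w:r><w:r><w:t>of a word</w:t></w:r></w:p>"
    r"<w:p><w:r><w:instrText> DOCPROPERTY Title \* MERGEFORMAT </w:instrText></w:r>"
    "<w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)


def make_sample_docx():
    """Build an in-memory ZIP holding only word/document.xml (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", SAMPLE_DOCUMENT_XML)
    buffer.seek(0)
    return buffer


def extract_paragraphs(docx_source):
    """Return one string per <w:p>, built from its <w:t> runs only."""
    with zipfile.ZipFile(docx_source) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    paragraphs = []
    # Note: paragraphs nested inside another paragraph (for example in text
    # boxes) would be reported twice by this simple traversal.
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


if __name__ == "__main__":
    for line in extract_paragraphs(make_sample_docx()):
        print(line)
```

Joining the runs of each paragraph also keeps a phrase searchable when Word has split it across several <w:t> elements, which a raw search of the XML string would miss.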