How to handle a negative lookbehind in Excel VBA regular expressions?

Question

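The Excel VBA code below uses a regular expression to extract section numbers from HTML files. However, the pattern contains a negative lookbehind, which the VBA regex engine (VBScript.RegExp) does not support: "(?<!tbl"")>(\d(\.\d)+)<"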

Sub GetAllSectionNumbers()
    LRb = Cells(Rows.Count, "B").End(xlUp).Row
    Range("B7:C" & LRb).ClearContents
    Dim fileDialog As fileDialog
    Set fileDialog = Application.fileDialog(msoFileDialogOpen)
    
    fileDialog.AllowMultiSelect = True
    fileDialog.Title = "Select HTML files"
    fileDialog.Filters.Clear
    fileDialog.Filters.Add "HTML files", "*.htm;*.html", 1
    
    If fileDialog.Show <> -1 Then Exit Sub
    
    Dim file As Variant
    For Each file In fileDialog.SelectedItems
        Dim fileContents As String
        Open file For Input As #1
        fileContents = Input$(LOF(1), 1)
        Close #1
        
        Dim regex As Object
        Set regex = CreateObject("VBScript.RegExp")
        regex.Pattern = "(?<!tbl"")>(\d(\.\d)+)<"
        regex.Global = True
        regex.IgnoreCase = True
        regex.MultiLine = True
        TRET = regex.Pattern
        filePath = file
        fileFolder = Left(filePath, InStrRev(filePath, "\"))
        fileNameSource = Mid(filePath, InStrRev(filePath, "\") + 1, 100)
    
        Dim match As Object
        Set match = regex.Execute(fileContents)
        
        Dim i As Long
        For i = 0 To match.Count - 1
            LRb = Cells(Rows.Count, "B").End(xlUp).Row + 1
    
            Range("B" & LRb).Value = match.Item(i).SubMatches(0)
            Range("C" & LRb).Value = fileNameSource
        Next i
    Next file
    MsgBox "Done!"
End Sub

Is there an alternative regex solution that handles this?

- MK01111000

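Usually in cases like this, the "best regex trick ever" applies: match what you do not need, and match and capture what you do need. tbl">\d(?:\.\d)+<|>(\d(?:\.\d)+)< should work; just grab the captured value (match.SubMatches(0)). - Wiktor Stribiżew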

Thanks, that solution solved my problem perfectly. Could you post it as an answer so that I can accept it? - MK01111000

1 Answer

- Wiktor Stribiżew · Accepted Answer

When extracting data, the customary approach is the "best regex trick ever": match what you do not need, and match and capture what you do need.

In this particular case, the regex looks like this:

tbl">\d(?:\.\d)+<|>(\d(?:\.\d)+)<

In your code, it looks like this:

regex.Pattern = "tbl"">\d(?:\.\d)+<|>(\d(?:\.\d)+)<"

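Next, in your code, check whether match.SubMatches(0) actually holds a value and, if it does, take it, since that is the value you need. When the first alternative (the tbl"> branch) matches, the capturing group does not participate and SubMatches(0) holds no text, so that match can be skipped.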
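As a rough illustration of that check, the loop could be written along the following lines. This is only a sketch, not code from the answer: the helper name ExtractSectionNumbers and the sample HTML string are hypothetical, and it uses Len(...) > 0 as the test on the assumption that a non-participating group yields either Empty or an empty string.

```vba
' Hypothetical helper (not part of the original answer): returns every
' section number captured by group 1, skipping the tbl"> matches.
Function ExtractSectionNumbers(ByVal html As String) As Collection
    Dim regex As Object
    Set regex = CreateObject("VBScript.RegExp")
    ' First alternative consumes the unwanted tbl">1.2< text without capturing;
    ' second alternative captures the section number we want.
    regex.Pattern = "tbl"">\d(?:\.\d)+<|>(\d(?:\.\d)+)<"
    regex.Global = True
    regex.IgnoreCase = True

    Dim results As New Collection
    Dim m As Object
    For Each m In regex.Execute(html)
        ' Group 1 only has text when the second alternative matched.
        If Len(m.SubMatches(0)) > 0 Then results.Add CStr(m.SubMatches(0))
    Next m

    Set ExtractSectionNumbers = results
End Function

Sub DemoExtractSectionNumbers()
    ' Made-up sample input: one number inside a tbl cell, one outside.
    Dim sample As String
    sample = "<td class=""tbl"">1.2</td><h2>3.4.5</h2>"

    Dim item As Variant
    For Each item In ExtractSectionNumbers(sample)
        Debug.Print item
    Next item
End Sub
```

In the question's own loop, the same idea would amount to wrapping the two Range(...).Value assignments in an If Len(match.Item(i).SubMatches(0)) > 0 Then ... End If block, so that rows are only written for matches that carry a captured section number.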

See the regex demo.