访问两个特定标题之间的所有li元素

3

问题描述:

我正在寻找一种方法仅访问两个特定标题标签之间的 li 元素(例如从第2个 h3 到第 3 个 h3,或从第 3 个 h3 到下一个 h4),以便根据标题中提到的标准创建列在 https://de.wikipedia.org/wiki/1._Januar 上列出的历史事件表。 对我而言,主要问题是 - 除了 h1 标题之外,较低级别的字幕没有 classNameid

HTML示例:

<div class="mw-parser-output">
  [...]
  </h3>
  <ul>
    <li><a href="/wiki/153_v._Chr." title="153 v. Chr.">153 v. Chr.</a>: Die <a href="/wiki/Consulat" title="Consulat">Konsuln</a> der <a href="/wiki/R%C3%B6mische_Republik" title="Römische Republik">römischen Republik</a> beginnen ihre Amtszeit erstmals
      am 1. Januar statt am 1. März; daher ist der 1. Januar heute der Jahresanfang.</li>
    <li><span style="visibility:hidden;">0</span><a href="/wiki/45_v._Chr." title="45 v. Chr.">45 v. Chr.</a>: <a href="/wiki/Kalenderreform_des_Gaius_Iulius_Caesar" title="Kalenderreform des Gaius Iulius Caesar">Caesars Reform</a> des <a href="/wiki/R%C3%B6mischer_Kalender"
        title="Römischer Kalender">römischen Kalenders</a> endet. Dieser wird ab 2. Januar 709 <a href="/wiki/Ab_urbe_condita_(Chronologie)" title="Ab urbe condita (Chronologie)">a. u. c.</a> durch den <a href="/wiki/Julianischer_Kalender" title="Julianischer Kalender">julianischen Kalender</a>      ersetzt.</li>
    <li><span style="visibility:hidden;">0</span><a href="/wiki/32_v._Chr." title="32 v. Chr.">32 v. Chr.</a>: <a href="/wiki/Augustus" title="Augustus">Oktavian</a> lässt sich vom <a href="/wiki/R%C3%B6mischer_Senat" title="Römischer Senat">Senat</a> zum
      „Führer Italiens“ (<i><a href="/wiki/Dux_(Titel)" title="Dux (Titel)">dux Italiae</a></i>) ausrufen. Er erklärt <a href="/wiki/Kleopatra_VII." title="Kleopatra VII.">Kleopatra</a> und damit <i><a href="/wiki/De_jure/de_facto" title="De jure/de facto">de facto</a></i>      auch <a href="/wiki/Marcus_Antonius" title="Marcus Antonius">Marcus Antonius</a> den Krieg.</li>
  </ul>
  [...]
  </ul>
  <h4><span id="Inkrafttreten_von_Gesetzen_und_Staatsvertr.C3.A4gen"></span><span class="mw-headline" id="Inkrafttreten_von_Gesetzen_und_Staatsverträgen">Inkrafttreten von Gesetzen und Staatsverträgen</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>
    <a href="/w/index.php?title=1._Januar&amp;veaction=edit&amp;section=3" class="mw-editsection-visualeditor" title="Abschnitt bearbeiten: Inkrafttreten von Gesetzen und Staatsverträgen">Bearbeiten</a><span class="mw-editsection-divider"> | </span>
    <a href="/w/index.php?title=1._Januar&amp;action=edit&amp;section=3" title="Abschnitt bearbeiten: Inkrafttreten von Gesetzen und Staatsverträgen">Quelltext bearbeiten</a><span class="mw-editsection-bracket">]</span></span>
  </h4>
  <p><i>Der 1. Januar wird oft für das Inkrafttreten von Gesetzen und Staatsverträgen verwendet. Das gilt unter anderem für:</i>
  </p>
  <ul>
    <li><a href="/wiki/1812" title="1812">1812</a>: das <i><a href="/wiki/Allgemeines_b%C3%BCrgerliches_Gesetzbuch" title="Allgemeines bürgerliches Gesetzbuch">Allgemeine bürgerliche Gesetzbuch</a></i> <i>(ABGB)</i> in den <a href="/wiki/Habsburgermonarchie#Erblande"
        title="Habsburgermonarchie">habsburgischen Erblanden</a>.</li>
  </ul>
  [...]
  </h4>
  <p><i>Folgende Staaten erhalten am 1. Januar ihre Unabhängigkeit:</i>
  </p>
  <ul>
    [...]
  </ul>
  <h3><span class="mw-headline" id="Wirtschaft">Wirtschaft</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=1._Januar&amp;veaction=edit&amp;section=6" class="mw-editsection-visualeditor" title="Abschnitt bearbeiten: Wirtschaft">Bearbeiten</a>
    <span class="mw-editsection-divider"> | </span><a href="/w/index.php?title=1._Januar&amp;action=edit&amp;section=6" title="Abschnitt bearbeiten: Wirtschaft">Quelltext bearbeiten</a><span class="mw-editsection-bracket">]</span></span>
  </h3>
  <h4><span class="mw-headline" id="Wichtige_Ereignisse_in_der_Weltwirtschaft">Wichtige Ereignisse in der Weltwirtschaft</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=1._Januar&amp;veaction=edit&amp;section=7"
      class="mw-editsection-visualeditor" title="Abschnitt bearbeiten: Wichtige Ereignisse in der Weltwirtschaft">Bearbeiten</a><span class="mw-editsection-divider"> | </span><a href="/w/index.php?title=1._Januar&amp;action=edit&amp;section=7" title="Abschnitt bearbeiten: Wichtige Ereignisse in der Weltwirtschaft">Quelltext bearbeiten</a>
    <span class="mw-editsection-bracket">]</span>
    </span>
  </h4>
  <ul>
    <li><a href="/wiki/1780" title="1780">1780</a>: In <a href="/wiki/Geschichte_Bratislavas" title="Geschichte Bratislavas">Preßburg</a> erscheint die erste ungarische Zeitung <i>Magyar hírmondó</i> („Ungarischer Kurier“).</li>

到目前为止,我仅使用以下代码成功访问了所有未包含在目录表中的li元素(超过1000个!):

实验性代码示例:


Sub HistoricalEvents_Test()
    Dim http   As Object, html As New MSHTML.HTMLDocument
    Dim oLiList As MSHTML.IHTMLDOMChildrenCollection
    Dim data   As String
    Dim r      As Integer
    Dim oWord  As Object, oWordDoc As Object
    Dim wordApp As New Word.Application
    Dim iFirstRow As Integer, iLastRow As Integer

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "https://de.wikipedia.org/wiki/1._Januar", False
    http.send
    html.body.innerHTML = http.responseText
   
      
    Dim lLiResultList As Long
    Dim lLiResultLoop As Long
    
    Set oLiList = html.querySelectorAll("#toc ~ ul li")
      
    For lLiResultLoop = 0 To oLiList.Length - 1
        Dim oLiChild As Object
        Set oLiChild = oIlList.Item(lLilResultLoop)
            data = oLiChild.innerText   'data = data & vbCrLf & oLiChild.innerText
            Range("B" & lLiResultLoop +1).Value = data
            data = vbNullString
    Next lLiResultLoop
    
    
    Dim j      As Long
    Dim Ws As Worksheet
    Dim rngDB As Range
    Set Ws = ActiveSheet
    Set oWord = CreateObject("Word.Application")

    Set oWordDoc = oWord.Documents.Open("D:\Jahrestage Geschichte.docx")
    iFirstRow = 1             ' "Ws.Cells(1, 2).End(xlDown).Row" used to work fine but suddenly gives same as iLastRow!
    'Debug.Print iFirstRow
    iLastRow = Ws.Cells(ActiveSheet.Rows.Count, "B").End(xlUp).Row
    'Debug.Print iLastRow
    oWord.Visible = True
    
    With wordApp
        With Ws
            Set rngDB = Ws.Range(.Cells(iFirstRow, 2), .Cells(iLastRow, 2))
        End With
        rngDB.Cut
            oWord.Selection.PasteSpecial DataType:=wdPasteText
            oWord.Selection.TypeParagraph
            oWord.Selection = ""
    End With
    
    oWordDoc.Close savechanges:=True
    wordApp.Quit                                  'it doesn't :(
    
End Sub

总体构思/最终项目描述

最终项目应该为每个月准备一个工作表,每个工作表都包含一个表格,其中每一行对应该月的每一天,列则对应相应的(子)标题的不同类别。代码中的Word输出只是一个初步的副产品,只有在解决主要问题时才会完成它。

进一步说明

这是我在SO上的第一次贡献。当涉及到vba和网络爬虫(或任何编码、脚本或编程方面)时,我是一个绝对的初学者,但我被它吸引住了,并且用我的整个寒假时间来弄清楚上述代码。即使没有像SO的高手分享给像我这样的新手无价的知识,我也无法完成那个可怜的脚本。我尝试过各种方法,但总是卡在某些地方,VBA触发运行时错误,Excel经常崩溃。特别是,我无法成功实现nextSibling/previousSibling方法或nodeName选择器,我认为这可能是解决问题的一种有希望的方法。所以,任何帮助或提示都将不胜感激!

有效的解决方案:

由于对我的问题的反馈,我终于设法找到了一个解决方案,虽然可能不是最优雅的方式。唯一剩下的问题是奇怪的是,最后一列的li元素被复制了。所以如果有人知道如何处理这个问题......

Sub SliceHtmlByHeaderTypes4()

    Dim http   As Object, html As MSHTML.HTMLDocument
    Dim sh     As Worksheet
    Set sh = ThisWorkbook.ActiveSheet
    
    Set http = CreateObject("MSXML2.XMLHTTP"): Set html = New MSHTML.HTMLDocument
    http.Open "GET", "https://de.wikipedia.org/wiki/1._Januar", False
    http.send
    html.body.innerHTML = http.responseText
    
    Dim hNodeList As Object
    Dim startPos As Long, endPos As Long
    Dim s      As Integer, e As Integer
    
    Set hNodeList = html.querySelectorAll("#toc ~ h2, #toc ~ h3, #toc ~ h4")
    Debug.Print hNodeList.Length
    
    Do While s < hNodeList.Length - 1
        http.Open "GET", "https://de.wikipedia.org/wiki/1._Januar", False
        http.send
        html.body.innerHTML = http.responseText
        Set hNodeList = html.querySelectorAll("#toc ~ h2, #toc ~ h3, #toc ~ h4")
        startPos = InStr(html.body.outerHTML, hNodeList.Item(s).outerHTML)
        endPos = InStr(html.body.outerHTML, hNodeList.Item(s + 1).outerHTML)
        
        If startPos > 0 And endPos > 0 And endPos > startPos Then
            Dim strS  As String
            strS = Mid$(html.body.outerHTML, startPos, endPos - startPos + 1)
        Else
            MsgBox "Problem slicing string"
            Stop
            Exit Sub
        End If
        
        Dim liList As Object
        
        html.body.innerHTML = strS
        
        Set liList = html.getElementsByTagName("li")
        
        If liList.Length > 0 Then
            Dim i      As Integer
            Dim liText As String
            Dim lc As Integer
            Dim liRange As Range
            lc = (Cells(2, Columns.Count).End(xlToLeft).Column) + 1
            Set liRange = sh.Range(Cells(2, lc), Cells(2, lc))
            
            For i = 0 To liList.Length - 1
                On Error Resume Next
                liText = liList.Item(i).innerText
                liRange.Value = liRange.Value & liText & vbNewLine
                liText = vbNullString
            Next i
            
            strS = vbNullString
            startPos = 0
            endPos = 0
            hNodeList = ""
            i = 0
        End If
        s = s + 1
    Loop
End Sub

1
那么上述内容的期望输出是什么样子的?当你说“(例如从第二个h3到第三个h3或从第三个h3到下一个h4)”时,你是否希望相同的代码适用于两种情况/其他情况?如果该部分是“政治和世界事务”,它最终会是什么样子?我猜你是指选择标题标签之间的部分? - QHarr
2
HTML文档只是一个文本,其中元素是唯一标识的。因此,您可以使用INSTR函数搜索一个元素,然后搜索另一个元素,它们之间的字符串包含它们之间的所有元素。 - Variatus
1
是的。您可以找到标题1的起始位置并在那里切片,然后在下一个标题的开头再次切片,以便只提取1个HTML部分。您可能需要调整HTML以使其有效地进行解析,例如通过确保在标题标签<h1.....的开头剪切并捕获关闭</h1>等...将其传递给HTMLDocument的.body.innerHTML,并使用内置的HTML解析器。 - QHarr
1
抱歉,我误读了那一行。是我的错。看起来你想要 html.querySelectorAll("#toc ~ ul") - QHarr
1
没问题!你的新 html.querySelectorAll("#toc ~ ul") 完美地运行了。顺便说一下,非常感谢你,这种使用“~”的方法对我来说是全新的。多么方便啊! - slintezgeu
显示剩余8条评论
2个回答

1

不必使用循环,您可以一次性复制并粘贴范围。

Sub HistoricalEvents_Test()
    Dim http   As Object, html As New MSHTML.HTMLDocument
    Dim oUlList As MSHTML.IHTMLDOMChildrenCollection, oLiList As MSHTML.IHTMLDOMChildrenCollection
    Dim data   As String
    Dim r      As Integer
    Dim oWord  As Word.Application ' Object
    Dim oWordDoc As New Word.document ' Object
    Dim wordApp As New Word.Application
    Dim iFirstRow As Integer, iLastRow As Integer

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "https://de.wikipedia.org/wiki/1._Januar", False
    http.send
    html.body.innerHTML = http.responseText
    
    'this works pretty fast, but a little reformatting on the Word document (yet to be implemented) is needed:
      
    Dim lUlResultList As Long
    Dim lUlResultLoop As Long
    
    Set oUlList = html.querySelectorAll("div.mw-parser-output ul")
    
    lUlResultList = oUlList.Length
    
    For lUlResultLoop = 0 To oUlList.Length - 1
        Dim oUlChild As Object
        Set oUlChild = oUlList.Item(lUlResultLoop)
        If Not oUlChild.FirstChild.className Like "*toclevel*" Then
            data = oUlChild.innerText             'data = data & vbCrLf & oUlChild.innerText
            Range("B" & lUlResultLoop).Value = data
            data = vbNullString
        End If
    Next lUlResultLoop
    
    
    'this works as well, no reformatting needed, but pasting to Word is much slower:
    
    '    Dim lLiResultList As Long
    '    Dim lLiResultLoop As Long
    
    '    Set oLiList = html.querySelectorAll("div.mw-parser-output ul li")
    
    '    lLiResultList = oLiList.Length
    
    '    For lLiResultLoop = 0 To oLiList.Length - 1
    '        Dim oLiChild As Object
    '        Set oLiChild = oLiList.Item(lLiResultLoop)
    '        If Not oLiChild.className Like "*toclevel*" Then
    '            data = oLiChild.innerText           'data = data & vbCrLf & oLiChild.innerText
    '            Range("B" & lLiResultLoop).Value = data
    '            data = vbNullString
    '        End If
    '    Next lLiResultLoop
    
    '********************************************************************************************
    
    Dim j      As Long
    Dim Ws As Worksheet
    Dim rngDB As Range
    Set Ws = ActiveSheet
    Set oWord = CreateObject("Word.Application")

    Set oWordDoc = oWord.Documents.Open("D:\Jahrestage Geschichte.docx")
    'Set oWordDoc = oWord.Documents.Add
    iFirstRow = Ws.Cells(1, 2).End(xlDown).Row
    'Debug.Print iFirstRow
    iLastRow = Ws.Cells(ActiveSheet.Rows.Count, "B").End(xlUp).Row
    'Debug.Print iLastRow
    oWord.Visible = True
    
    With wordApp
        With Ws
            Set rngDB = Ws.Range(.Cells(iFirstRow, 2), .Cells(iLastRow, 2))
        End With
        rngDB.Cut
            oWord.Selection.PasteSpecial DataType:=wdPasteText
            oWord.Selection.TypeParagraph
'        For r = iFirstRow To iLastRow
'            Range(Cells(r, 2), Cells(r, 2)).Cut
'            oWord.Selection.PasteSpecial DataType:=wdPasteText
'            oWord.Selection.TypeParagraph
'            For j = 1 To 4
'                Dim t  As Double
'                t = Timer
'                Do Until Timer - t >= 0.4         'can't go faster or error 4605 occurs!
'                    DoEvents
'                Loop
'            Next
            oWord.Selection.TypeParagraph
            oWord.Selection = ""
        'Next r
    End With
    
    oWordDoc.Close savechanges:=True
    wordApp.Quit                                  'it doesn't :(
    
End Sub

太棒了!你的解决方案完美地运行,而且速度非常快!非常感谢你! - slintezgeu
不确定我接受这个答案是否涉及到Word导出方面,是否意味着我的主要问题——在两个标签之间选择元素的问题也得到了回答。实际上,它还没有完全解决,但我已经收到了很多有价值的反馈,可以继续开发我的项目。所以感谢大家的反馈! - slintezgeu

1
这是我所说的缩小HTML范围到特定标题之间的示例(基于位置和标题类型,例如h2标题,第一和第二个)。使用CSS选择器确保h2列表仅包括所需的h2元素列表,然后从中选择位置,即挑选第一个和第二个项目;outerHTML用于确保我在元素的html开头切片,以保留标签结构以便插入回HTML.body.innerHTML。然后,我使用子集html内容查询HTMLDocument以获取li元素。希望您能看出如何扩展此原则。为了混合标题类型(注意CSS可选择性和排序),可以使用OR语法在开始时检索混合标题节点列表,例如h2和h3将是html.querySelectorAll(“#toc〜h2,#toc〜h3”)。使用后一种方法时,请非常小心并验证返回的节点列表是否按预期顺序排列。
Option Explicit

Public Sub SliceHtmlByHeaderTypes()
    
    Dim http As Object, html As MSHTML.HTMLDocument
    
    Set http = CreateObject("MSXML2.XMLHTTP"): Set html = New MSHTML.HTMLDocument
    http.Open "GET", "https://de.wikipedia.org/wiki/1._Januar", False
    http.send
    html.body.innerHTML = http.responseText
    
    Dim hNodeList As Object
    
    Set hNodeList = html.querySelectorAll("#toc ~ h2")
    
    Dim startPos As Long, endPos As Long
    
    startPos = InStr(html.body.outerHTML, hNodeList.Item(0).outerHTML) ' we wanna split between 1st and 2nd i.e. indices 0 and 1
    endPos = InStr(html.body.outerHTML, hNodeList.Item(1).outerHTML)
    
    Debug.Print hNodeList.Item(0).innerText
    Debug.Print hNodeList.Item(1).innerText
    
    If startPos > 0 And endPos > 0 And endPos > startPos Then
        Dim s As String
        s = Mid$(html.body.outerHTML, startPos, endPos - startPos + 1)
    Else
       MsgBox "Problem slicing string"
       Exit Sub
    End If
    
    Dim liList As MSHTML.IHTMLElementCollection
    
    html.body.innerHTML = s  'replace html content with new spliced content
    
    Set liList = html.getElementsByTagName("li") 'then do something with list of lis
    
    Stop

End Sub

请记住,有许多维基API可以提供您需要的内容,例如:

https://dev59.com/V3RB5IYBdhLWcg3wbGtB


1
这非常有帮助!正是我在寻找的那种示例,并且解释得连像我这样的初学者都能理解。我将根据这个输入继续我的项目。非常感谢!顺便说一句,我不得不将liList的声明更改为“Object”,否则会触发运行时错误13。虽然它似乎不影响功能。 - slintezgeu
由于某种无法解释的原因,querySelectorAll不再返回任何匹配项,而hNodeList.Length始终为空。我没有对代码进行任何更改,所以我真的不知道为什么它突然停止工作。网站的HTML在我看来似乎没有改变。您有什么想法吗? - slintezgeu
1
你用 html.getElementsByClassName("mw-parser-output")(0).getElementsByTagName("h2") 找到匹配了吗? - QHarr
没问题。hNodeList.Length是6,而liList包含40个项目。 - slintezgeu
1
好的,我会使用那个方法,直到我找出为什么 querySelectorAll 方法突然停止工作。谢谢! - slintezgeu
显示剩余4条评论

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接