VBA, getElementsByClassName, and the double quotes missing from the HTML source

4

I use VBA to scrape some websites, using XMLHTTP and HTMLDocument (because that is faster than InternetExplorer.Application).

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    Dim XMLHTTPReq As Object
    Dim htmlDoc As HTMLDocument

    Dim postURL As String

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

        Set XMLHTTPReq = New MSXML2.XMLHTTP

        With XMLHTTPReq
            .Open "GET", postURL, False
            .Send
        End With

        Set htmlDoc = New HTMLDocument
        With htmlDoc
            .body.innerHTML = XMLHTTPReq.responseText
        End With

        i = 0

        Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

        For Each vr In varTemp
            ' the next line is what reveals the issue (*1)
            Cells(1, 1) = vr.outerHTML
            Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date")
            Cells(i + 1, 3) = varTemp2.Item(0).innerText
            ' the next line raises run-time error 438
            Set varTemp2 = vr.getElementsByClassName("hover_inner")
            Cells(i + 1, 4) = varTemp2.innerText

            i = i + 1

        Next vr
End Sub

Using the line marked *1, Cells(1, 1) shows me the following:
<DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank>
<DIV class=hover_inner><SPAN class=post_date>...............

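Yes, all the class attributes have lost their double quotes; only the class of the first element still has them.

//I really don't know why this happens.

//I could parse it with getElementsByTagName("span"), but I would prefer to use the class.....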


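I believe HTML does not require quotes around attribute values that contain no spaces, and what you see in outerHTML reflects IE's representation of that. But that is probably not the cause of your error. - Tim Williams
Thanks, everyone! @TimWilliams I see. So is getElementsByTagName("span") the only way I can get the innerText? - Soborubang
@barrowc Sorry, your code gives the same error message :( But thanks for your help! - Soborubang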
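I'm reviewing your code now for an alternative, but at the very least Cells(i + 1, 4) = varTemp2.innerText should be Cells(i + 1, 4) = varTemp2(0).innerText. getElementsByClassName returns a collection, not a single object, even when the collection contains only one element. - user4039065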
@Jeeped - querySelectorAll on an HTMLDocument instance works for me. - Tim Williams
2 Answers

5

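The getElementsByClassName method is not treated as a method of the element itself, only of the parent HTMLDocument. If you want to use it to locate elements inside a DIV element, you need to create a sub-HTMLDocument made up of that specific DIV element's .outerHTML.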

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    Dim xmlHTTPReq As New MSXML2.XMLHTTP
    Dim htmlDOC As New HTMLDocument, divSUBDOC As New HTMLDocument
    Dim iDIV As Long, iSPN As Long, iEL As Long
    Dim postURL As String, nr As Long, i As Long

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    With xmlHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    'Set htmlDOC = New HTMLDocument
    With htmlDOC
        .body.innerHTML = xmlHTTPReq.responseText
    End With

    i = 0

    With htmlDOC
        For iDIV = 0 To .getElementsByClassName("post_glass post_micro_glass").Length - 1
            nr = Sheet1.Cells(Rows.Count, 3).End(xlUp).Offset(1, 0).Row
            With .getElementsByClassName("post_glass post_micro_glass")(iDIV)
                'method 1 - run through multiples in a collection
                For iSPN = 0 To .getElementsByTagName("span").Length - 1
                    With .getElementsByTagName("span")(iSPN)
                        Select Case LCase(.className)
                            Case "post_date"
                                Cells(nr, 3) = .innerText
                            Case "post_notes"
                                Cells(nr, 4) = .innerText
                            Case Else
                                'do nothing
                        End Select
                    End With
                Next iSPN
                'method 2 - create a sub-HTML doc to facilitate getting els by classname
                divSUBDOC.body.innerHTML = .outerHTML  'only the HTML from this DIV
                With divSUBDOC
                    If CBool(.getElementsByClassName("hover_inner").Length) Then 'there is at least 1
                        'use the first
                        Cells(nr, 5) = .getElementsByClassName("hover_inner")(0).innerText
                    End If
                End With
            End With
        Next iDIV
    End With

End Sub

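While the other .getElementsByXXXX methods can readily retrieve collections from within another element, getElementsByClassName needs to work on what it believes to be an entire HTMLDocument, even if you have had to fool it into believing that.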
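
The sub-document trick above can be wrapped in a small helper. This is only a sketch: the function name FirstInnerTextByClass and its parameters are hypothetical (they do not appear in the answer), and it assumes a reference to the Microsoft HTML Object Library, as the code above does.

```vba
' Hypothetical helper, not part of the original answer.
' Returns the innerText of the first descendant of parentEl that has the
' given class, or an empty string when there is no match.
' Assumes a reference to "Microsoft HTML Object Library".
Public Function FirstInnerTextByClass(ByVal parentEl As Object, _
                                      ByVal className As String) As String
    Dim subDoc As HTMLDocument
    Dim matches As Object

    ' Load only this element's markup into a scratch document so that
    ' getElementsByClassName is called on a document, not on an element.
    Set subDoc = New HTMLDocument
    subDoc.body.innerHTML = parentEl.outerHTML

    Set matches = subDoc.getElementsByClassName(className)
    If matches.Length > 0 Then
        FirstInnerTextByClass = matches.Item(0).innerText
    End If
End Function
```

Inside the loop above it might be called as Cells(nr, 5) = FirstInnerTextByClass(.getElementsByClassName("post_glass post_micro_glass")(iDIV), "hover_inner"). Returning the text rather than the collection keeps the scratch document's lifetime inside the function.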


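Thank you so much! I didn't know getElementsByClassName was special. Really, I admire you! - Soborubang
MDN says: "You may also call getElementsByClassName() on any element; it will return only elements which are descendants of the specified root element with the given class names." I'm pretty sure I have used it that way in IE before... - Tim Williams
That said, it doesn't work for me in VBA unless it is called on the "document" object. ;-[ - Tim Williams
@TimWilliams - The MSDN documentation linked above contains some errors. It says the method has a return value of type Element, when it should really be something like an HtmlElementCollection object. Incidentally, I have seen several examples of VBA HTMLDocument methods whose behavior doesn't quite match what the standards prescribe (whereas JavaScript executes them correctly). - user4039065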

1

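Here is an alternative approach. It is very similar to the original code but uses querySelectorAll to select the relevant span elements. An important point for this approach is that vr must be declared as a specific element type and not as an IHTMLElement or a generic Object: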

Option Explicit

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    ' Changed from generic Object to specific type - not
    ' strictly necessary to do this
    Dim XMLHTTPReq As MSXML2.XMLHTTP60
    Dim htmlDoc As HTMLDocument

    ' These declarations weren't included in the original code
    Dim i As Integer
    Dim varTemp As Object
    ' IMPORTANT: vr must be declared as a specific element type and not
    ' as an IHTMLElement or generic Object
    Dim vr As HTMLDivElement
    Dim varTemp2 As Object

    Dim postURL As String

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    ' Changed from XMLHTTP to XMLHTTP60 as XMLHTTP is equivalent
    ' to the older XMLHTTP30
    Set XMLHTTPReq = New MSXML2.XMLHTTP60

    With XMLHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    Set htmlDoc = New HTMLDocument
    With htmlDoc
        .body.innerHTML = XMLHTTPReq.responseText
    End With

    i = 0

    Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

    For Each vr In varTemp
        ' the next line is important to solve this issue *1
        Cells(1, 1) = vr.outerHTML

        Set varTemp2 = vr.querySelectorAll("span.post_date")
        Cells(i + 1, 3) = varTemp2.Item(0).innerText

        Set varTemp2 = vr.getElementsByClassName("hover_inner")
        ' incorporating correction from Jeeped's comment (#56349646)
        Cells(i + 1, 4) = varTemp2.Item(0).innerText

        i = i + 1
    Next vr

End Sub

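Notes:

  • XMLHTTP is equivalent to XMLHTTP30, as described here
  • The apparent need to declare a specific element type was explored in this question but, unlike getElementsByClassName, querySelectorAll doesn't exist in any version of IHTMLElement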
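
One of the comments on the question reports that querySelectorAll works when called on the HTMLDocument instance itself. Under that assumption, a descendant selector on the document may avoid the per-element call entirely. The following is a rough sketch, not a tested solution: the Sub name ListPostDatesWithSelector is hypothetical, the selector assumes the page markup shown in the question, and references to the Microsoft HTML Object Library and Microsoft XML, v6.0 are assumed.

```vba
Option Explicit

' Hypothetical sketch, not taken from either answer.
' Assumes references to "Microsoft HTML Object Library" and
' "Microsoft XML, v6.0", and the markup shown in the question.
Public Sub ListPostDatesWithSelector()

    Dim req As MSXML2.XMLHTTP60
    Dim htmlDoc As HTMLDocument
    Dim dates As Object
    Dim i As Long

    Set req = New MSXML2.XMLHTTP60
    req.Open "GET", "http://foodffs.tumblr.com/archive/2015/11", False
    req.Send

    Set htmlDoc = New HTMLDocument
    htmlDoc.body.innerHTML = req.responseText

    ' Descendant selector called on the document: every span.post_date
    ' that sits inside a DIV carrying both post classes.
    Set dates = htmlDoc.querySelectorAll( _
        "div.post_glass.post_micro_glass span.post_date")

    ' An index loop is used here as a conservative choice instead of
    ' For Each over the selector result.
    For i = 0 To dates.Length - 1
        Cells(i + 1, 3) = dates.Item(i).innerText
    Next i

End Sub
```

Because the call is made on the document rather than on an element, this variant does not depend on how vr is declared.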
