I use VBA as a tool to scrape some websites, using XMLHTTP and HTMLDocument (because it is faster than InternetExplorer.Application).
Public Sub XMLhtmlDocumentHTMLSourceScraper()
    Dim XMLHTTPReq As Object
    Dim htmlDoc As HTMLDocument
    Dim postURL As String
    postURL = "http://foodffs.tumblr.com/archive/2015/11"
    Set XMLHTTPReq = New MSXML2.XMLHTTP
    With XMLHTTPReq
        .Open "GET", postURL, False
        .Send
    End With
    Set htmlDoc = New HTMLDocument
    With htmlDoc
        .body.innerHTML = XMLHTTPReq.responseText
    End With
    i = 0
    Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")
    For Each vr In varTemp
        ' the next line is important for investigating this issue (*1)
        Cells(1, 1) = vr.outerHTML
        Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date")
        Cells(i + 1, 3) = varTemp2.Item(0).innerText
        ' the next line raises error 438
        Set varTemp2 = vr.getElementsByClassName("hover_inner")
        Cells(i + 1, 4) = varTemp2.innerText
        i = i + 1
    Next vr
End Sub
I tried to investigate this with the line marked *1. Cells(1, 1) shows me the following:
<DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank>
<DIV class=hover_inner><SPAN class=post_date>...............
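Yes, all the class attributes have lost their quotation marks; only the class of the element returned by the first call still has them.
I really don't know why this happens.
I can parse it with getElementsByTagName("span"), but I would prefer to select by class.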
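For the getElementsByTagName("span") route mentioned above, note that getElementsByTagName takes only a tag name, so a string such as "SPAN class=post_date" will not filter by class. The unquoted attributes in the outerHTML dump are most likely just how the legacy MSHTML engine serializes attribute values without spaces, not a sign that the class values changed. A minimal sketch of a class filter follows; FirstSpanWithClass is a hypothetical helper name, not part of MSHTML.

```vba
' Hypothetical helper: returns the first <span> under parent whose class
' attribute contains the token className, or Nothing if none is found.
Private Function FirstSpanWithClass(ByVal parent As Object, _
                                    ByVal className As String) As Object
    Dim candidate As Object

    ' getElementsByTagName accepts a tag name only, so filter by class here.
    For Each candidate In parent.getElementsByTagName("span")
        ' Pad with spaces so "post_date" does not match "post_date_extra".
        If InStr(1, " " & candidate.className & " ", _
                 " " & className & " ", vbBinaryCompare) > 0 Then
            Set FirstSpanWithClass = candidate
            Exit Function
        End If
    Next candidate
End Function

' Possible use inside the loop from the question (vr is one post element):
'     Dim dateSpan As Object
'     Set dateSpan = FirstSpanWithClass(vr, "post_date")
'     If Not dateSpan Is Nothing Then Cells(i + 1, 3) = dateSpan.innerText
```

The Nothing check matters because a post without a matching span would otherwise raise an error on .innerText.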
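Cells(i + 1, 4) = varTemp2.innerText should be Cells(i + 1, 4) = varTemp2(0).innerText. Even if getElementsByClassName returns a collection containing only one element, it is still a collection, not a single object. - user4039065
querySelectorAll on an HTMLDocument instance works for me. - Tim Williams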
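Putting the two comments together, one possible rewrite is sketched below. It assumes a reference to the Microsoft HTML Object Library, assumes that querySelectorAll is available in the document mode MSHTML picks on the machine, and assumes that every post contains exactly one post_date and one hover_inner element so that the two lists line up by index. The Sub name and variable names are illustrative only.

```vba
' Requires a reference to "Microsoft HTML Object Library" (Tools > References).
Public Sub ScrapeArchiveWithSelectors()
    Dim request As Object
    Dim htmlDoc As MSHTML.HTMLDocument
    Dim dates As MSHTML.IHTMLDOMChildrenCollection
    Dim inners As MSHTML.IHTMLDOMChildrenCollection
    Dim rowCount As Long
    Dim i As Long

    ' Fetch the page synchronously, as in the question.
    Set request = CreateObject("MSXML2.XMLHTTP")
    request.Open "GET", "http://foodffs.tumblr.com/archive/2015/11", False
    request.send

    Set htmlDoc = New MSHTML.HTMLDocument
    htmlDoc.body.innerHTML = request.responseText

    ' Call querySelectorAll on the document itself, using descendant selectors.
    Set dates = htmlDoc.querySelectorAll(".post_micro_glass .post_date")
    Set inners = htmlDoc.querySelectorAll(".post_micro_glass .hover_inner")

    ' Assumption: both lists are in document order, one entry per post.
    rowCount = dates.Length
    If inners.Length < rowCount Then rowCount = inners.Length

    ' Each result is a collection, so read a single element by index.
    For i = 0 To rowCount - 1
        Cells(i + 1, 3).Value = dates.Item(i).innerText
        Cells(i + 1, 4).Value = inners.Item(i).innerText
    Next i
End Sub
```

Looping by index with Length and Item(i) keeps every access on a single element, which is the same fix as writing varTemp2(0).innerText instead of varTemp2.innerText.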