Web scraping an interactive aspx website in R

Question

r web-scraping rcurl

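I am trying to scrape a table from an interactive aspx web page. I have read every question on Stack Overflow about web scraping in R, and I think I am close, but I am not quite there.

I would like to extract the data from the table generated here. Ultimately I want to loop over every date range and state option, but my challenge right now is just getting R to submit my parameters and pull the resulting table for any single query.

As far as I can tell, the answer probably involves the RCurl and XML packages: post the "form" with my parameters, then read the HTML of the resulting page.

My latest attempt looks like this: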

library(RCurl)
library(XML)

curl = getCurlHandle()

link = "http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_HabitationWiseLabTesting_S.aspx"

html = getURL(link, curl = curl)

params = list('ctl00$ContentPlaceHolder$ddFinYear' = '2005-2006',
              'ctl00$ContentPlaceHolder$ddState' = 'BIHAR')

html2 = postForm(link, .params = params, curl = curl)

table = readHTMLTable(html2 )

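It is hard for me to tell where things go wrong. On the one hand, html == html2 evaluates to FALSE, so I assume that posting the form changed something in html2. On the other hand, I still cannot tell whether the form was submitted incorrectly, or whether it was submitted correctly and the error occurs when reading the table.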
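One way to narrow this down, sketched here under assumptions rather than tested against the live site: ASP.NET Web Forms pages normally carry hidden inputs such as __VIEWSTATE and __EVENTVALIDATION, and a POST that omits them is often answered with the unchanged form. The snippet below uses a made-up HTML string (sample_html) in place of the real page, and the helper names get_hidden_fields and count_tables are hypothetical. It shows how the hidden fields could be collected from the first response, merged with the dropdown parameters, and how a response could be checked for tables before calling readHTMLTable.

```r
library(XML)

# Hypothetical, trimmed-down stand-in for the HTML returned by the first GET.
# The field values are invented for illustration only.
sample_html <- '
<html><body>
<form id="form1" method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <select name="ctl00$ContentPlaceHolder$ddState">
    <option value="BIHAR">BIHAR</option>
  </select>
</form>
</body></html>'

# Collect every named hidden <input> as a list of name -> value pairs.
get_hidden_fields <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  xp <- "//input[@type='hidden'][@name]"
  field_names  <- xpathSApply(doc, xp, xmlGetAttr, "name")
  field_values <- xpathSApply(doc, xp, xmlGetAttr, "value", "")
  as.list(setNames(field_values, field_names))
}

# Count <table> elements, to tell "no table came back" apart from
# "a table came back but could not be read".
count_tables <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  length(getNodeSet(doc, "//table"))
}

hidden <- get_hidden_fields(sample_html)

# Merge the hidden fields with the dropdown selections from the question.
params <- c(hidden,
            list('ctl00$ContentPlaceHolder$ddFinYear' = '2005-2006',
                 'ctl00$ContentPlaceHolder$ddState'   = 'BIHAR'))

names(params)
count_tables(sample_html)
```

With the real page, the merged list would be what gets passed as .params to postForm. Whether that is enough depends on the site: some pages also expect the submit button's name/value pair or an __EVENTTARGET field, which is not verified here.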

Any suggestions or help would be greatly appreciated. Thanks!

- DaedalusBloom

I tried to access the website http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_HabitationWiseLabTesting_S.aspx, but it seems that the site is no longer available. - Emmanuel Hamel

It looks like the data has moved here: https://ejalshakti.gov.in/IMISReports/Reports/Physical/rpt_RWS_TargetAchievement_S.aspx?Rep=0&RP=Y&APP=IMIS - Artem

1 Answer

- Emmanuel Hamel · Answer 1

I was able to extract the contents of the table with the following code:

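library(RDCOMClient)
library(stringr)

url <- "https://ejalshakti.gov.in/IMISReports/Reports/Physical/rpt_RWS_TargetAchievement_S.aspx?Rep=0&RP=Y&APP=IMIS"

# Start Internet Explorer through COM (Windows only) and open the report page
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)

# Give the page time to load before touching the document
Sys.sleep(5)

# Build a synthetic click event to trigger the submit button
doc <- IEApp$Document()
mouseEvent <- doc$createEvent("MouseEvent")
mouseEvent$initEvent("click", TRUE, FALSE)

# Select the financial year in the dropdown
web_Obj_Date <- doc$getElementByID("ContentPlaceHolder_ddfinyear")
web_Obj_Date[['Value']] <- "2015-2016"

# Click the GO button to submit the form
web_Obj_Submit <- doc$getElementByID("ContentPlaceHolder_btnGO")
web_Obj_Submit$dispatchEvent(mouseEvent)

# Wait for the result page, then read all visible text
Sys.sleep(5)
html_Content <- doc$documentElement()$innerText()

# Keep the text between the report header and the page footer, one line per element
text_Table <- stringr::str_extract_all(string = html_Content, pattern = "Financial Year:((.|\\r\\n)*)Disclaimer and Privacy Policy")[[1]]
strsplit(text_Table, "\r\n")[[1]]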
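The code above returns the table as lines of plain text. If a data frame is preferred, one possible variation is to parse the page's HTML with the XML package that the question already loads. The sketch below is not part of the original answer: sample_html is an invented stand-in with made-up numbers, and the assumption that the HTML can be read from the browser with doc$documentElement()$outerHTML(), by analogy with innerText(), is untested.

```r
library(XML)

# Hypothetical stand-in for the string that outerHTML would return
# from the browser document; the rows and numbers are invented.
sample_html <- '<html><body><table>
<tr><th>State</th><th>Target</th><th>Achievement</th></tr>
<tr><td>BIHAR</td><td>100</td><td>80</td></tr>
<tr><td>ASSAM</td><td>50</td><td>45</td></tr>
</table></body></html>'

# Parse the HTML string and convert every <table> into a data frame.
parsed <- htmlParse(sample_html, asText = TRUE)
tables <- readHTMLTable(parsed, header = TRUE, stringsAsFactors = FALSE)

# readHTMLTable returns a list with one element per table on the page.
result <- tables[[1]]
print(result)
```

On a real report page the list will usually contain several tables (layout tables included), so the right element has to be picked by inspecting the list rather than assuming it is the first one.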