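I'm trying to build a data frame of color ID, description, and date from this website, which takes a day and a month through drop-down menus and returns a page generated dynamically by JavaScript. I'm new to this and thought it would make a fun toy project. I'd like to automate the drop-down selection with RSelenium and scrape the generated content with rvest. The data frame I'm hoping to end up with is structured like this: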
description, date, meta
"paragraph about birthday", Jun 01, "DAFFODIL PANTONE 17-1512 POWERFUL KNOWING EXPRESSIVE"
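I'm starting with a for loop that goes through one day in each month of the year, and then plan to work up to every day of every month. But I'm stuck on just getting the loop to step through each month and grab the content. I'd appreciate some conceptual help with this part of the task; any insight is welcome!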
library(RSelenium)
library(rvest)
library(tidyverse)
library(xml2)
## first run: docker run -d -p 4445:4444 selenium/standalone-chrome
## open a new connection to Chrome
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$navigate("https://www.pantone.com/pages/iphone/iphone_colorstrology.html#___1__") #Entering our URL gets the browser to navigate to the page
remDr$screenshot(display = TRUE)
#### create list of month/days
month_day <- read_html(remDr$getPageSource()[[1]])
page_i <- month_day %>%
  html_nodes(".list") %>%
  html_children() %>%
  html_text()
months <- page_i[1:12]
months <- paste0("'", months, "'")  # wrap each month name in single quotes for the XPath
days <- page_i[13:43]
days <- as.numeric(days)
## create an object for month xpath elements
## paste0() is vectorised, so this yields one XPath per month without a loop
elements <- paste0("//option[contains(text(),", months, ")]")
## attempt at loop
total <- data.frame()
for (e in elements) {
  remDr$navigate("https://www.pantone.com/pages/iphone/iphone_colorstrology.html#___1__")
  print(e)
  month <- remDr$findElement(using = 'xpath', e)
  month$clickElement()
  day <- remDr$findElement(using = 'xpath', "//select[@id='lstDay']//option[5]") ## arbitrarily picking the 5th of each month
  day$clickElement()
  submit <- remDr$findElement(using = 'xpath', "/html[1]/body[1]/form[1]/div[1]/a[1]")
  submit$clickElement()
  html <- read_html(remDr$getPageSource()[[1]])
  description <- html %>% html_nodes(xpath = "//tr//tr[2]//td[1]") %>% html_text() %>% gsub("^\\s+|\\s+$", "", .)
  meta <- html %>% html_nodes(xpath = "//td[@id='tdBg']") %>% html_text() %>% gsub("^\\s+|\\s+$", "", .)
  date <- html %>% html_nodes(xpath = "//td[@id='bgHeaderDate']//div") %>% html_text() %>% gsub("^\\s+|\\s+$", "", .)
  df <- data.frame(cbind(description, meta, date))
  total <- rbind(total, df)
}
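I don't get any errors, but the results are unexpected every time. Sometimes a single month/day combination is repeated, e.g. Jan 05 twelve times, or Jan 05 three times, then Apr 05 three times, and so on.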
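One possible explanation, offered as a guess rather than a confirmed diagnosis, is a timing problem: the page source may be read before the JS-generated result has replaced the previous one, and because the URL ends in a #fragment, navigate() to the same address may not actually reload the form. The sketch below is untested against the site. It reuses the XPaths from the code above and adds three hypothetical helpers (build_targets, wait_for_date, scrape_day). It also assumes that the month options contain the three-letter month abbreviation and that option[n] in lstDay corresponds to day n.

```r
# Pure helper: one row per calendar day plus the XPaths needed to select it.
# 2020 is used only because it is a leap year, so Feb 29 is included.
build_targets <- function(year = 2020) {
  dates <- seq(as.Date(sprintf("%d-01-01", year)),
               as.Date(sprintf("%d-12-31", year)), by = "day")
  month_idx <- as.integer(format(dates, "%m"))
  day_idx <- as.integer(format(dates, "%d"))
  data.frame(
    month = month.abb[month_idx],
    day = day_idx,
    # Assumption: the option text contains the month abbreviation (e.g. 'Jan')
    month_xpath = sprintf("//option[contains(text(),'%s')]", month.abb[month_idx]),
    # Assumption: the n-th option in lstDay is day n
    day_xpath = sprintf("//select[@id='lstDay']//option[%d]", day_idx),
    stringsAsFactors = FALSE
  )
}

# Poll until the date header shows non-empty text that differs from the
# previous result, instead of reading the page source immediately.
wait_for_date <- function(remDr, previous = "", timeout = 10) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    txt <- tryCatch({
      els <- remDr$findElements(using = "xpath", "//td[@id='bgHeaderDate']//div")
      if (length(els) > 0) trimws(els[[1]]$getElementText()[[1]]) else ""
    }, error = function(e) "")
    if (nzchar(txt) && !identical(txt, previous)) return(txt)
    Sys.sleep(0.25)
  }
  stop("Timed out waiting for the result page")
}

# Select one month/day, submit, wait for the new result, then scrape it.
scrape_day <- function(remDr, target, url, previous = "") {
  remDr$navigate(url)
  remDr$refresh()  # guess: force a real reload, since the URL has a #fragment
  remDr$findElement(using = "xpath", target$month_xpath)$clickElement()
  remDr$findElement(using = "xpath", target$day_xpath)$clickElement()
  remDr$findElement(using = "xpath", "/html[1]/body[1]/form[1]/div[1]/a[1]")$clickElement()
  date <- wait_for_date(remDr, previous)
  html <- xml2::read_html(remDr$getPageSource()[[1]])
  # Collapse each match to a single trimmed string so every call returns one row
  grab <- function(xp) {
    paste(trimws(rvest::html_text(rvest::html_nodes(html, xpath = xp))), collapse = " ")
  }
  data.frame(description = grab("//tr//tr[2]//td[1]"),
             meta = grab("//td[@id='tdBg']"),
             date = date,
             stringsAsFactors = FALSE)
}
```

With an open remDr and the same URL as above, the loop would then be something like: targets <- build_targets(); prev <- ""; rows <- list(); for (i in seq_len(nrow(targets))) { rows[[i]] <- scrape_day(remDr, targets[i, ], url, prev); prev <- rows[[i]]$date }; total <- do.call(rbind, rows). Collecting rows in a list and binding once at the end avoids growing the data frame inside the loop.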