Scrape all content from div tags with a specific class

Question

15

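I am scraping all the text on a website that sits inside a specific <div> class. In the following example, I want to extract everything that is in a <div> with class="a".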

site <- "<div class='a'>Hello, world</div>
  <div class='b'>Good morning, world</div>
  <div class='a'>Good afternoon, world</div>"

My expected output is...

"Hello, world"
"Good afternoon, world"

The code below extracts the text from every <div>, but I can't figure out how to include only the ones with class="a".

library(tidyverse)
library(rvest)

site %>% 
  read_html() %>% 
  html_nodes("div") %>% 
  html_text()

# [1] "Hello, world"          "Good morning, world"   "Good afternoon, world"

With Python's BeautifulSoup, the code might look something like this: site.find_all("div", class_="a").

- Andrew Brēza

2 Answers

6

site %>% 
  read_html() %>% 
  html_nodes(xpath = '//*[@class="a"]') %>% 
  html_text()
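
Note that //* matches any element, not only <div>. The sketch below illustrates the difference on hypothetical markup (site2 and its <span> are assumptions added for illustration, not part of the question); it is a minimal example, not a claim about what the real page contains.

```r
library(rvest)

# Hypothetical markup: a <span> shares class "a" with the <div> elements
site2 <- "<div class='a'>Hello, world</div>
  <span class='a'>Not a div</span>
  <div class='a'>Good afternoon, world</div>"

page <- read_html(site2)

# //* selects every element whose class attribute is exactly "a"
any_tag <- page %>%
  html_nodes(xpath = '//*[@class="a"]') %>%
  html_text()

# //div restricts the match to <div> elements
div_only <- page %>%
  html_nodes(xpath = '//div[@class="a"]') %>%
  html_text()
```

Here any_tag also contains the <span> text, while div_only holds only the two <div> texts. On the markup in the question, which contains only <div> elements, both expressions return the same result.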

- DJack

- neilfws · Accepted Answer

The CSS selector for a div with class = "a" is div.a:

site %>% 
  read_html() %>% 
  html_nodes("div.a") %>% 
  html_text()

Or you can use XPath:

html_nodes(xpath = "//div[@class='a']")
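
The two forms are not fully equivalent when an element carries more than one class: div.a matches any <div> whose class list contains the token a, whereas @class='a' compares the whole attribute string. The following is a minimal sketch on hypothetical markup (site3 and the class names highlight and ab are assumptions for illustration); the concat/normalize-space expression is a common XPath idiom for token matching, shown here as one possible approach.

```r
library(rvest)

# Hypothetical markup: one <div> has two classes, one has a look-alike class
site3 <- "<div class='a'>Hello, world</div>
  <div class='a highlight'>Good afternoon, world</div>
  <div class='ab'>Good evening, world</div>"

page <- read_html(site3)

# CSS: matches every <div> whose class list includes the token "a"
css_hits <- page %>% html_nodes("div.a") %>% html_text()

# XPath, exact comparison: matches only class attributes equal to "a"
exact_hits <- page %>%
  html_nodes(xpath = "//div[@class='a']") %>%
  html_text()

# XPath, token comparison: pads the class list with spaces, then looks for " a "
token_xpath <- "//div[contains(concat(' ', normalize-space(@class), ' '), ' a ')]"
token_hits <- page %>% html_nodes(xpath = token_xpath) %>% html_text()
```

With this markup, css_hits and token_hits contain the first two texts, and exact_hits contains only the first. In rvest 1.0.0 and later, html_elements() is the preferred name for html_nodes() and accepts the same css and xpath arguments.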