Beautiful Soup doesn't work on this website.

I want to scrape the URLs of all the items in the table, but my attempts return nothing. The code is very basic, so I can understand why it might not work. However, even an attempt to scrape the title of this website returns nothing. I would expect at least the h1 tag to show up, since it sits outside the table...
Website: https://www.vanguard.com.au/personal/products/en/overview
import requests
from bs4 import BeautifulSoup


lists = []
url = 'https://www.vanguard.com.au/personal/products/en/overview'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# Both of these come back empty on this site: the elements are
# added by JavaScript after the initial HTML is served.
title = soup.find_all('h1', class_='heading2 gbs-font-vanguard-red')
for links in soup.find_all('a', style='padding-bottom: 1px;'):
    link_text = links['href']
    lists.append(link_text)

print(title)
print(lists)
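A quick way to confirm the diagnosis given in the comments below is to check whether the class being targeted appears anywhere in the raw HTML that requests receives. A minimal check (not from the original post):

import requests

url = 'https://www.vanguard.com.au/personal/products/en/overview'
r = requests.get(url)

# If this prints False, the h1 is added later by JavaScript,
# so BeautifulSoup will never find it in r.content.
print('gbs-font-vanguard-red' in r.text)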

The most common problem: this page uses JavaScript to add elements, but requests/BeautifulSoup can't run JavaScript. You may need Selenium to control a real web browser, which can run JavaScript. - furas
Or you can try using DevTools in Chrome/Firefox (tab Network, filter XHR) to find the URL which JavaScript uses to get the data, and use that URL with requests. JavaScript may receive the data as JSON, so BeautifulSoup may not be needed at all. - furas
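That second suggestion looks roughly like the sketch below; the URL here is a placeholder for whatever endpoint you find in the Network tab (the last answer in this thread shows the real one for this site):

import requests

# Placeholder: substitute the XHR URL found in DevTools.
xhr_url = 'https://www.example.com/some/data.json'

# The endpoint returns JSON directly, so no HTML parsing is needed.
data = requests.get(xhr_url).json()
print(data)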
3 Answers

If the problem is caused by JavaScript adding the content dynamically, I suggest you use beautifulsoup together with selenium to scrape this website. So let's use selenium to send the request and get the page source, then use beautifulsoup to parse it.
Also, you should use title = soup.find() instead of title = soup.find_all() to get just the single title.
Here is a code example using Firefox:
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup


url = 'https://www.vanguard.com.au/personal/products/en/overview'
browser = webdriver.Firefox(executable_path=GeckoDriverManager().install())
browser.get(url)

soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()

lists = []
title = soup.find('h1', class_='heading2 gbs-font-vanguard-red')
for links in soup.find_all('a', style='padding-bottom: 1px;'):
    link_text = links['href']
    lists.append(link_text)

print(title)
print(lists)

Output:

<h1 class="heading2 gbs-font-vanguard-red">Investment products</h1>
['/personal/products/en/detail/8132', '/personal/products/en/detail/8219', '/personal/products/en/detail/8121',...,'/personal/products/en/detail/8217']
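A side note on the code above: the executable_path argument was removed in Selenium 4, so on a recent install the equivalent setup goes through a Service object. A sketch, assuming selenium >= 4 and webdriver-manager are installed:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager

# Selenium 4 style: wrap the driver path in a Service object
# instead of passing executable_path directly.
browser = webdriver.Firefox(service=Service(GeckoDriverManager().install()))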

I got an error: Could not get version for google-chrome with the command ... Current google-chrome version is UNKNOWN. Get LATEST chromedriver version for UNKNOWN google-chrome. - turtle69
Have you downloaded ChromeDriver, or are you using webdriver manager like me? - JayPeerachai
Have you tried this? from webdriver_manager.chrome import ChromeDriverManager and then driver = webdriver.Chrome(ChromeDriverManager().install()) - JayPeerachai
I've updated my code but it still doesn't work: from selenium import webdriver; from webdriver_manager.chrome import ChromeDriverManager; from bs4 import BeautifulSoup; url = 'https://www.vanguard.com.au/personal/products/en/overview'; browser = webdriver.Firefox(executable_path=ChromeDriverManager().install()); browser.get(url) - turtle69
@turtle69 ChromeDriverManager() works only with Chrome() - furas
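The mismatch called out above is the whole problem: the manager has to match the browser class. A minimal sketch of the two valid pairings (assuming webdriver-manager is installed, Selenium 3 style as used in this thread):

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager

# Chrome pairs with ChromeDriverManager...
browser = webdriver.Chrome(executable_path=ChromeDriverManager().install())

# ...and Firefox pairs with GeckoDriverManager; mixing them fails as above.
# browser = webdriver.Firefox(executable_path=GeckoDriverManager().install())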

The most common problem (with many modern pages): this page uses JavaScript to add elements, but requests/BeautifulSoup can't run JavaScript.
You may need to use Selenium to control a real web browser, which can run JavaScript.
This example uses only Selenium, without BeautifulSoup.
I used xpath, but you could also use css selectors.
from selenium import webdriver
from selenium.webdriver.common.by import By
             
url = 'https://www.vanguard.com.au/personal/products/en/overview'

lists = []

#driver = webdriver.Chrome(executable_path="/path/to/chromedriver.exe")
driver = webdriver.Firefox(executable_path="/path/to/geckodriver.exe")
driver.get(url)

title = driver.find_element(By.XPATH, '//h1[@class="heading2 gbs-font-vanguard-red"]')
print(title.text)

all_items = driver.find_elements(By.XPATH, '//a[@style="padding-bottom: 1px;"]')

for links in all_items:
    link_text = links.get_attribute('href')
    print(link_text)
    lists.append(link_text)


I got an error: selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. - turtle69
You need to put the /full/path/to/chromedriver.exe in Chrome(), or add the folder containing chromedriver.exe to the system variable PATH. On Linux I have a folder ~/bin in PATH and keep all executables there, so I don't have to pass paths to Chrome()/Firefox(). - furas
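In code, the two fixes described above look like this; the path is a placeholder for wherever the driver was actually saved:

from selenium import webdriver

# Option 1: pass the full driver path explicitly (placeholder path).
driver = webdriver.Chrome(executable_path="/full/path/to/chromedriver.exe")

# Option 2: add the driver's folder to the PATH environment variable,
# after which the bare call is enough:
# driver = webdriver.Chrome()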


It is always more efficient to get the data straight from the source than through Selenium. It looks like the links are built from the portId.

import pandas as pd
import requests


url = 'https://www3.vanguard.com.au/personal/products/funds.json'
payload = {
    'context': '/personal/products/',
    'countryCode': 'au.ret',
    'paths': "[[['funds','legacyFunds'],'AU']]",
    'method': 'get'}

jsonData = requests.get(url, params=payload).json()

results = jsonData['jsonGraph']['funds']['AU']['value']


df1 = pd.json_normalize(results, record_path=['children'])
df2 = pd.json_normalize(results, record_path=['listings'])


df = pd.concat([df1, df2], axis=0)
df['url_link'] = 'https://www.vanguard.com.au/personal/products/en/detail/' + df['portId'] + '/Overview'

Output:

print(df[['fundName', 'url_link']])
                                             fundName                                           url_link
0         Vanguard Active Emerging Market Equity Fund  https://www.vanguard.com.au/personal/products/...
1             Vanguard Active Global Credit Bond Fund  https://www.vanguard.com.au/personal/products/...
2                  Vanguard Active Global Growth Fund  https://www.vanguard.com.au/personal/products/...
3   Vanguard Australian Corporate Fixed Interest I...  https://www.vanguard.com.au/personal/products/...
4       Vanguard Australian Fixed Interest Index Fund  https://www.vanguard.com.au/personal/products/...
..                                                ...                                                ...
23  Vanguard MSCI Australian Small Companies Index...  https://www.vanguard.com.au/personal/products/...
24  Vanguard MSCI Index International Shares (Hedg...  https://www.vanguard.com.au/personal/products/...
25       Vanguard MSCI Index International Shares ETF  https://www.vanguard.com.au/personal/products/...
26  Vanguard MSCI International Small Companies In...  https://www.vanguard.com.au/personal/products/...
27  Vanguard International Credit Securities Hedge...  https://www.vanguard.com.au/personal/products/...

[66 rows x 2 columns]
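To recover the plain list of links the question originally asked for, one extra line on top of the DataFrame above is enough:

# Assuming df from the code above: collect the generated links into a list.
lists = df['url_link'].tolist()
print(lists[:3])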
