Beautiful Soup doesn't work on this website.

I want to scrape the URLs of all the items in the table, but my attempts return nothing. The code is very basic, so I can understand why it might not work. However, even an attempt to scrape the title of this website returns nothing. I would expect at least the h1 tag to show up, since it sits outside the table...
Website: https://www.vanguard.com.au/personal/products/en/overview
import requests
from bs4 import BeautifulSoup


lists = []
url = 'https://www.vanguard.com.au/personal/products/en/overview'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# Both of these come back empty on this site: the elements are
# added by JavaScript after the initial HTML is served.
title = soup.find_all('h1', class_='heading2 gbs-font-vanguard-red')
for links in soup.find_all('a', style='padding-bottom: 1px;'):
    link_text = links['href']
    lists.append(link_text)

print(title)
print(lists)
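A quick way to confirm the diagnosis given in the comments below is to check whether the class being targeted appears anywhere in the raw HTML that requests receives. A minimal check (not from the original post):

import requests

url = 'https://www.vanguard.com.au/personal/products/en/overview'
r = requests.get(url)

# If this prints False, the h1 is added later by JavaScript,
# so BeautifulSoup will never find it in r.content.
print('gbs-font-vanguard-red' in r.text)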

The most common problem: this page uses JavaScript to add elements, but requests/BeautifulSoup can't run JavaScript. You may need Selenium to control a real web browser, which can run JavaScript. - furas
Or you can try using DevTools in Chrome/Firefox (tab Network, filter XHR) to find the URL which JavaScript uses to get the data, and use that URL with requests. JavaScript may receive the data as JSON, so BeautifulSoup may not be needed at all. - furas
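That second suggestion looks roughly like the sketch below; the URL here is a placeholder for whatever endpoint you find in the Network tab (the last answer in this thread shows the real one for this site):

import requests

# Placeholder: substitute the XHR URL found in DevTools.
xhr_url = 'https://www.example.com/some/data.json'

# The endpoint returns JSON directly, so no HTML parsing is needed.
data = requests.get(xhr_url).json()
print(data)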
3 Answers

If the problem is caused by JavaScript adding the content dynamically, I suggest you use beautifulsoup together with selenium to scrape this website. So let's use selenium to send the request and get the page source, then use beautifulsoup to parse it.
Also, you should use title = soup.find() instead of title = soup.find_all() to get just the single title.
Here is a code example using Firefox:
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup


url = 'https://www.vanguard.com.au/personal/products/en/overview'
browser = webdriver.Firefox(executable_path=GeckoDriverManager().install())
browser.get(url)

soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()

lists = []
title = soup.find('h1', class_='heading2 gbs-font-vanguard-red')
for links in soup.find_all('a', style='padding-bottom: 1px;'):
    link_text = links['href']
    lists.append(link_text)

print(title)
print(lists)

Output:

<h1 class="heading2 gbs-font-vanguard-red">Investment products</h1>
['/personal/products/en/detail/8132', '/personal/products/en/detail/8219', '/personal/products/en/detail/8121',...,'/personal/products/en/detail/8217']
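A side note on the code above: the executable_path argument was removed in Selenium 4, so on a recent install the equivalent setup goes through a Service object. A sketch, assuming selenium >= 4 and webdriver-manager are installed:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager

# Selenium 4 style: wrap the driver path in a Service object
# instead of passing executable_path directly.
browser = webdriver.Firefox(service=Service(GeckoDriverManager().install()))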

I got an error: Could not get version for google-chrome with the command ... Current google-chrome version is UNKNOWN. Get LATEST chromedriver version for UNKNOWN google-chrome. - turtle69
Have you downloaded ChromeDriver, or are you using webdriver manager like me? - JayPeerachai
Have you tried this? from webdriver_manager.chrome import ChromeDriverManager and then driver = webdriver.Chrome(ChromeDriverManager().install()) - JayPeerachai
I've updated my code but it still doesn't work: from selenium import webdriver; from webdriver_manager.chrome import ChromeDriverManager; from bs4 import BeautifulSoup; url = 'https://www.vanguard.com.au/personal/products/en/overview'; browser = webdriver.Firefox(executable_path=ChromeDriverManager().install()); browser.get(url) - turtle69
@turtle69 ChromeDriverManager() works only with Chrome() - furas
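The mismatch called out above is the whole problem: the manager has to match the browser class. A minimal sketch of the two valid pairings (assuming webdriver-manager is installed, Selenium 3 style as used in this thread):

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager

# Chrome pairs with ChromeDriverManager...
browser = webdriver.Chrome(executable_path=ChromeDriverManager().install())

# ...and Firefox pairs with GeckoDriverManager; mixing them fails as above.
# browser = webdriver.Firefox(executable_path=GeckoDriverManager().install())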

The most common problem (with many modern pages): this page uses JavaScript to add elements, but requests/BeautifulSoup can't run JavaScript.
You may need to use Selenium to control a real web browser, which can run JavaScript.
This example uses only Selenium, without BeautifulSoup.
I used xpath, but you could also use css selectors.
from selenium import webdriver
from selenium.webdriver.common.by import By
             
url = 'https://www.vanguard.com.au/personal/products/en/overview'

lists = []

#driver = webdriver.Chrome(executable_path="/path/to/chromedriver.exe")
driver = webdriver.Firefox(executable_path="/path/to/geckodriver.exe")
driver.get(url)

title = driver.find_element(By.XPATH, '//h1[@class="heading2 gbs-font-vanguard-red"]')
print(title.text)

all_items = driver.find_elements(By.XPATH, '//a[@style="padding-bottom: 1px;"]')

for links in all_items:
    link_text = links.get_attribute('href')
    print(link_text)
    lists.append(link_text)


I got an error: selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. - turtle69
You need to put the /full/path/to/chromedriver.exe in Chrome(), or add the folder containing chromedriver.exe to the system variable PATH. On Linux I have a folder ~/bin in PATH and keep all executables there, so I don't have to pass paths to Chrome()/Firefox(). - furas
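In code, the two fixes described above look like this; the path is a placeholder for wherever the driver was actually saved:

from selenium import webdriver

# Option 1: pass the full driver path explicitly (placeholder path).
driver = webdriver.Chrome(executable_path="/full/path/to/chromedriver.exe")

# Option 2: add the driver's folder to the PATH environment variable,
# after which the bare call is enough:
# driver = webdriver.Chrome()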


It is always more efficient to get the data straight from the source than through Selenium. It looks like the links are built from the portId.

import pandas as pd
import requests


url = 'https://www3.vanguard.com.au/personal/products/funds.json'
payload = {
    'context': '/personal/products/',
    'countryCode': 'au.ret',
    'paths': "[[['funds','legacyFunds'],'AU']]",
    'method': 'get'}

jsonData = requests.get(url, params=payload).json()

results = jsonData['jsonGraph']['funds']['AU']['value']


df1 = pd.json_normalize(results, record_path=['children'])
df2 = pd.json_normalize(results, record_path=['listings'])


df = pd.concat([df1, df2], axis=0)
df['url_link'] = 'https://www.vanguard.com.au/personal/products/en/detail/' + df['portId'] + '/Overview'

Output:

print(df[['fundName', 'url_link']])
                                             fundName                                           url_link
0         Vanguard Active Emerging Market Equity Fund  https://www.vanguard.com.au/personal/products/...
1             Vanguard Active Global Credit Bond Fund  https://www.vanguard.com.au/personal/products/...
2                  Vanguard Active Global Growth Fund  https://www.vanguard.com.au/personal/products/...
3   Vanguard Australian Corporate Fixed Interest I...  https://www.vanguard.com.au/personal/products/...
4       Vanguard Australian Fixed Interest Index Fund  https://www.vanguard.com.au/personal/products/...
..                                                ...                                                ...
23  Vanguard MSCI Australian Small Companies Index...  https://www.vanguard.com.au/personal/products/...
24  Vanguard MSCI Index International Shares (Hedg...  https://www.vanguard.com.au/personal/products/...
25       Vanguard MSCI Index International Shares ETF  https://www.vanguard.com.au/personal/products/...
26  Vanguard MSCI International Small Companies In...  https://www.vanguard.com.au/personal/products/...
27  Vanguard International Credit Securities Hedge...  https://www.vanguard.com.au/personal/products/...

[66 rows x 2 columns]
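To recover the plain list of links the question originally asked for, one extra line on top of the DataFrame above is enough:

# Assuming df from the code above: collect the generated links into a list.
lists = df['url_link'].tolist()
print(lists[:3])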
