Opening web pages with Python and BeautifulSoup

Question


python python-2.7 web-scraping beautifulsoup

9

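How do I use BeautifulSoup to open another page from a list of links? I followed this tutorial, but it doesn't explain how to open another page from the list. Also, how do I open an "a href" that is nested inside a class?

Here is my code: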

# coding: utf-8

import requests
from bs4 import BeautifulSoup

r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")

for link in soup.find_all("a"):
    print link.get("href")

for link in soup.find_all("a"):
    print link.text

for link in soup.find_all("a"):
    print link.text, link.get("href")

g_data = soup.find_all("div", {"class":"listing__left-column"})

for item in g_data:
    print item.contents

for item in g_data:
    print item.contents[0].text
    print link.get('href')

for item in g_data:
    print item.contents[0]

I'm trying to collect the href from each business's title, then open each one and scrape its data.

- Brendan Cott

3

First, I don't understand what you're asking. Second, you might want to take a look at this documentation. - Remi Guan

1

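You need to tell us which page you want to scrape. Something like r = requests.get("http://www.yellowpages.com/") is needed. - Martin Evans

I should have explained in more detail: what I want to do is open the hrefs inside the div. http://puu.sh/kmgxZ/15fc324654.png I want to take the href of each entry that has a link, open that page, and then start scraping. - Brendan Cott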

2 Answers

1

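I had the same problem and want to share what I found. I tried the accepted answer, but for some reason it didn't work for me; after some research I found something interesting.
You may need to target the element that contains the "href" link itself. In your case you need the exact class that wraps the href link, which I believe is "class": "listing__left-column", and assign the result to a variable, for example "all":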

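import requests
from bs4 import BeautifulSoup

# Fetch and parse the page first (replace the placeholder with your URL)
r = requests.get("<enter your URL here>")
soup = BeautifulSoup(r.content, "html.parser")

# Note: the name "all" shadows Python's built-in all()
all = soup.find_all("div", {"class": "listing__left-column"})
for item in all:
    # Look for links only inside each matching div
    for link in item.find_all("a"):
        if 'href' in link.attrs:
            a = link.attrs['href']
            print(a)
            print("")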

After doing this, I was able to open another link embedded in the first page.
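
To make the idea of searching inside each container concrete, here is a minimal sketch written for Python 3. The HTML snippet, the business names and the helper name listing_links are all made up for illustration; only the class "listing__left-column" comes from the question. It parses a string instead of a live page, so it can be run without a URL.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page (assumption, not the actual site markup)
html = """
<div class="listing__left-column">
  <h3><a href="/biz/alpha">Alpha Plumbing</a></h3>
  <a name="anchor-only">no href</a>
</div>
<div class="sidebar"><a href="/ads">Ad</a></div>
<div class="listing__left-column">
  <h3><a href="/biz/beta">Beta Bakery</a></h3>
</div>
"""

def listing_links(markup):
    """Return (text, href) pairs for links found inside the listing divs only."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    for item in soup.find_all("div", class_="listing__left-column"):
        # find_all("a") on the div returns <a> tags at any depth below it
        for link in item.find_all("a"):
            if link.has_attr("href"):
                links.append((link.get_text(strip=True), link["href"]))
    return links

for text, href in listing_links(html):
    print(text, href)
```

The link in the "sidebar" div and the anchor without an href are both skipped, because the search is scoped to the listing divs and each tag is checked for an href attribute.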

- derek

Hi there - where do you put the URL!? - zero

r = requests.get("<enter your URL here>") - derek

- Martin Evans · Accepted Answer

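I'm still not sure where you are getting the HTML from, but if you want to extract all of the href attributes then, based on the image you posted, the following approach should work: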

import requests
from bs4 import BeautifulSoup

r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

Adding href=True to find_all() ensures that only a elements with an href attribute are returned, so there is no need to test whether the attribute exists.
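
As a small illustration of what href=True changes, here is a sketch for Python 3 that runs against a made-up HTML string rather than a real page. The markup and paths are assumptions; the class name "listing-name" is the one used above.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
html = """
<div class="listing__left-column">
  <a class="listing-name" href="/biz/1">First business</a>
  <a class="listing-name">Name without a link</a>
  <a class="other" href="/about">About</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Without href=True: every <a> with the class, whether or not it has an href
all_named = soup.find_all("a", class_="listing-name")

# With href=True: only those <a> tags that actually carry an href attribute
with_href = soup.find_all("a", class_="listing-name", href=True)

print(len(all_named))
print([a_tag["href"] for a_tag in with_href])
```

With the filter in place, indexing a_tag["href"] is safe; without it, the second tag in this snippet would raise a KeyError.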

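Be warned that some websites will lock you out after one or two attempts, because they can detect that you are accessing the site via a script rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you get back to make sure it is still what you expect.

If you then want to get the HTML for each of the links, the following could be used: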
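
One way to automate that sanity check is sketched below for Python 3. The helper name response_looks_ok and the idea of using "listing-name" as a marker are assumptions for illustration; pick whatever text you know must appear on a good page.

```python
def response_looks_ok(status_code, html, marker="listing-name"):
    """Return True if the status is 200 and the expected marker text is in the HTML.

    The marker is an assumption: a string that should appear on a normal
    results page but is unlikely to appear on a block or captcha page.
    """
    return status_code == 200 and marker in html


# Example with hard-coded values instead of a live request
good_page = '<a class="listing-name" href="/biz/1">First business</a>'
blocked_page = "<html><body>Access denied</body></html>"

print(response_looks_ok(200, good_page))
print(response_looks_ok(200, blocked_page))
print(response_looks_ok(403, good_page))
```

With requests you would pass r.status_code and r.text, and print the start of r.text whenever the check fails so you can see what the site actually sent back.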

import requests
from bs4 import BeautifulSoup

# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60      # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content

Tested with Python 2.x; for Python 3.x, add parentheses to the print statements.
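
Concatenating base_url and the href works when every href is a root-relative path such as "/biz/1". If the page mixes relative and absolute links, the standard library's urljoin is a safer way to build the request URL. This is a sketch for Python 3 (in Python 2 the function lives in the urlparse module); the example hrefs and the other.example.com host are made up.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from (same placeholder site as above)
page_url = "http://www.mywebsite.com/search/"

# Made-up href values covering the common cases
hrefs = [
    "/biz/alpha",                      # root-relative path
    "details?id=2",                    # relative to the current page
    "http://other.example.com/page",   # already absolute
]

# urljoin resolves each href against the page it was found on
full_urls = [urljoin(page_url, href) for href in hrefs]

for url in full_urls:
    print(url)
```

Each resulting URL can then be passed to requests.get() in place of base_url + a_tag['href'].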