Google Search Web Scraping with Python

Question

python python-2.7 google-search google-search-api

21

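I've been learning Python recently so I can use it on some projects at work.

Right now I need to do some web scraping of Google search results. I found several sites that demonstrate how to search with the Google AJAX API, but after trying it, it appears to no longer be supported. Any suggestions?

I've been searching for quite a while but can't seem to find any solution that currently works.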

- pbell

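You _can_ search Google without an API, but you are likely to get banned by Google if they suspect you are a bot. Read the TOS; you will probably have to pay to use their API in any significant way. - Athena

I looked into doing it without the API, and I have to change my header/user-agent info. But even when I do that, I still can't get results. If that worked, I would just put a sleep timer between requests so as not to be seen as a bot. - pbell

I have written a Google search bot that works great, but since using a bot directly violates Google's ToS, I'm not going to post it. Whatever you are trying to do, it is best to go through the official APIs. - Athena

Submitting programmatic search queries to Google violates Google's Webmaster Guidelines and Terms of Service. Running this code against Google will most likely result in Google showing captchas for your IP address. - undefined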

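8 Answers

10

You have two options: build it yourself or use a SERP API.

A SERP API returns the Google search results as a formatted JSON response.

I would recommend a SERP API, as it is easier to use and you don't have to worry about getting blocked by Google.

1. SERP API

I have had good experience with the scraperbox serp api.

You can use the following code to call the API. Make sure to replace YOUR_API_TOKEN with your scraperbox API token.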

import urllib.parse
import urllib.request
import json

# Urlencode the query string
q = urllib.parse.quote_plus("Where can I get the best coffee")

# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q
query += "&proxy_location=gb"

# Call the API.
request = urllib.request.Request(query)

raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)

# Print the first result title
print(response["organic_results"][0]["title"])

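2. Build your own Python scraper

I recently wrote an in-depth blog post on how to scrape search results with Python.

Here is a quick summary:

First, you should get the HTML contents of the Google search result page.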

import urllib.request

url = 'https://google.com/search?q=Where+can+I+get+the+best+coffee'

# Perform the request
request = urllib.request.Request(url)

# Set a normal User Agent header, otherwise Google will block the request.
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")

Then you can use BeautifulSoup to extract the search results. For example, the following code gets all the titles.

from bs4 import BeautifulSoup

# The code to get the html contents here.

soup = BeautifulSoup(html, 'html.parser')

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for a h3 tag
    results = div.select("h3")

    # Check if we have found a result
    if results:

        # Print the title
        h3 = results[0]
        print(h3.get_text())

You can extend this code to also extract the URLs and descriptions of the search results.
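As a rough sketch of that kind of extension, the snippet below pairs each result title with its link. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs. The sample markup (an a tag wrapping an h3 inside div.g) and the names ResultParser and extract_results are assumptions for illustration; Google's real layout changes often.

```python
from html.parser import HTMLParser


class ResultParser(HTMLParser):
    """Collect (title, url) pairs for every <h3> nested inside an <a href>."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None      # href of the <a> we are currently inside
        self._in_h3 = False    # True while we are inside a result title
        self._title = []       # text fragments of the current title

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "h3" and self._href:
            self._in_h3 = True
            self._title = []

    def handle_data(self, data):
        if self._in_h3:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self.results.append(("".join(self._title).strip(), self._href))
            self._in_h3 = False
        elif tag == "a":
            self._href = None


def extract_results(html):
    """Parse an HTML string and return a list of (title, url) tuples."""
    parser = ResultParser()
    parser.feed(html)
    return parser.results


# Hypothetical sample markup, not real Google output
sample_html = """
<div id="search">
  <div class="g"><a href="https://example.com/a"><h3>First result</h3></a></div>
  <div class="g"><a href="https://example.com/b"><h3>Second result</h3></a></div>
</div>
"""

for title, url in extract_results(sample_html):
    print(title, url)
```

Titles that are not wrapped in a link are skipped, so the selector logic would need adjusting if the real page nests things differently.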

- Dirk Hoekstra

#1 doesn't work for the basic example on their own web page. I guess Google got to them as well. - Hrvoje

10

Here is another service that can be used for scraping SERPs (https://zenserp.com). It does not require a client and is relatively cheap.

Here is a Python code sample:

import requests

# Replace YOUR_API_KEY with your zenserp API key
headers = {
    'apikey': 'YOUR_API_KEY',
}

params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)

response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)

# Print the raw response body
print(response.text)

- LeitnerChristoph

1

I have been using this API for two months now, as it is the only one that offers a free plan. It works great so far, with no issues at all! - Dominik Kukacka

3

The current answers may work, but Google will ban you for scraping.

My current solution uses requests_ip_rotator.

import requests
from urllib.parse import quote_plus
from requests_ip_rotator import ApiGateway

keywords = ['test']


def parse(keyword, session):
    # URL-encode the keyword so spaces and special characters survive
    url = f"https://www.google.com/search?q={quote_plus(keyword)}"
    response = session.get(url)
    print(response)


if __name__ == '__main__':
    AWS_ACCESS_KEY_ID = ''
    AWS_SECRET_ACCESS_KEY = ''

    gateway = ApiGateway("https://www.google.com", access_key_id=AWS_ACCESS_KEY_ID,
                         access_key_secret=AWS_SECRET_ACCESS_KEY)
    gateway.start()

    session = requests.Session()
    session.mount("https://www.google.com", gateway)

    for keyword in keywords:
        parse(keyword, session)
    gateway.shutdown()

You can create the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the AWS console. This solution lets you send up to 1 million requests (the Amazon free-tier limit).

- Nikolay Pavlin

Nice! It seems to work well. - VladislavS

1

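You can also use a third-party service like Serp API, a paid Google search engine results API (I wrote and run this tool). It solves the problem of being blocked, and you don't have to rent proxies or parse the results yourself.

It's easy to integrate with Python: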

from lib.google_search_results import GoogleSearchResults

params = {
    "q" : "Coffee",
    "location" : "Austin, Texas, United States",
    "hl" : "en",
    "gl" : "us",
    "google_domain" : "google.com",
    "api_key" : "demo",
}

query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()

GitHub: https://github.com/serpapi/google-search-results-python

- Hartator

8

You have to pay for this API key. - Tejas Krishna Reddy

1

@TejasKrishnaReddy, there is a free non-commercial plan with 100 searches per month. - ilyazub

0

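I can think of at least three ways to do this:

1. Use Google's Custom Search JSON API
2. Create your own web scraping solution
3. Use SerpApi (recommended)

Using Google's Custom Search JSON API:

You can use the "Google Custom Search JSON API". First, you need to set up a Custom Search Engine (CSE) in the Google Cloud Console and get an API key. Once you have both, you can send HTTP requests to the API using Python's requests library or the Google API client library. By passing your search query and API key as parameters, you receive the search results in JSON format, which you can then process as needed.

Keep in mind that this API is not free and has usage limits, so monitor your queries to avoid unexpected charges.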
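To make the request/response flow a bit more concrete, here is a minimal sketch that builds the request URL and pulls titles and links out of a response. It assumes the https://www.googleapis.com/customsearch/v1 endpoint with its key, cx and q parameters; the helper names, the placeholder credentials and the hand-written sample JSON are made up for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_request_url(query, api_key, engine_id):
    """Build the request URL from the API key, engine ID (cx) and query."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return API_ENDPOINT + "?" + urlencode(params)


def parse_items(raw_json):
    """Return (title, link) pairs from the 'items' list of a response."""
    data = json.loads(raw_json)
    # Default to an empty list when the response carries no 'items' key
    return [(item.get("title"), item.get("link")) for item in data.get("items", [])]


# Placeholder credentials; replace with your own key and engine ID
url = build_request_url("best coffee", "YOUR_API_KEY", "YOUR_ENGINE_ID")
print(url)

# Hand-written sample shaped like an API response, not real output
sample = '{"items": [{"title": "Example", "link": "https://example.com"}]}'
print(parse_items(sample))
```

In real use you would fetch the built URL with requests.get or urllib.request.urlopen and pass the response body to parse_items.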

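Creating your own DIY solution:

If you want to get Google search results in Python without relying on Google's official API, you can use web scraping tools such as BeautifulSoup and requests. Here is a simple approach:

2.1 Use the requests library to fetch the HTML content of the Google search results page.

2.2 Use BeautifulSoup to parse the HTML and extract data from the search results.

You may run into IP bans or other scraping issues. In addition, Google's page structure may change, which would break your scraper. The point is that building your own Google scraper comes with many challenges.

Using SerpApi makes all of this simple. SerpApi provides a more structured and reliable way to get Google search results without scraping Google directly. It essentially acts as a middleman that handles the complexities of scraping and returns structured JSON results. As a result, you save the time and effort of collecting data from Google without building your own Google scraper or using other web scraping tools.

Here is a tutorial on how to scrape Google search results with Python, for reference on the third option.

Hope it helps!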

- MisterCat

0

You can also use the Google Search API from Serpdog (https://serpdog.io) to scrape Google search results in Python:

import requests
payload = {'api_key': 'APIKEY', 'q': 'coffee', 'gl': 'us'}
resp = requests.get('https://api.serpdog.io/search', params=payload)
print(resp.text)

Docs: https://docs.serpdog.io

Disclaimer: I am the founder of serpdog.io

- Darshan

0

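Another service you can use for scraping Google Search or other SERP data is SearchApi. You may want to check it out and test it, as it offers 100 free credits on registration. It provides a rich JSON data set and includes the raw HTML of the request at no extra cost, so you can compare the HTML data against the results.

Google Search API documentation: https://www.searchapi.io/docs/google

Python execution example: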

import requests

payload = {'api_key': 'key', 'engine': 'google', 'q': 'pizza'}
response = requests.get('https://www.searchapi.io/api/v1/search', params=payload)

print(response.text)

Disclaimer: I work at SearchApi.

- Sebas

Page content provided by Stack Overflow.

- StuxCrystal · Accepted Answer

You can always scrape Google results directly. To do this, you can use the URL https://google.com/search?q=<Query>, which returns the top 10 search results.
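Since <Query> has to be URL-encoded, a small helper along these lines may be useful. This is only a sketch: build_search_url is a hypothetical name, and the start offset used for later pages is an assumption about how Google paginates results rather than something stated in this answer.

```python
from urllib.parse import urlencode


def build_search_url(query, page=0, per_page=10):
    """Return a Google search URL with the query safely URL-encoded."""
    params = {"q": query}
    if page > 0:
        # Assumption: 'start' is the zero-based offset of the first result
        params["start"] = page * per_page
    return "https://google.com/search?" + urlencode(params)


print(build_search_url("Where can I get the best coffee"))
print(build_search_url("C++ & Python", page=1))
```

Using urlencode rather than string concatenation keeps characters such as spaces, + and & from corrupting the query string.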

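Then you can use lxml, for example, to parse the page. Depending on what you use, you can query the resulting node tree either with a CSS selector (.r a) or with an XPath selector (//h3[@class="r"]/a).

In some cases the resulting URL will redirect to Google. Usually it contains a query parameter q that holds the actual request URL.

Example code using lxml and requests: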

from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        # Google redirect link: the real target is in the "q" parameter
        url = parse_qs(urlparse(url).query)['q'][0]
    print(url)

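A note on Google banning your IP: in my experience, Google only bans you if you start spamming it with search requests. If Google thinks you are a bot, it responds with a 503.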
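One way to react to that 503 is to wait and retry with growing delays. The sketch below shows the idea using only the standard library; fetch_with_backoff, its parameters and the delay values are illustrative assumptions, not part of any library, and backing off does not make automated queries compliant with Google's terms.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, opener=urllib.request.urlopen, retries=3,
                       base_delay=2.0, sleep=time.sleep):
    """Fetch url, sleeping base_delay * 2**attempt after each 503 response."""
    for attempt in range(retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Re-raise other errors, or the 503 once the retries are used up
            if err.code != 503 or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Example call (commented out so the snippet makes no network request):
# html = fetch_with_backoff("https://www.google.com/search?q=StackOverflow")
```

The opener and sleep parameters exist mainly so the function can be exercised without real network calls or real waiting.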