How can I extract descriptions from Google search results with Python?

I want to extract the descriptions from a Google search. So far I have this code:

from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get

url = 'https://www.google.com/search?q=Gotham'
raw = get(url).text
pg = fromstring(raw)
for result in pg.cssselect(".r a"):
    href = result.get("href", "")
    # result links of the form /url?q=<target>&... are unwrapped to the target
    if href.startswith("/url?"):
        href = parse_qs(urlparse(href).query)['q'][0]
    print(href)

This extracts the URLs related to the search. How can I extract the description that is shown below each URL?

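You should be careful about querying Google programmatically. If you do it frequently, you may get blocked for violating their terms of service. I'd suggest using their Custom Search API instead. - Glubbdrubb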
1 Answer

You can use the BeautifulSoup web scraping library to scrape the descriptions from Google search results.

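To collect information from all pages, you can paginate with a while True loop. The loop is endless by itself, so it needs an exit condition. In our case it keeps going as long as the page contains a button for switching to the next page, matched by the CSS selector ".d6cvqb a[id=pnnext]", and exits once that button is missing: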

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
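
The same loop can be sketched in isolation, without any network access. This is only an illustration under stated assumptions: `fetch_page` is a hypothetical callable that returns one page of items plus a flag saying whether a next-page button was found, and the `max_pages` cap is an addition that is not part of the answer's code. It keeps the loop from running forever if the next-page check never turns false.

```python
def collect_all(fetch_page, step=10, max_pages=50):
    """Collect items page by page until no next page is reported."""
    collected = []
    start = 0
    for _ in range(max_pages):
        items, has_next = fetch_page(start)
        collected.extend(items)
        if not has_next:
            break          # no next-page button: stop paginating
        start += step      # same role as params["start"] += 10
    return collected


# Hypothetical stand-in for a real request: 25 fake results, 10 per page.
FAKE_RESULTS = [f"result {i}" for i in range(25)]


def fake_fetch(start):
    page = FAKE_RESULTS[start:start + 10]
    has_next = start + 10 < len(FAKE_RESULTS)
    return page, has_next


all_results = collect_all(fake_fetch)
print(len(all_results))
```

In a real scraper, `fetch_page` would wrap the request and the `select_one` check on the next-page button.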

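You can find all the information you need (description, title, etc.) with CSS selectors. The right selectors are easy to identify on the page with the SelectorGadget Chrome extension, although it does not always work perfectly when the site is rendered with JavaScript.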
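
To illustrate how such selectors are applied, here is a small sketch that runs BeautifulSoup against an inline HTML fragment. The fragment is made up for the example and only imitates the class names used in this answer (`.tF2Cxc`, `.yuRUbf`, `.lEBKkf`); Google's real markup differs and changes over time. The built-in `html.parser` is used here so that the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the second result deliberately has no description.
SAMPLE_HTML = """
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/a">Title A</a></div>
  <div class="lEBKkf">Description A</div>
</div>
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/b">Title B</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

parsed = []
for result in soup.select(".tF2Cxc"):            # one container per result
    link = result.select_one(".yuRUbf a")        # None if nothing matches
    snippet = result.select_one(".lEBKkf")
    parsed.append({
        "url": link["href"] if link else None,
        "description": snippet.get_text(strip=True) if snippet else None,
    })

print(parsed)
```

Checking for `None` before reading an attribute is one way to handle results that lack a description.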

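Make sure you pass a user-agent in the request headers so that the request looks like a "real" user visit. The default requests user-agent is python-requests, so a website can tell that the request most likely comes from a script. Check what your user-agent is.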
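
A quick way to see the difference is to build a request without sending it. The sketch below assumes only the requests library itself; it prints the default user-agent and then the one that would actually go out once custom headers are passed.

```python
import requests

# User-agent that requests sends when no custom header is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

# Prepare the request locally; nothing is sent over the network here.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request(
        "GET",
        "https://www.google.com/search",
        params={"q": "gotham"},
        headers=headers,
    )
)

print(prepared.headers["User-Agent"])  # the custom value replaces the default
print(prepared.url)                    # params are encoded into the query string
```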

Check the code in the online IDE.

from bs4 import BeautifulSoup
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "gotham",       # query
    "hl": "en",          # language
    "gl": "us",          # country of the search, US -> USA
    "start": 0,          # result offset: 0 is the first page, 10 the second, ...
    #"num": 100          # maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0

website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')  # the 'lxml' parser requires the lxml package to be installed

    for result in soup.select(".tF2Cxc"):
        website_name = result.select_one(".yuRUbf a")["href"]
        try:
            description = result.select_one(".lEBKkf").text
        except AttributeError:
            # select_one() returned None: this result has no description element
            description = None

        website_data.append({
            "website_name": website_name,
            "description": description
        })

    # keep paginating while a "next page" button is present
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "website_name": "https://www.imdb.com/title/tt3749900/",
    "description": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
  },
  {
    "website_name": "https://www.netflix.com/watch/80023082",
    "description": "When the key witness in a homicide ends up dead while being held for questioning, Gordon suspects an inside job and seeks details from an old friend."
  },
  {
    "website_name": "https://www.gothamknightsgame.com/",
    "description": "Gotham Knights is an open-world, action RPG set in the most dynamic and interactive Gotham City yet. In either solo-play or with one other hero, ..."
  },
  # ...
]

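You can also use the Google Search Engine Results API from SerpApi. It is a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), and you don't need to create a parser and maintain it.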

Code example:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": os.getenv("API_KEY"), # serpapi key
    "engine": "google",              # serpapi parser engine
    "q": "gotham",                   # search query
    "num": "100"                     # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)      # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()    # JSON -> Python dictionary

    page_num += 1

    # .get() with a default avoids a KeyError on a page without organic results
    for result in results.get("organic_results", []):
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")
        })

    # follow the next page link for as long as one is provided
    next_link = results.get("serpapi_pagination", {}).get("next_link")
    if next_link:
        search.params_dict.update(dict(parse_qsl(urlsplit(next_link).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Output:

[
  {
    "title": "Gotham (TV Series 2014–2019) - IMDb",
    "snippet": "The show follows Jim as he cracks strange cases whilst trying to help a young Bruce Wayne solve the mystery of his parents' murder. It seemed each week for a ..."
  },
  {
    "title": "Gotham (TV series) - Wikipedia",
    "snippet": "Gotham is an American superhero crime drama television series developed by Bruno Heller, produced by Warner Bros. Television and based on characters from ..."
  },
  # ...
]
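
The pagination step in the SerpApi example relies on turning the query string of the next page link back into a dictionary of parameters. A minimal standard-library sketch of just that step follows; the `next_link` value is a made-up URL for illustration, not a real API response.

```python
from urllib.parse import urlsplit, parse_qsl

# Made-up next page link, for illustration only.
next_link = "https://serpapi.com/search.json?engine=google&q=gotham&num=100&start=100"

# Parameters of the current page.
params = {"engine": "google", "q": "gotham", "num": "100"}

# urlsplit() isolates the query string, parse_qsl() yields (key, value) pairs.
next_params = dict(parse_qsl(urlsplit(next_link).query))

# Updating the current parameters moves the search to the next page.
params.update(next_params)
print(params)
```

Note that every parsed value is a string, so `start` comes back as "100" rather than the integer 100.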
