Google - get the "Featured Snippet" search result?

How can I extract the featured snippet from a Google search results page?

Do you mean Custom Search? Web Search no longer has an API, and extracting content by automated means is forbidden by their terms... "You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)..." - samiles
Yes, I mean Custom Search. - Yifat Biezuner
In that case, yes, the XML results returned by Custom Search can include all the metadata you want. The full documentation is here. - Essentially, you need to define your own response format for Google so that it starts returning what you need. This page of the documentation explains how to add the data and test it. - samiles
I tried using the cx value for "thing", but it gives me irrelevant results. Is there an entity type you would recommend using in the generated cx? - Yifat Biezuner
@YifatBiezuner any thoughts on this? - Arpit Suthar
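As a rough sketch of the Custom Search route discussed in the comments: the Custom Search JSON API takes an API key, a search engine ID (`cx`), and a query. This is a minimal sketch only, and the `API_KEY` and `CX` values below are placeholders, not real credentials:

```python
import requests

# Hypothetical credentials -- substitute your own API key and engine ID (cx)
API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

def build_cse_request(query):
    # Prepare (but do not send) a request to the Custom Search JSON API
    req = requests.Request(
        "GET",
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query},
    )
    return req.prepare()

prepared = build_cse_request("featured snippet test")
print(prepared.url)
```

Sending the prepared request with a `requests.Session` returns JSON whose `items` list carries the per-result metadata.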
1 Answer

If you want to scrape Google search result snippets, you can use the BeautifulSoup web scraping library. However, you may run into problems if you send a large number of requests.
To reduce the chance of being blocked, you can add headers that specify your user-agent. This matters because it lets Google identify the request as coming from a user rather than a bot, and thus not block it:
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

An additional step is to rotate the user-agent.
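User-agent rotation can be sketched as below. This is a minimal illustration: the pool of User-Agent strings is an assumed example set, and any collection of real browser strings works the same way:

```python
import random

# Illustrative pool of desktop browser User-Agent strings (assumed examples)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
]

def random_headers():
    # Pick a different User-Agent for each request
    return {"User-Agent": random.choice(USER_AGENTS)}
```

You could then pass `random_headers()` instead of a fixed `headers` dict on each `requests.get()` call.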
The code example below shows a solution that uses pagination to fetch more results. You can paginate through all pages with an infinite while loop. Pagination is possible as long as a "next" button exists (determined by a button selector on the page, in our case the CSS selector ".d6cvqb a[id=pnnext]"). If there is a next page, the value of ["start"] must be incremented by 10 to access it; otherwise, we exit the while loop:
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break

Check the full code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "python",       # query example
    "hl": "en",          # language
    "gl": "us",          # country of the search, US -> USA
    "start": 0,          # page offset, 0 = first page
    #"num": 100          # parameter defines the maximum number of results to return.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0

website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")
        
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    
    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf").text
        except AttributeError:
            # no snippet element present for this result
            snippet = None

        website_data.append({
            "title": title,
            "snippet": snippet
        })
      
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))

Example output:
[
  {
    "title": "Welcome to Python.org",
    "snippet": "The official home of the Python Programming Language."
  },
  {
    "title": "Python (programming language) - Wikipedia",
    "snippet": "Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation."
  },
  {
    "title": "Python Courses & Tutorials - Codecademy",
    "snippet": "Python is a general-purpose, versatile, and powerful programming language. It's a great first language because Python code is concise and easy to read."
  },
  {
    "title": "Python - GitHub",
    "snippet": "Repositories related to the Python Programming language - Python. ... Collection of library stubs for Python, with static types. Python 3.3k 1.4k."
  },
  {
    "title": "Learn Python - Free Interactive Python Tutorial",
    "snippet": "learnpython.org is a free interactive Python tutorial for people who want to learn Python, fast."
  },
  # ...
]

Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so there's no need to create a parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
  "api_key": os.getenv("API_KEY"), # serpapi key
  "engine": "google",              # serpapi parser engine
  "q": "python",                   # search query
  "num": "100"                     # number of results per page (100 per page in this case)
  # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)      # where data extraction happens

organic_results_data = []

while True:
    results = search.get_dict()    # JSON -> Python dictionary
    
    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")   
        })
    
    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break
    
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

The output is exactly the same as in the bs4 answer.
