How do you extract featured snippets from a Google search results page?
BeautifulSoup is a web scraping library. However, if you make a large number of requests, you may run into problems. To avoid being blocked, pass headers that specify your user-agent. This matters because it lets Google identify the request as coming from a user rather than a bot:

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
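To confirm the custom header is actually attached before anything is sent, you can build a prepared request offline (a minimal sketch; no network call is made, and the query parameter is just an example):

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

# Prepare the request without sending it, just to inspect what would go out
req = requests.Request(
    "GET",
    "https://www.google.com/search",
    params={"q": "python"},
    headers=headers,
).prepare()

print(req.headers["User-Agent"])  # the custom UA, not the default python-requests one
print(req.url)
```

If the User-Agent line were omitted, requests would send its default `python-requests/x.y.z` identifier, which Google readily blocks.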
Use a while loop to paginate through all result pages. Pagination continues as long as a "Next" button exists on the page (checked with the CSS selector '.d6cvqb a[id=pnnext]'). If a next page exists, increase the value of params["start"] by 10 to move to it; otherwise, exit the while loop:

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
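The pagination logic can be sketched in isolation (a toy simulation: here `has_next` stands in for the `.d6cvqb a[id=pnnext]` selector check, and we pretend the "Next" button disappears after the fourth page):

```python
start = 0
visited_offsets = []

while True:
    visited_offsets.append(start)
    # In the real scraper this is: soup.select_one('.d6cvqb a[id=pnnext]')
    has_next = start < 30  # pretend the "Next" button vanishes after offset 30
    if has_next:
        start += 10
    else:
        break

print(visited_offsets)  # [0, 10, 20, 30]
```

Each offset of 10 corresponds to one page of 10 organic results, which is why the real loop bumps params["start"] by exactly 10.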
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "python",   # query example
    "hl": "en",      # language
    "gl": "us",      # country of the search, US -> USA
    "start": 0,      # page offset, 0 by default
    # "num": 100     # parameter defines the maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0
website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf").text
        except AttributeError:
            snippet = None

        website_data.append({
            "title": title,
            "snippet": snippet
        })

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))
Output:

[
  {
    "title": "Welcome to Python.org",
    "snippet": "The official home of the Python Programming Language."
  },
  {
    "title": "Python (programming language) - Wikipedia",
    "snippet": "Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation."
  },
  {
    "title": "Python Courses & Tutorials - Codecademy",
    "snippet": "Python is a general-purpose, versatile, and powerful programming language. It's a great first language because Python code is concise and easy to read."
  },
  {
    "title": "Python - GitHub",
    "snippet": "Repositories related to the Python Programming language - Python. ... Collection of library stubs for Python, with static types. Python 3.3k 1.4k."
  },
  {
    "title": "Learn Python - Free Interactive Python Tutorial",
    "snippet": "learnpython.org is a free interactive Python tutorial for people who want to learn Python, fast."
  },
  # ...
]
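Once collected, the list can be persisted the same way it is printed. A minimal sketch that round-trips a sample of website_data through a JSON file (the filename and the sample record are arbitrary):

```python
import json

# Sample of the scraped structure, taken from the output above
website_data = [
    {"title": "Welcome to Python.org",
     "snippet": "The official home of the Python Programming Language."},
]

# Write with the same flags used for printing, so non-ASCII text survives intact
with open("website_data.json", "w", encoding="utf-8") as f:
    json.dump(website_data, f, indent=2, ensure_ascii=False)

with open("website_data.json", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == website_data)  # True
```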
Alternatively, you can achieve the same thing with SerpApi's Google Search API, which handles blocks and parsing on its own:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",               # serpapi parser engine
    "q": "python",                    # search query
    "num": "100"                      # number of results per page (100 in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")
        })

    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
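The parse_qsl(urlsplit(...).query) line is what advances the pagination: it turns the next_link URL back into a parameter dict that overwrites the old start value. A standalone sketch with a made-up next_link (the URL is illustrative, shaped like a serpapi_pagination response, not a real one):

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical next_link, shaped like the one serpapi_pagination returns
next_link = "https://serpapi.com/search.json?engine=google&q=python&num=100&start=100"

# urlsplit isolates the query string; parse_qsl splits it into (key, value) pairs
next_params = dict(parse_qsl(urlsplit(next_link).query))
print(next_params)
# {'engine': 'google', 'q': 'python', 'num': '100', 'start': '100'}
```

Updating search.params_dict with this dict makes the next get_dict() call fetch the following page of results.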