I want to use a Python script to search Google for a piece of text and return the name, description, and URL of each result. Currently I am using the following code:

from google import search
ip = raw_input("What would you like to search for? ")
for url in search(ip, stop=20):
    print(url)

This returns only the URLs. How can I also get the name and description for each URL?
The stop=20 argument aside, it seems this library can only return URLs, which makes it rather incomplete. The library you are currently using therefore cannot do what you want. You can use another package instead, whose results carry more fields:

from google import google

num_page = 3
search_results = google.search("This is my query", num_page)
for result in search_results:
    print(result.description)
Although it is not exactly what I was after, I found a decent solution for now (I may edit this answer if I manage to improve it). I combined searching Google as before (returning only URLs) with the Beautiful Soup package to parse each HTML page:

from googlesearch import search
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen
from bs4 import BeautifulSoup

def google_scrape(url):
    thepage = urlopen(url)
    soup = BeautifulSoup(thepage, "html.parser")
    return soup.title.text

i = 1
query = 'search this'
for url in search(query, stop=10):
    a = google_scrape(url)
    print(str(i) + ". " + a)
    print(url)
    print(" ")
    i += 1

This gives me a list of page titles and links.
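To illustrate the title-extraction step without Beautiful Soup or a network call, here is a minimal sketch using only the standard library's html.parser module (the helper name `extract_title` is mine, not from the answer above):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html_text):
    # hypothetical helper: returns the contents of the page's <title> tag
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title.strip()

print(extract_title("<html><head><title>Example Domain</title></head></html>"))
# prints: Example Domain
```

This avoids an extra dependency when all you need is the page title, though Beautiful Soup is the more robust choice for messy real-world HTML.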
There are other good solutions as well, for example:

from googlesearch import search
import requests

for url in search(ip, stop=10):
    r = requests.get(url)
    title = everything_between(r.text, '<title>', '</title>')
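The snippet above relies on an everything_between helper that the answer never defines. A minimal sketch of what it presumably does (naive marker-based substring extraction, with no real HTML awareness) could look like this:

```python
def everything_between(text, begin, end):
    # Naive extraction: return the substring between the first occurrence
    # of `begin` and the following `end`, or '' if either marker is missing.
    start = text.find(begin)
    if start == -1:
        return ""
    start += len(begin)
    stop = text.find(end, start)
    if stop == -1:
        return ""
    return text[start:stop]

print(everything_between("<html><title>My Page</title></html>", "<title>", "</title>"))
# prints: My Page
```

This is fragile against attributes like `<title lang="en">`, which is why the HTML-parser approaches elsewhere in the thread are generally preferable.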
from googlesearch import search

Use 'googlesearch' instead of 'google' ;) - Jayesh Dhandha

Most of the methods I tried either did not work properly or threw errors, such as a "search module not found" error even though the package was imported. I also used the Selenium WebDriver, which works well with Firefox, Chrome, or a Phantom (headless) browser, but it is still a bit slow in terms of execution time, since it first drives the browser and then returns the search results.
So I turned to the Google API instead, which returns results quickly and accurately.
Before sharing the code, a couple of quick tips: you will need an API key and a Custom Search Engine ID (both appear as placeholders in the code below).
That's it; now all you need to do is run the following code:
from googleapiclient.discovery import build

my_api_key = "your API KEY TYPE HERE"
my_cse_id = "YOUR CUSTOM SEARCH ENGINE ID TYPE HERE"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res['items']

results = google_search("YOUR SEARCH QUERY HERE", my_api_key, my_cse_id, num=10)
for result in results:
    print(result["link"])
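Each entry in res['items'] is a plain dict; in the Custom Search JSON API the name, description, and URL the question asks for live in the title, snippet, and link fields. A small offline sketch of unpacking them (the sample response below is fabricated for illustration):

```python
def unpack_items(items):
    # Pull (title, snippet, link) out of Custom Search JSON API result items.
    return [(it.get("title"), it.get("snippet"), it.get("link")) for it in items]

# fabricated sample shaped like the API's "items" array
sample_items = [
    {"title": "Example Domain",
     "snippet": "This domain is for use in illustrative examples...",
     "link": "https://example.com"},
]

for title, snippet, link in unpack_items(sample_items):
    print(title, "-", link)
```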
from serpapi import GoogleSearch

params = {
    "q": "Coffee",
    "location": "Austin, Texas, United States",
    "hl": "en",
    "gl": "us",
    "google_domain": "google.com",
    "api_key": "demo",
}

query = GoogleSearch(params)
dictionary_results = query.get_dict()
GitHub: https://github.com/serpapi/google-search-results-python
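query.get_dict() returns a plain Python dict; the regular results sit under the organic_results key, each with title, link, and snippet fields. A sketch of walking that structure offline (the response dict below is a fabricated stand-in for a real API response):

```python
def organic(results_dict):
    # Extract (title, link, snippet) triples from a SerpApi-style response dict.
    return [
        (r.get("title"), r.get("link"), r.get("snippet"))
        for r in results_dict.get("organic_results", [])
    ]

# fabricated stand-in for query.get_dict()
fake_response = {
    "organic_results": [
        {"title": "Coffee - Wikipedia",
         "link": "https://en.wikipedia.org/wiki/Coffee",
         "snippet": "Coffee is a beverage brewed from roasted coffee beans."},
    ]
}

for title, link, snippet in organic(fake_response):
    print(title, link)
```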
Usually, you cannot use Google's search functionality by importing the google package in Python 3, though you could in Python 2.
Even a plain requests.get(url+query) will not scrape the results, because Google redirects to a CAPTCHA page to prevent scraping.
Possible workarounds:
import google_search_origin

if __name__ == '__main__':
    # Initialisation of the class
    google_search = google_search_origin.GoogleSearchOrigin(search='sun')
    # Request from the url assembled
    google_search.request_url()
    # Display the link parsed depending on the result
    print(google_search.get_all_links())
    # Modify the parameter
    google_search.parameter_search('dog')
    # Assemble the url
    google_search.assemble_url()
    # Request from the url assembled
    google_search.request_url()
    # Display the raw text depending on the result
    print(google_search.get_response_text())
A while loop is used to extract data from all pages: it iterates over every page, however many there are, until a certain condition is met. In our case, that condition is the presence of the next-page button on the page (the .d6cvqb a[id=pnnext] CSS selector):

# stop the loop on the absence of the next page
if soup.select_one(".d6cvqb a[id=pnnext]"):
    params["start"] += 10
else:
    break
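The start parameter that the loop bumps by 10 is a zero-based result offset; a tiny sketch of the arithmetic (the helper name is mine, for illustration only):

```python
def start_offset(page_num, per_page=10):
    # Google's "start" URL parameter: page 1 -> 0, page 2 -> 10, page 3 -> 20, ...
    return (page_num - 1) * per_page

print([start_offset(p) for p in (1, 2, 3)])
# prints: [0, 10, 20]
```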
Also pass a user-agent in the request headers; the website will then assume you are a regular user and serve the information normally.
Check the full code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
query = input("What would you like to search for? ")
params = {
    "q": query,   # query example
    "hl": "en",   # language
    "gl": "uk",   # country of the search, UK -> United Kingdom
    "start": 0,   # zero-based page offset
    # "num": 100  # parameter defines the maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10  # page limit if you don't need to fetch everything
page_num = 0
data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:  # no snippet element for this result
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # stop loop due to page limit condition
    if page_num == page_limit:
        break
    # stop the loop on the absence of the next page
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Web Scraping with Python - Pluralsight",
"snippet": "There are times in which you need data but there is no API (application programming interface) to be found. Web scraping is the process of extracting data ...",
"links": "https://www.pluralsight.com/paths/web-scraping-with-python"
},
{
"title": "Chapter 8 Web Scraping | Machine learning in python",
"snippet": "Web scraping means extacting data from the “web”. However, web is not just an anonymous internet “out there” but a conglomerat of servers and sites, ...",
"links": "http://faculty.washington.edu/otoomet/machinelearning-py/web-scraping.html"
},
{
"title": "Web scraping 101",
"snippet": "This vignette introduces you to the basics of web scraping with rvest. You'll first learn the basics of HTML and how to use CSS selectors to refer to ...",
"links": "https://cran.r-project.org/web/packages/rvest/vignettes/rvest.html"
},
other results ...
]
If you want to learn more about website scraping, check out the "13 ways to scrape any public data from any website" blog post.