Python: Scraping Google Search Results with BeautifulSoup

Question

python screen-scraping web-scraping beautifulsoup urllib

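Goal: pass a search string to Google and scrape each result's URL, title, and the short description shown alongside them.

I have the following code, which currently returns only the first 10 results, Google's default limit for a single page. I am not sure how to handle pagination while scraping. There are also differences between the results on the actual page and what my code prints. I am also unsure of the best way to parse the span elements.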

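So far I have the span shown below, and I would like to remove the <em> element and concatenate the rest of the string. What is the best way to do that?
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>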

Code:

from BeautifulSoup import BeautifulSoup
import urllib, urllib2

def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mosilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)

    linkdictionary = {}

    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        print sLink['href']
        sSpan = li.find('span', attrs={'class':'st'})
        print sSpan

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('beautifulsoup')

My output looks like this:

http://www.crummy.com/software/BeautifulSoup/
Beautiful Soup: a library designed for screen-scraping HTML and XML.
http://pypi.python.org/pypi/BeautifulSoup/3.2.1
Feb 16, 2012 – HTML/XML parser for quick-turnaround applications like screen-scraping.
http://www.beautifulsouptheatercollective.org/
The Beautiful Soup Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he ...
http://lxml.de/elementsoup.html
BeautifulSoup is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. BeautifulSoup uses a different parsing ...
https://launchpad.net/beautifulsoup/
The discussion group is at: http://groups.google.com/group/beautifulsoup · Home page ... Beautiful Soup 4.0 series is the current focus of development ...
http://www.poetry-online.org/carroll_beautiful_soup.htm
Beautiful Soup BEAUTIFUL Soup, so rich and green, Waiting in a hot tureen! Who for such dainties would not stoop? Soup of the evening, beautiful Soup!
http://www.youtube.com/watch?v=hDG73IAO5M8
Jul 6, 2009 – taken from the motion picture "Alice in wonderland" (1999) http://www.imdb.com/title/tt0164993/
http://www.soupsong.com/
A witty and substantive research effort on the history of soup and food in all cultures, with over 400 pages of recipes, quotations, stories, traditions, literary ...
http://www.facebook.com/beautifulsouptc
To connect with The Beautiful Soup Theater Collective, sign up for Facebook ... We're thrilled to announce the cast of Beautiful Soup's upcoming production of ...
http://blog.dispatched.ch/webscraping-with-python-and-beautifulsoup/
Mar 15, 2009 – Recently my life has been a hype; partly due to my upcoming Python addiction. There's simply no way around it; so I should better confess it in ...

The Google search results page has the following structure:

<li class="g">
  <div class="vsc" sig="bl_" bved="0CAkQkQo" pved="0CAgQkgowBQ">
  <h3 class="r">
  <div class="vspib" aria-label="Result details" role="button" tabindex="0">
  <div class="s">
  <div class="f kv">
  <div id="poS5" class="esc slp" style="display:none">
  <div class="f slp">3 answers - Jan 16, 2009</div>
  I read this without finding the solution: ... The "normal" way is to: Go to the Beautiful Soup web site, ... Brian beat me too it, but since I already have ...
  </div>
  <div> </div>
  <h3 id="tbpr_6" class="tbpr" style="display:none">
</li>

Each search result is listed under an <li> element.

- add-semi-colons

3 Answers

I built a simple regular expression for HTML tags, then called replace on the cleaned string to remove the dots.

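import re

# Non-greedy pattern that matches a single HTML tag
p = re.compile(r'<.*?>')
# Strip the tags, then remove every '.' character (sSpan comes from the question's code)
print(p.sub('', str(sSpan)).replace('.', ''))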

Before

<span class="st">The <em>Beautiful Soup</em> is a collection of all the pretty places you would rather be. All posts are credited via a click through link. For further inspiration of pretty things, <b>...</b><br /></span>

After

The Beautiful Soup is a collection of all the pretty places you would rather be All posts are credited via a click through link For further inspiration of pretty things,
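
As the After text shows, the `.replace('.', '')` call above also removes ordinary sentence periods. A minimal sketch of a variant that strips the tags but drops only a trailing ellipsis could look like the following. The helper name `strip_tags` and the shortened sample string are illustrative and not part of the original answer, and a regular expression remains only a rough tool for HTML.

```python
import re

TAG_RE = re.compile(r'<.*?>')               # same non-greedy tag pattern as above
ELLIPSIS_RE = re.compile(r'\s*\.{3}\s*$')   # matches only a trailing "..."


def strip_tags(html):
    """Remove tags, then drop a trailing ellipsis while keeping sentence periods."""
    text = TAG_RE.sub('', html)
    return ELLIPSIS_RE.sub('', text).strip()


# Shortened sample in the same shape as the span shown above (illustrative).
sample = ('<span class="st">The <em>Beautiful Soup</em> is a collection. '
          'For further inspiration of pretty things, <b>...</b><br /></span>')
print(strip_tags(sample))
```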

- add-semi-colons

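To get the text out of a span tag, you can use the .text / get_text() methods that BeautifulSoup provides. bs4 does all the heavy lifting, so you don't need to worry about how to get rid of the <em> tags.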
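
As a small self-contained illustration of `get_text()`, the sketch below parses the span from the question as a literal string instead of a live Google page. It assumes the `bs4` package is installed and uses Python's bundled `html.parser` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# The span from the question, used here as a fixed sample string.
html = ('<span class="st">The <em>Beautiful Soup</em> Theater Collective was '
        'founded in the summer of 2010 by its Artistic Director, Steven Carl '
        'McCasland. A continuation of a student group he <b>...</b><br /></span>')

soup = BeautifulSoup(html, 'html.parser')
span = soup.find('span', class_='st')

# get_text() concatenates every text node inside the tag and drops the markup.
text = span.get_text()
print(text)
```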

Code and full example (Google won't show more than ~400 results):

from bs4 import BeautifulSoup
import requests
import lxml
import urllib.parse


def print_extracted_data_from_url(url):
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text

    soup = BeautifulSoup(response, 'lxml')

    print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
    print(f'Current URL: {url}')
    print()

    # Class names such as 'tF2Cxc' reflect Google's markup when this was written and may change.
    for container in soup.find_all('div', class_='tF2Cxc'):
        head_text = container.find('h3', class_='LC20lb DKV0Md').text
        head_sum = container.find('div', class_='IsZvec').text
        head_link = container.a['href']
        print(head_text)
        print(head_sum)
        print(head_link)
        print()

    return soup.select_one('a#pnnext')


def scrape():
    next_page_node = print_extracted_data_from_url(
        'https://www.google.com/search?hl=en-US&q=coca cola')

    while next_page_node is not None:
        next_page_url = urllib.parse.urljoin('https://www.google.com',
                                             next_page_node['href'])

        next_page_node = print_extracted_data_from_url(next_page_url)

scrape()

Output:

Results via beautifulsoup

Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=coca cola

The Coca-Cola Company: Refresh the World. Make a Difference
We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way.‎Contact Us · ‎Careers · ‎Coca-Cola · ‎Coca-Cola System
https://www.coca-colacompany.com/home

Coca-Cola
2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.
https://www.coca-cola.com/

Together Tastes Better | Coca-Cola®
Coca-Cola is pairing up with celebrity chefs, talented athletes and more surprise guests all summer long to bring you and your loved ones together over the love ...
https://us.coca-cola.com/
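
The loop above follows the `a#pnnext` link from page to page. Since the URL in the question already carries a `start` parameter, another option is to build the page URLs directly. The sketch below only constructs the URLs. The function name `google_page_urls` is hypothetical, the step of 10 results per page is an assumption based on the default mentioned in the question, and whether Google serves every such offset to an automated client is not guaranteed.

```python
from urllib.parse import urlencode


def google_page_urls(query, pages=3, per_page=10):
    """Yield one search URL per result page by stepping the `start` offset."""
    for page in range(pages):
        # `q`, `hl` and `start` are the parameters used in the question's URL.
        params = {'q': query, 'hl': 'en', 'start': page * per_page}
        yield 'https://www.google.com/search?' + urlencode(params)


for url in google_page_urls('coca cola', pages=3):
    print(url)
```

Each generated URL could then be passed to `print_extracted_data_from_url` in place of the next-page link.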

Alternatively, you can use the Google Search Engine Results API from SerpApi to do this. It is a paid API with a free plan. Check out the Playground to test it.

Code to integrate:

import os
from serpapi import GoogleSearch

def scrape():
  
  params = {
    "engine": "google",
    "q": "coca cola",
    "api_key": os.getenv("API_KEY"),
  }

  search = GoogleSearch(params)
  results = search.get_dict()

  print(f"Current page: {results['serpapi_pagination']['current']}")

  for result in results["organic_results"]:
      print(f"Title: {result['title']}\nLink: {result['link']}\n")

  while 'next' in results['serpapi_pagination']:
      search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
      results = search.get_dict()

      print(f"Current page: {results['serpapi_pagination']['current']}")

      for result in results["organic_results"]:
          print(f"Title: {result['title']}\nLink: {result['link']}\n")

scrape()

Output:

Results from SerpApi

Current page: 1
Title: The Coca-Cola Company: Refresh the World. Make a Difference
Link: https://www.coca-colacompany.com/home

Title: Coca-Cola
Link: https://www.coca-cola.com/

Title: Together Tastes Better | Coca-Cola®
Link: https://us.coca-cola.com/

Title: Coca-Cola - Wikipedia
Link: https://en.wikipedia.org/wiki/Coca-Cola

Title: Coca-Cola - Home | Facebook
Link: https://www.facebook.com/Coca-Cola/

Title: The Coca-Cola Company | LinkedIn
Link: https://www.linkedin.com/company/the-coca-cola-company

Title: Coca-Cola UNITED: Home
Link: https://cocacolaunited.com/

Title: World of Coca-Cola: Atlanta Museum & Tourist Attraction
Link: https://www.worldofcoca-cola.com/

Current page: 2
Title: Coca-Cola (@CocaCola) | Twitter
Link: https://twitter.com/cocacola?lang=en
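
The code above repeats the printing loop once for the first page and once inside the `while` loop. A possible way to fold both into one place is a generator that takes the page-fetching step as an argument. This is only a sketch: `iter_organic_results` and `fetch_page` are hypothetical names, and the stand-in dictionary below merely mimics the keys used above so the example runs without an API key.

```python
def iter_organic_results(fetch_page, page_size=10):
    """Yield (page_number, result) pairs until no 'next' entry is reported.

    fetch_page(start) is a caller-supplied function (hypothetical) returning a
    dict with 'serpapi_pagination' and 'organic_results' keys, as used above.
    """
    start = 0
    while True:
        results = fetch_page(start)
        pagination = results.get('serpapi_pagination', {})
        current = pagination.get('current', 1)
        for result in results.get('organic_results', []):
            yield current, result
        if 'next' not in pagination:
            break
        # Same offset arithmetic as the loop above: page number times page size.
        start = current * page_size


# Stand-in data keyed by offset; not a real API response.
fake_pages = {
    0: {'serpapi_pagination': {'current': 1, 'next': 'placeholder'},
        'organic_results': [{'title': 'Coca-Cola',
                             'link': 'https://www.coca-cola.com/'}]},
    10: {'serpapi_pagination': {'current': 2},
         'organic_results': [{'title': 'Coca-Cola (@CocaCola) | Twitter',
                              'link': 'https://twitter.com/cocacola?lang=en'}]},
}

for page, result in iter_organic_results(fake_pages.__getitem__):
    print(f"Page {page}: {result['title']} -> {result['link']}")
```

With the real client, `fetch_page` would set `search.params_dict["start"]` and return `search.get_dict()`, as the loop above does.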

Disclaimer: I work for SerpApi.

- Dmitriy Zub

Content provided by Stack Overflow.

- ChrisGuest · Accepted Answer

This list comprehension will strip the tags.

>>> sSpan
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
>>> [em.replaceWithChildren() for em in sSpan.findAll('em')]
[None]
>>> sSpan
<span class="st">The Beautiful Soup Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
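
The session above uses the older `replaceWithChildren()` name. In Beautiful Soup 4 the same operation is available as `unwrap()`, and a plain `for` loop avoids building a throwaway list of return values. The sketch below assumes the `bs4` package is installed and reuses the span from the session as a literal string.

```python
from bs4 import BeautifulSoup

html = ('<span class="st">The <em>Beautiful Soup</em> Theater Collective was '
        'founded in the summer of 2010 by its Artistic Director, Steven Carl '
        'McCasland. A continuation of a student group he <b>...</b><br /></span>')

span = BeautifulSoup(html, 'html.parser').find('span', class_='st')

# unwrap() replaces each <em> tag with its own children, so the text stays.
for em in span.find_all('em'):
    em.unwrap()

print(span)
```

Other tags, such as the `<b>` around the ellipsis, are left in place; the same loop can be pointed at them if they should go too.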