Python - Download images from Google Image Search?


I want to download all the images from a Google image search using Python. The code I am using seems to have a problem sometimes. My code is:

import os
import sys
import time
from urllib import FancyURLopener
import urllib2
import simplejson

# Define search term
searchTerm = "parrot"

# Replace spaces ' ' in the search term with '%20' to comply with the request format
searchTerm = searchTerm.replace(' ','%20')


# Start FancyURLopener with defined version
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()

# Set count to 0
count = 0

for i in range(0, 10):
    # Notice that start changes on each iteration in order to request a new set of images per loop
    url = ('https://ajax.googleapis.com/ajax/services/search/images?' +
           'v=1.0&q=' + searchTerm + '&start=' + str(i*10) + '&userip=MyIP')
    print url
    request = urllib2.Request(url, None, {'Referer': 'testing'})
    response = urllib2.urlopen(request)

    # Get results using JSON
    results = simplejson.load(response)
    data = results['responseData']
    dataInfo = data['results']

    # Iterate over each result and get the unescaped url
    for myUrl in dataInfo:
        count = count + 1
        my_url = myUrl['unescapedUrl']
        myopener.retrieve(myUrl['unescapedUrl'], str(count) + '.jpg')

After downloading a few pages, I get the following error:

Traceback (most recent call last):

  File "C:\Python27\img_google3.py", line 37, in <module>
    dataInfo = data['results']
TypeError: 'NoneType' object has no attribute '__getitem__'

What should I do?
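
For reference, the traceback means results['responseData'] is None: when the old AJAX API fails (rate limiting, or its later shutdown), the JSON still parses but carries an error payload instead of results. A minimal guard before indexing, as a sketch (the error fields match the payload quoted in the comments further down the page):

# inside the for-i loop of the question's code:
results = simplejson.load(response)
data = results.get('responseData')
if data is None:
    # error payload, e.g. {'responseDetails': 'This API is no longer available.', 'responseStatus': 403}
    print results.get('responseDetails'), results.get('responseStatus')
    break
dataInfo = data['results']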


A) Post your code. B) Use Google's Image Search API to do it. - brandonscript
https://dev59.com/lX7aa4cB1Zd3GeqPpWFz#22871658 - Omid Raha
https://github.com/hardikvasa/google-images-download - hnvasa
14 Answers


I have modified my code. It now downloads 100 images for a given query, and they are the full-resolution original images.

I am using urllib2 and Beautiful Soup to download the images.

from bs4 import BeautifulSoup
import requests
import re
import urllib2
import os
import cookielib
import json

def get_soup(url,header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url,headers=header)),'html.parser')


query = raw_input("query image")  # you can change the query for the image here
image_type = "ActiOn"
query = '+'.join(query.split())
url = "https://www.google.co.in/search?q=" + query + "&source=lnms&tbm=isch"
print url

# add the directory for your images here
DIR = "Pictures"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
soup = get_soup(url, header)

# collect the links to the large original images, together with the image type
ActualImages = []
for a in soup.find_all("div", {"class": "rg_meta"}):
    meta = json.loads(a.text)  # each rg_meta div holds a JSON blob; "ou" is the original url, "ity" the image type
    link, Type = meta["ou"], meta["ity"]
    ActualImages.append((link, Type))

print "there are a total of", len(ActualImages), "images"

if not os.path.exists(DIR):
    os.mkdir(DIR)
DIR = os.path.join(DIR, query.split('+')[0])  # name the subdirectory after the first word of the query

if not os.path.exists(DIR):
    os.mkdir(DIR)

# download the images
for i, (img, Type) in enumerate(ActualImages):
    try:
        req = urllib2.Request(img, headers=header)  # pass the header dict directly, not nested under 'User-Agent'
        raw_img = urllib2.urlopen(req).read()

        cntr = len([fname for fname in os.listdir(DIR) if image_type in fname]) + 1
        print cntr
        if len(Type) == 0:
            f = open(os.path.join(DIR, image_type + "_" + str(cntr) + ".jpg"), 'wb')
        else:
            f = open(os.path.join(DIR, image_type + "_" + str(cntr) + "." + Type), 'wb')

        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : " + img
        print e

I hope this helps you.


Yes @d2a2d, that is possible; just replace that line with: for i, (img, Type) in enumerate(ActualImages[:5]): — you then iterate over only 5 elements of the ActualImages list. - rishabhr0y
What exactly does the line link, Type = json.loads(a.text)["ou"], json.loads(a.text)["ity"] do? When I try to run it in a Jupyter notebook I get the error JSONDecodeError: Expecting value: line 1 column 1 (char 0). - Moondra
Hey @rishabhr0y, I solved it back then. Thanks for the reply. I think it worked. I have upvoted the answer. Thanks again! - PallavBakshi
No problem at all. - Akash Kandpal
This no longer works (for me). I think a headless browser with JavaScript is needed now. I wrote a new script, google-images.py, in Python using Selenium and headless Chrome: http://sam.aiki.info/b/google-images.py - Sam Watkins


The Google Image Search API has been deprecated. To achieve what you want, you need to use Google Custom Search. To fetch the images you need to do this:

import urllib
import urllib2
import simplejson
import cStringIO
from PIL import Image  # needed for Image.open below

fetcher = urllib2.build_opener()
searchTerm = 'parrot'
startIndex = 0
searchUrl = "http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=" + searchTerm + "&start=" + str(startIndex)  # start must be converted to a string
f = fetcher.open(searchUrl)
deserialized_output = simplejson.load(f)

This gives you 4 results as JSON; to fetch more results, iterate while incrementing startIndex in the API request.
To read the images, you will need a library like cStringIO.
For example, to access the first image, you would do the following:
imageUrl = deserialized_output['responseData']['results'][0]['unescapedUrl']
file = cStringIO.StringIO(urllib.urlopen(imageUrl).read())
img = Image.open(file)
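
A paging loop over startIndex, as described above, might look like the following sketch (of historical interest only — as the comments below note, this endpoint now returns 403; the loop bound of 64 is illustrative):

all_results = []
for startIndex in range(0, 64, 4):  # the old API returned 4 results per page
    searchUrl = ("http://ajax.googleapis.com/ajax/services/search/images?v=1.0&q="
                 + searchTerm + "&start=" + str(startIndex))
    page = simplejson.load(fetcher.open(searchUrl))
    data = page.get('responseData')
    if not data:  # error payload instead of results, e.g. responseStatus 403
        break
    all_results.extend(data['results'])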

You want to use http://ajax.googleapis.com/ajax/services/search/images, not http://ajax.googleapis.com/ajax/services/search/web. - Seba Kerckhof
@SebaK: Sorry, I misread your comment. Corrected now. Thanks :) - jobin
As it stands, the ajax.googleapis.com/ajax/services/search/images address now returns a 403: the API is no longer available. - Dan Percival
I hit the same problem: printing the deserialized output gives {'responseData': None, 'responseDetails': 'This API is no longer available.', 'responseStatus': 403}. - Mostafa
The API is no longer available. - K-Dawg


Google deprecated their API and scraping Google is complicated, so I would suggest using the Bing API instead to download images automatically. The pip package bing-image-downloader makes it easy to download any number of images to a directory with a single line of code.

from bing_image_downloader import downloader

# query_string is your search term, e.g. "parrot"
downloader.download(query_string, limit=100, output_dir='dataset', adult_filter_off=True, force_replace=False, timeout=60, verbose=True)

Google isn't that good, and Microsoft isn't that evil.


Even the Bing API seems to be deprecated now. The page shows the following message: "DataMarket and Data Services are being retired and will stop accepting new orders after 12/31/2016. Existing subscriptions will be retired and cancelled starting 3/31/2017. If you would like to continue the service, please contact your service provider for options." - atif93
As of May 2022, this works fine (with Python 3.8 and bing-image-downloader==1.1.2). - mdev


Here is my latest Google image scraper, written in Python, using Selenium and headless Chrome.

It requires python-selenium, chromium-driver, and a pip module called retry to be installed.

Link: http://sam.aiki.info/b/google-images.py

Example usage:

google-images.py tiger 10 --opts isz:lt,islt:svga,itp:photo > urls.txt
parallel=5
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
(i=0; while read url; do wget -e robots=off -T10 --tries 10 -U"$user_agent" "$url" -O`printf %04d $i`.jpg & i=$(($i+1)) ; [ $(($i % $parallel)) = 0 ] && wait; done < urls.txt; wait)
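
If you would rather stay in Python than shell out to wget, a rough equivalent of the parallel download loop might look like this sketch (it reuses the same User-Agent string and zero-padded file naming as the shell example above):

#!/usr/bin/env python3
# download the URLs produced by google-images.py, a few at a time
import concurrent.futures
import urllib.request

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

def fetch(numbered_url):
    i, url = numbered_url
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        data = urllib.request.urlopen(req, timeout=10).read()
        with open("%04d.jpg" % i, "wb") as f:
            f.write(data)
    except Exception as e:
        print("failed:", url, e)

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# five parallel downloads, like parallel=5 in the shell version
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    list(pool.map(fetch, enumerate(urls)))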

Help output:

$ google-images.py --help
usage: google-images.py [-h] [--safe SAFE] [--opts OPTS] query n

Fetch image URLs from Google Image Search.

positional arguments:
  query        image search query
  n            number of images (approx)

optional arguments:
  -h, --help   show this help message and exit
  --safe SAFE  safe search [off|active|images]
  --opts OPTS  search options, e.g. isz:lt,islt:svga,itp:photo,ic:color,ift:jpg

Code:

#!/usr/bin/env python3

# requires: selenium, chromium-driver, retry

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import selenium.common.exceptions as sel_ex
import sys
import time
import urllib.parse
from retry import retry
import argparse
import logging

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger()
retry_logger = None

css_thumbnail = "img.Q4LuWd"
css_large = "img.n3VNCb"
css_load_more = ".mye4qd"
selenium_exceptions = (sel_ex.ElementClickInterceptedException, sel_ex.ElementNotInteractableException, sel_ex.StaleElementReferenceException)

def scroll_to_end(wd):
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")

@retry(exceptions=KeyError, tries=6, delay=0.1, backoff=2, logger=retry_logger)
def get_thumbnails(wd, want_more_than=0):
    wd.execute_script("document.querySelector('{}').click();".format(css_load_more))
    thumbnails = wd.find_elements_by_css_selector(css_thumbnail)
    n_results = len(thumbnails)
    if n_results <= want_more_than:
        raise KeyError("no new thumbnails")
    return thumbnails

@retry(exceptions=KeyError, tries=6, delay=0.1, backoff=2, logger=retry_logger)
def get_image_src(wd):
    actual_images = wd.find_elements_by_css_selector(css_large)
    sources = []
    for img in actual_images:
        src = img.get_attribute("src")
        if src.startswith("http") and not src.startswith("https://encrypted-tbn0.gstatic.com/"):
            sources.append(src)
    if not len(sources):
        raise KeyError("no large image")
    return sources

@retry(exceptions=selenium_exceptions, tries=6, delay=0.1, backoff=2, logger=retry_logger)
def retry_click(el):
    el.click()

def get_images(wd, start=0, n=20, out=None):
    thumbnails = []
    count = len(thumbnails)
    while count < n:
        scroll_to_end(wd)
        try:
            thumbnails = get_thumbnails(wd, want_more_than=count)
        except KeyError as e:
            logger.warning("cannot load enough thumbnails")
            break
        count = len(thumbnails)
    sources = []
    for tn in thumbnails:
        try:
            retry_click(tn)
        except selenium_exceptions as e:
            logger.warning("main image click failed")
            continue
        sources1 = []
        try:
            sources1 = get_image_src(wd)
        except KeyError as e:
            pass
            # logger.warning("main image not found")
        if not sources1:
            tn_src = tn.get_attribute("src")
            if not tn_src.startswith("data"):
                logger.warning("no src found for main image, using thumbnail")          
                sources1 = [tn_src]
            else:
                logger.warning("no src found for main image, thumbnail is a data URL")
        for src in sources1:
            if not src in sources:
                sources.append(src)
                if out:
                    print(src, file=out)
                    out.flush()
        if len(sources) >= n:
            break
    return sources

def google_image_search(wd, query, safe="off", n=20, opts='', out=None):
    search_url_t = "https://www.google.com/search?safe={safe}&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img&tbs={opts}"
    search_url = search_url_t.format(q=urllib.parse.quote(query), opts=urllib.parse.quote(opts), safe=safe)
    wd.get(search_url)
    sources = get_images(wd, n=n, out=out)
    return sources

def main():
    parser = argparse.ArgumentParser(description='Fetch image URLs from Google Image Search.')
    parser.add_argument('--safe', type=str, default="off", help='safe search [off|active|images]')
    parser.add_argument('--opts', type=str, default="", help='search options, e.g. isz:lt,islt:svga,itp:photo,ic:color,ift:jpg')
    parser.add_argument('query', type=str, help='image search query')
    parser.add_argument('n', type=int, default=20, help='number of images (approx)')
    args = parser.parse_args()

    opts = Options()
    opts.add_argument("--headless")
    # opts.add_argument("--blink-settings=imagesEnabled=false")
    with webdriver.Chrome(options=opts) as wd:
        sources = google_image_search(wd, args.query, safe=args.safe, n=args.n, opts=args.opts, out=sys.stdout)

main()


I haven't looked at your code, but here is an example solution built with Selenium that tries to fetch 400 images for a search term.

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import json
import os
import urllib2

searchterm = 'vannmelon' # will also be the name of the folder
url = "https://www.google.co.in/search?q="+searchterm+"&source=lnms&tbm=isch"
browser = webdriver.Firefox()
browser.get(url)
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
counter = 0
succounter = 0

if not os.path.exists(searchterm):
    os.mkdir(searchterm)

# scroll down several times so that more thumbnails get loaded
for _ in range(500):
    browser.execute_script("window.scrollBy(0,10000)")

for x in browser.find_elements_by_xpath("//div[@class='rg_meta']"):
    counter = counter + 1
    meta = json.loads(x.get_attribute('innerHTML'))  # parse the metadata once
    img = meta["ou"]
    imgtype = meta["ity"]
    print "Total Count:", counter
    print "Successful Count:", succounter
    print "URL:", img

    try:
        req = urllib2.Request(img, headers=header)  # pass the header dict directly, not nested under 'User-Agent'
        raw_img = urllib2.urlopen(req).read()
        File = open(os.path.join(searchterm, searchterm + "_" + str(counter) + "." + imgtype), "wb")
        File.write(raw_img)
        File.close()
        succounter = succounter + 1
    except Exception as e:
        print "can't get img:", e

print succounter, "pictures successfully downloaded"
browser.close()

It says no images were downloaded. Did Google shut this down? - Beyhan Gul


Building on Piees's answer, to download any number of images from the search results, we need to simulate a click on the 'Show more results' button once the first 400 results have loaded.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import json
import urllib2
import sys
import time

# adding path to geckodriver to the OS environment variable
# assuming that it is stored at the same path as this script
os.environ["PATH"] += os.pathsep + os.getcwd()
download_path = "dataset/"

def main():
    searchtext = sys.argv[1] # the search query
    num_requested = int(sys.argv[2]) # number of images to download
    number_of_scrolls = num_requested / 400 + 1 
    # number_of_scrolls * 400 images will be opened in the browser

    if not os.path.exists(download_path + searchtext.replace(" ", "_")):
        os.makedirs(download_path + searchtext.replace(" ", "_"))

    url = "https://www.google.co.in/search?q="+searchtext+"&source=lnms&tbm=isch"
    driver = webdriver.Firefox()
    driver.get(url)

    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    extensions = {"jpg", "jpeg", "png", "gif"}
    img_count = 0
    downloaded_img_count = 0

    for _ in xrange(number_of_scrolls):
        for __ in xrange(10):
            # multiple scrolls needed to show all 400 images
            driver.execute_script("window.scrollBy(0, 1000000)")
            time.sleep(0.2)
        # to load next 400 images
        time.sleep(0.5)
        try:
            driver.find_element_by_xpath("//input[@value='Show more results']").click()
        except Exception as e:
            print "Less images found:", e
            break

    # imges = driver.find_elements_by_xpath('//div[@class="rg_meta"]') # not working anymore
    imges = driver.find_elements_by_xpath('//div[contains(@class,"rg_meta")]')
    print "Total images:", len(imges), "\n"
    for img in imges:
        img_count += 1
        img_url = json.loads(img.get_attribute('innerHTML'))["ou"]
        img_type = json.loads(img.get_attribute('innerHTML'))["ity"]
        print "Downloading image", img_count, ": ", img_url
        try:
            if img_type not in extensions:
                img_type = "jpg"
            req = urllib2.Request(img_url, headers=headers)
            raw_img = urllib2.urlopen(req).read()
            f = open(download_path+searchtext.replace(" ", "_")+"/"+str(downloaded_img_count)+"."+img_type, "wb")
            f.write(raw_img)
            f.close()  # close() must be called, not merely referenced
            downloaded_img_count += 1
        except Exception as e:
            print "Download failed:", e
        finally:
            print
        if downloaded_img_count >= num_requested:
            break

    print "Total downloaded: ", downloaded_img_count, "/", img_count
    driver.quit()

if __name__ == "__main__":
    main()

The full code is here.


This code still ran fine a week ago, but now it cannot find any images. Could you update it? - Beyhan Gul
Could you tell me what error or problem you are running into? - atif93
@atif93 your code runs, but it throws an exception saying too few images were found... what does that mean? - Ali Yar Khan
Sometimes Google image search has fewer results for a query than the <number_of_images> you asked for; in that case it reports that fewer images were found. - atif93


The following works on Windows 10 with Python 3.9.7:

pip install bing-image-downloader

The code below downloads 10 images of India from the Bing search engine into the desired output folder:

from bing_image_downloader import downloader
downloader.download('India', limit=10,  output_dir='dataset', adult_filter_off=True, force_replace=False, timeout=60, verbose=True)

Documentation: https://pypi.org/project/bing-image-downloader/

This code still works in 2023, awesome. - undefined

You can also use Selenium with Python. Here is how:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import urllib.request

driver = webdriver.Firefox()
word="apple"
url="http://images.google.com/search?q="+word+"&tbm=isch&sout=1"
driver.get(url)
imageXpathSelector='/html/body/div[2]/c-wiz/div[3]/div[1]/div/div/div/div/div[1]/div[1]/span/div[1]/div[1]/div[1]/a[1]/div[1]/img'
img=driver.find_element(By.XPATH,imageXpathSelector)

src=(img.get_attribute('src'))
urllib.request.urlretrieve(src, word+".jpg")
driver.close()

(This code works on Python 3.8.) Note that you need to install the Selenium package with 'pip install selenium' for it to run.
Unlike other web-scraping techniques, Selenium opens a real browser and downloads the content through it, since Selenium is mainly intended for testing rather than scraping.
Note: if imageXpathSelector does not work, press F12 while the browser is open, right-click the image, choose 'Copy' from the menu that opens, then choose 'Copy XPath'. That will be the correct XPath location of the element you need.
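
If the hard-coded XPath keeps breaking, an explicit wait on a thumbnail selector can be more robust. A sketch (img.Q4LuWd is the thumbnail selector used by the Selenium script earlier on this page; Google changes these class names over time, so treat it as a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import urllib.request

driver = webdriver.Firefox()
driver.get("http://images.google.com/search?q=apple&tbm=isch")
# wait up to 10 seconds for the first thumbnail to appear
img = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "img.Q4LuWd")))
src = img.get_attribute('src')
if src.startswith('http'):  # thumbnails are sometimes inline data: URLs
    urllib.request.urlretrieve(src, "apple.jpg")
driver.close()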


Like the other snippets here, this code is outdated and no longer works for me. Inspired by one of the solutions above, the following downloads 100 images per keyword.

from bs4 import BeautifulSoup
import urllib2
import os


class GoogleImageDownloader(object):
    _URL = "https://www.google.co.in/search?q={}&source=lnms&tbm=isch"
    _BASE_DIR = 'GoogleImages'
    _HEADERS = {
        'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"
    }

    def __init__(self):
        query = raw_input("Enter keyword to search images\n")
        self.dir_name = os.path.join(self._BASE_DIR, query.split()[0])
        self.url = self._URL.format(urllib2.quote(query)) 
        self.make_dir_for_downloads()
        self.initiate_downloads()

    def make_dir_for_downloads(self):
        print "Creating necessary directories"
        if not os.path.exists(self._BASE_DIR):
            os.mkdir(self._BASE_DIR)

        if not os.path.exists(self.dir_name):
            os.mkdir(self.dir_name)

    def initiate_downloads(self):
        src_list = []
        soup = BeautifulSoup(urllib2.urlopen(urllib2.Request(self.url,headers=self._HEADERS)),'html.parser')
        for img in soup.find_all('img'):
            if img.has_attr("data-src"):
                src_list.append(img['data-src'])
        print "{} of images collected for downloads".format(len(src_list))
        self.save_images(src_list)

    def save_images(self, src_list):
        print "Saving Images..."
        for i , src in enumerate(src_list):
            try:
                req = urllib2.Request(src, headers=self._HEADERS)
                raw_img = urllib2.urlopen(req).read()
                with open(os.path.join(self.dir_name , str(i)+".jpg"), 'wb') as f:
                    f.write(raw_img)
            except Exception as e:
                print ("could not save image")
                raise e


if __name__ == "__main__":
    GoogleImageDownloader()

I use:

https://github.com/hellock/icrawler

This package is a mini framework of web crawlers. With its modular design, it is easy to use and extend. It supports media data like images and videos very well, and can also be applied to texts and other types of files. Scrapy is heavy and powerful, while icrawler is tiny and flexible.

# excerpt from icrawler's test script for the built-in crawlers; the test_*
# functions (test_google, test_bing, ...) are defined elsewhere in that script
from argparse import ArgumentParser

def main():
    parser = ArgumentParser(description='Test built-in crawlers')
    parser.add_argument(
        '--crawler',
        nargs='+',
        default=['google', 'bing', 'baidu', 'flickr', 'greedy', 'urllist'],
        help='which crawlers to test')
    args = parser.parse_args()
    for crawler in args.crawler:
        eval('test_{}()'.format(crawler))
        print('\n')
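
For plainly downloading images (rather than testing every crawler), the usual entry point is one of the built-in crawlers. Per the project's README, something along these lines:

from icrawler.builtin import GoogleImageCrawler

# crawl 100 'parrot' images into the images/ directory
google_crawler = GoogleImageCrawler(storage={'root_dir': 'images'})
google_crawler.crawl(keyword='parrot', max_num=100)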
