Bulk Download of Google Images with Tags

Question

python image batch-processing google-custom-search google-image-search

4

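I'm trying to find an efficient and replicable way to bulk download full-size image files from Google Image Search. Others have asked similar questions, but I haven't found anything that is exactly what I'm looking for, or that I can understand.

Most answers refer to the deprecated Google Image Search API, or to the Google Custom Search API, which does not appear to work for the entire web, or they only cover downloading an image from a single URL.

I imagine this might be a two-step process: first pull all the image URLs from the search, then bulk download from those URLs?

I should add that I am a beginner (which is probably obvious; sorry). So if anyone could explain and point me in the right direction, that would be greatly appreciated.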
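The second of those two steps can be sketched with the standard library alone. This is only a minimal sketch, assuming the list of image URLs has already been collected somehow; the names `fetch_bytes`, `local_name`, and `download_all` are made up for illustration and do not come from any library.

```python
import os
import urllib.request
from urllib.parse import urlparse


def fetch_bytes(url, timeout=10):
    """Return the raw body of `url` (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def local_name(url, index):
    """Build a file name from the URL path, prefixed with its position."""
    base = os.path.basename(urlparse(url).path) or "image"
    return "%04d_%s" % (index, base)


def download_all(urls, dest_dir, fetch=fetch_bytes):
    """Save every URL in `urls` under `dest_dir`; return (saved, failed)."""
    os.makedirs(dest_dir, exist_ok=True)
    saved, failed = [], []
    for index, url in enumerate(urls):
        try:
            data = fetch(url)
        except OSError as error:  # urllib's URLError and timeouts are OSError subclasses
            failed.append((url, str(error)))
            continue
        target = os.path.join(dest_dir, local_name(url, index))
        with open(target, "wb") as handle:
            handle.write(data)
        saved.append(target)
    return saved, failed
```

The position prefix keeps two URLs that end in the same file name from overwriting each other, and the `fetch` argument lets you swap in a different downloader (for example one that adds headers or a delay) without touching the loop.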

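I have also looked into freeware options, but those seem unreliable as well, unless anyone knows of a dependable one.

Download images from Google image search (Python)

In Python, is there a way I can download all/some of the image files (e.g. JPG/PNG) from a **Google Images** search result?

As for the tags: does anyone know anything about them, and whether they exist somewhere or are associated with the images? https://en.wikipedia.org/wiki/Google_Image_Labeler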

import json
import os
import time
from io import BytesIO

import requests
from PIL import Image
from requests.exceptions import ConnectionError


def go(query, path):
    """Download full size images from Google image search.

    Don't print or republish images without permission.
    I used this to train a learning algorithm.
    """
    # NOTE: this is the endpoint of the deprecated Google Image Search API
    # mentioned above; it may no longer respond.
    BASE_URL = 'https://ajax.googleapis.com/ajax/services/search/images'

    BASE_PATH = os.path.join(path, query)
    os.makedirs(BASE_PATH, exist_ok=True)

    start = 0  # Google's start query string parameter for pagination.
    while start < 60:  # Google will only return a max of 56 results.
        # Passing params lets requests URL-encode the query string.
        r = requests.get(BASE_URL, params={'v': '1.0', 'q': query, 'start': start})
        for image_info in json.loads(r.text)['responseData']['results']:
            url = image_info['unescapedUrl']
            try:
                image_r = requests.get(url)
            except ConnectionError:
                print('could not download %s' % url)
                continue

            # Remove file-system path characters from name.
            title = image_info['titleNoFormatting'].replace('/', '').replace('\\', '')

            file_path = os.path.join(BASE_PATH, '%s.jpg' % title)
            try:
                # Convert to RGB first: JPEG cannot store palette or alpha modes.
                Image.open(BytesIO(image_r.content)).convert('RGB').save(file_path, 'JPEG')
            except IOError:
                # Throw away some gifs...blegh.
                print('could not save %s' % url)
                continue

        print(start)
        start += 4  # 4 images per page.

        # Be nice to Google and they'll be nice back :)
        time.sleep(1.5)


# Example use
go('landscape', 'myDirectory')

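Update

I was able to create a custom search that covers the complete web, as specified here, and successfully ran it to get the image links. However, as also mentioned in the previous post, the results do not fully align with normal Google Images results.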
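For anyone following the same route, fetching image links from a custom search engine might look roughly like the sketch below. It assumes the Custom Search JSON API endpoint and its `key`, `cx`, `q`, `searchType`, `start`, and `num` parameters as I understand them, so please check them against the official documentation. The function names are invented for this example, and the API key and engine ID are placeholders you have to supply yourself.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed endpoint of the Custom Search JSON API; verify before relying on it.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id, start=1, count=10):
    """Build one image-search request URL (result positions are 1-indexed)."""
    params = {
        "key": api_key,        # placeholder: your own API key
        "cx": engine_id,       # placeholder: your custom search engine ID
        "q": query,
        "searchType": "image",
        "start": start,
        "num": count,          # assumed limit: at most 10 results per request
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_links(response_text):
    """Return (image URL, title) pairs from one JSON response body."""
    payload = json.loads(response_text)
    return [(item["link"], item.get("title", "")) for item in payload.get("items", [])]


def search_image_links(query, api_key, engine_id, pages=3):
    """Request several result pages in turn and collect all image links."""
    links = []
    for page in range(pages):
        url = build_search_url(query, api_key, engine_id, start=1 + page * 10)
        with urllib.request.urlopen(url, timeout=10) as response:
            links.extend(extract_links(response.read().decode("utf-8")))
    return links
```

Keeping the URL building and the JSON parsing in separate functions makes it possible to check both without spending API quota.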

- Nicole Wilson

1

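This seems to be a question about Python, not batch files. I'll update the tags for you, but I suggest reading the info pages of the tags you use. - Dennis van Gils

Thanks @DennisvanGils - Nicole Wilson

If you're wondering why the results of your own app's search differ from a regular Google Images search, it's because Google alters the results based on your cookies and the like, and your app doesn't have those. - Dennis van Gils

@DennisvanGils It was more of a note on the update. But thanks, I figured it was something like that. As noted, the main thing I need is to be able to efficiently download the image from each image link and, as far as possible, obtain the associated alt tags. - Nicole Wilson

Not sure whether you are still trying to make this work. However, Google cares not only about your cookies but also about your user agent string. Scraping Google is also non-trivial, because they consider it a violation of their terms and conditions and will quickly block you if they discover you are scraping. - jsfan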
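On the alt tags mentioned in the comments above: alt text lives in the HTML of the page that embeds an image, not in the image file itself, so it has to be read from that page. A minimal sketch using only the standard library follows. It assumes you already have the HTML of the hosting page as a string, and the names `ImageAltCollector` and `extract_image_alts` are made up for this example.

```python
from html.parser import HTMLParser


class ImageAltCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:
            # A missing or valueless alt attribute becomes an empty string.
            self.images.append((src, attributes.get("alt") or ""))


def extract_image_alts(html_text):
    """Return a list of (src, alt) tuples found in `html_text`."""
    collector = ImageAltCollector()
    collector.feed(html_text)
    return collector.images
```

Images without a `src` attribute are skipped, since there would be nothing to download for them.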

1 Answer

- Rafael Ribeiro · Accepted Answer

Try the ImageSoup module. Installing it is simple:

pip install imagesoup

Sample code:

>>> from imagesoup import ImageSoup
>>>
>>> soup = ImageSoup()
>>> images_wanted = 50
>>> query = 'landscape'
>>> images = soup.search(query, n_images=images_wanted)

Now you have a list of 50 landscape images from Google Images. Let's play with the first one:

>>> im = images[0]
>>> im.URL
https://static.pexels.com/photos/279315/pexels-photo-279315.jpeg
>>> im.size
(2600, 1300)
>>> im.mode
RGB
>>> im.dpi
(300, 300)
>>> im.color_count
493230
>>> # Let's check the main 4 colors in the image. We use
>>> # reduce_size=True to speed up the process.
>>> im.main_color(reduce_size=True, n=4)
[('black', 0.2244), ('darkslategrey', 0.1057), ('darkolivegreen', 0.0761), ('dodgerblue', 0.0531)]
>>> # Let's take a look at our image.
>>> im.show()

>>> # Nice image! Let's save it.
>>> im.to_file('landscape.jpg')

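The number of images returned by each search may vary. It is usually fewer than 900. If you want to get all of the images, set n_images to 1000.

To contribute or report bugs, see the GitHub repository: https://github.com/rafpyprog/ImageSoup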