Scrapy - building a web crawler to scrape GIFs


I want to write a web crawler that downloads GIF images, and I am using Scrapy.

After watching some YouTube tutorials, I managed to write a spider that downloads images from imgur, but it cannot download GIFs. So I searched for the problem and found https://github.com/scrapy/scrapy/issues/211, where someone wrote a new image pipeline to save GIFs. But when I try to run the code, I get an error.

My items.py code is as follows:

import scrapy


class ImgurItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py

BOT_NAME = 'Imgur'

SPIDER_MODULES = ['imgur.spiders']
NEWSPIDER_MODULE = 'imgur.spiders'
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = r'E:\Results'  # raw string so the backslash is not treated as an escape

and my spider code:
import scrapy

from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from imgur.items import ImgurItem

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['imgur.com']
    start_urls = ['http://imgur.com/gallery/hpqyKrW']
    rules = [Rule(LinkExtractor(allow=['/gallery/.*']), 'parse_imgur')]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath(
            "//h2[@id='image-title']/text()").extract()
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = ['http:'+rel[0]]
        return image

And pipelines.py (the code from the link above):

import os

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.utils.misc import md5sum  # used in image_downloaded below

class ImgurPipeline(object):
    def check_gif(self, image):
        if image.format == 'GIF':
            return True
        # The library reads GIF87a and GIF89a versions of the GIF file format.
        return image.info.get('version') in ['GIF89a', 'GIF87a']

    def persist_gif(self, key, data, info):
        root, ext = os.path.splitext(key)
        key = key + '.gif'
        absolute_path = self.store._get_filesystem_path(key)
        self.store._mkdir(os.path.dirname(absolute_path), info)
        with open(absolute_path, 'wb') as f:   # use 'b' to write binary data.
            f.write(data)

    def image_downloaded(self, response, request, info):
        checksum = None
        for key, image, buf in self.get_images(response, request, info):
            if checksum is None:
                buf.seek(0)
                checksum = md5sum(buf)
            if key.startswith('full') and self.check_gif(image):
                # Save gif from response directly.
                self.persist_gif(key, response.body, info)
            else:
                self.store.persist_image(key, image, buf, info)
        return checksum
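For reference, the 'GIF87a' and 'GIF89a' strings tested in check_gif are simply the file's first six bytes (its magic number), so an equivalent check can be sketched on the raw payload without going through PIL. looks_like_gif below is a hypothetical helper for illustration, not part of the pipeline above:

```python
import base64


def looks_like_gif(data):
    # Every GIF file starts with the magic bytes 'GIF87a' or 'GIF89a'.
    return data[:6] in (b'GIF87a', b'GIF89a')


# The 1x1 transparent GIF that appears base64-encoded in the log below:
placeholder = base64.b64decode(
    'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7')
```

Here looks_like_gif(placeholder) is True, while a PNG payload (starting with b'\x89PNG') would be rejected.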

The error I get now is related to a connection problem, but I don't know how to deal with it:

2015-05-03 17:43:31+0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: Imgur)
2015-05-03 17:43:31+0500 [scrapy] INFO: Optional features available: ssl, http11
2015-05-03 17:43:31+0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'imgur.spiders', 'SPIDER_MODULES': ['imgur.spiders'], 'BOT_NAME': 'Imgur'}
2015-05-03 17:43:32+0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-03 17:43:32+0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-03 17:43:32+0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-03 17:43:32+0500 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2015-05-03 17:43:32+0500 [imgur] INFO: Spider opened
2015-05-03 17:43:32+0500 [imgur] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-03 17:43:32+0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-03 17:43:32+0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-03 17:43:33+0500 [imgur] DEBUG: Crawled (200) <GET http://imgur.com/gallery/hpqyKrW> (referer: None)
2015-05-03 17:43:33+0500 [imgur] DEBUG: Crawled (200) <GET http://imgur.com/gallery/hpqyKrW> (referer: http://imgur.com/gallery/hpqyKrW)
2015-05-03 17:43:33+0500 [imgur] DEBUG: Retrying <GET http:data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7> (failed 1 times): An error occurred while connecting: 10049: The requested address is not valid in its context..
2015-05-03 17:43:33+0500 [imgur] DEBUG: Retrying <GET http:data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7> (failed 2 times): An error occurred while connecting: 10049: The requested address is not valid in its context..
2015-05-03 17:43:33+0500 [imgur] DEBUG: Gave up retrying <GET http:data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7> (failed 3 times): An error occurred while connecting: 10049: The requested address is not valid in its context..
2015-05-03 17:43:33+0500 [imgur] WARNING: File (unknown-error): Error downloading image from <GET http:data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7> referred in <None>: An error occurred while connecting: 10049: The requested address is not valid in its context..
2015-05-03 17:43:33+0500 [imgur] DEBUG: Scraped from <200 http://imgur.com/gallery/hpqyKrW>
        {'image_urls': [u'http:data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'],
         'images': [],
         'title': []}
2015-05-03 17:43:33+0500 [imgur] INFO: Closing spider (finished)
2015-05-03 17:43:33+0500 [imgur] INFO: Dumping Scrapy stats:
        {'downloader/exception_count': 3,
         'downloader/exception_type_count/twisted.internet.error.ConnectError': 3,
         'downloader/request_bytes': 1329,
         'downloader/request_count': 5,
         'downloader/request_method_count/GET': 5,
         'downloader/response_bytes': 29755,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 5, 3, 12, 43, 33, 951000),
         'item_scraped_count': 1,
         'log_count/DEBUG': 8,
         'log_count/INFO': 7,
         'log_count/WARNING': 1,
         'request_depth_max': 1,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2015, 5, 3, 12, 43, 32, 424000)}
2015-05-03 17:43:33+0500 [imgur] INFO: Spider closed (finished)

How should I deal with this error:

10049: The requested address is not valid in its context

Any help, new ideas, or anything that lets me crawl some GIFs would be greatly appreciated.

1 Answer


There are two important things you are doing wrong:

  • Your pipeline class should inherit from ImagesPipeline:

    class ImgurPipeline(ImagesPipeline):
    
  • You should enable your own pipeline in the settings, replacing:

    ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
    

    with:

    ITEM_PIPELINES = {'imgur.pipelines.ImgurPipeline': 1}
    

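Separately, the log above shows where the connection error itself comes from: the first //img/@src on the page is a base64 data: URI placeholder, and prefixing it with 'http:' produces the unconnectable address http:data:image/gif;base64,... that triggers error 10049. One way to guard against this in the spider is to keep only src values that resolve to real http(s) URLs. This is a standard-library sketch, and filter_image_urls is a hypothetical helper, not something the fixes above require:

```python
try:
    from urlparse import urlparse  # Python 2, which Scrapy 0.24 runs on
except ImportError:
    from urllib.parse import urlparse  # Python 3


def filter_image_urls(raw_srcs):
    """Keep only src values that can become real http(s) URLs.

    Protocol-relative values like //i.imgur.com/foo.gif get an
    http: prefix; data: URIs and other schemes are dropped.
    """
    urls = []
    for src in raw_srcs:
        if src.startswith('//'):
            src = 'http:' + src
        if urlparse(src).scheme in ('http', 'https'):
            urls.append(src)
    return urls
```

With that, image['image_urls'] = filter_image_urls(response.xpath("//img/@src").extract()) would skip the placeholder instead of feeding it to the images pipeline.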
Content provided by Stack Overflow; translated from the original English post.