Extracting images with Scrapy


I have read the other answers, but I am still missing some basics. I am trying to use a CrawlSpider to extract images from a website.

settings.py

BOT_NAME = 'healthycomm'

SPIDER_MODULES = ['healthycomm.spiders']
NEWSPIDER_MODULE = 'healthycomm.spiders'

ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = '~/Desktop/scrapy_nsml/healthycomm/images'

items.py

class HealthycommItem(scrapy.Item):
    page_heading = scrapy.Field()
    page_title = scrapy.Field()
    page_link = scrapy.Field()
    page_content = scrapy.Field()
    page_content_block = scrapy.Field()

    image_url = scrapy.Field()
    image = scrapy.Field()

HealthycommSpider.py

class HealthycommSpiderSpider(CrawlSpider):
    name = "healthycomm_spider"
    allowed_domains = ["healthycommunity.org.au"]
    start_urls = (
        'http://www.healthycommunity.org.au/',
    )
    rules = (Rule(SgmlLinkExtractor(allow=()), callback="parse_items", follow=False), ) 


    def parse_items(self, response):
        content = Selector(response=response).xpath('//body')
        for nodes in content:

            img_urls = nodes.xpath('//img/@src').extract()

            item = HealthycommItem()
            item['page_heading'] = nodes.xpath("//title").extract()
            item["page_title"] = nodes.xpath("//h1/text()").extract()
            item["page_link"] = response.url
            item["page_content"] = nodes.xpath('//div[@class="CategoryDescription"]').extract()
            item['image_url'] = img_urls 
            item['image'] = ['http://www.healthycommunity.org.au' + img for img in img_urls]

            yield item

I am not very familiar with Python in general, but I have a feeling I am missing something very basic here.

Thanks, Jamie


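I think you are missing a slash when appending the image path to the link. I think it should be http://www.healthycommunity.org.au/. - sundar nataraj
The paths come back root-relative, i.e. /path/path2/image.jpg - Jamie S
Please have a look at this link: https://dev59.com/k1_Va4cB1Zd3GeqPPQhW - sundar nataraj
It turns out I was missing an "s" in the items class: it should be image_urls, not image_url. How frustrating. - Jamie S
Did that solve your problem? - sundar nataraj
1 Answer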


If you want to use the standard ImagesPipeline, you need to change your parse_items method to something like this:

import urlparse  # Python 2 module; on Python 3 the same functions live in urllib.parse
...

    def parse_items(self, response):
        content = Selector(response=response).xpath('//body')
        for nodes in content:

            # build absolute URLs
            img_urls = [urlparse.urljoin(response.url, src)
                        for src in nodes.xpath('//img/@src').extract()]

            item = HealthycommItem()
            item['page_heading'] = nodes.xpath("//title").extract()
            item["page_title"] = nodes.xpath("//h1/text()").extract()
            item["page_link"] = response.url
            item["page_content"] = nodes.xpath('//div[@class="CategoryDescription"]').extract()

            # use "image_urls" instead of "image_url"
            item['image_urls'] = img_urls 

            yield item
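
As a quick check of what the URL joining does, here is a minimal standalone sketch. It assumes Python 3, where `urljoin` lives in `urllib.parse` rather than `urlparse`, and the page URL and image paths are made up for illustration. It shows why joining against `response.url` is safer than concatenating the site root, which only works for root-relative paths.

```python
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin

# Hypothetical page URL, standing in for response.url
page_url = "http://www.healthycommunity.org.au/some/page.html"

# Root-relative path, like the ones the question reports
root_relative = urljoin(page_url, "/path/path2/image.jpg")

# Path relative to the current page; plain string concatenation would break here
page_relative = urljoin(page_url, "img/logo.png")

# An already absolute URL is returned unchanged
already_absolute = urljoin(page_url, "http://cdn.example.com/pic.jpg")

print(root_relative)     # http://www.healthycommunity.org.au/path/path2/image.jpg
print(page_relative)     # http://www.healthycommunity.org.au/some/img/logo.png
print(already_absolute)  # http://cdn.example.com/pic.jpg
```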

Your item definition needs "images" and "image_urls" fields (plural, not singular).

An alternative is to set IMAGES_URLS_FIELD and IMAGES_RESULT_FIELD to match your item definition.
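
For that alternative, the settings might look like the sketch below. It assumes you keep the singular field names from the question's items.py (`image_url` and `image`). With this mapping the spider would have to store the absolute URLs in `item['image_url']` and leave `item['image']` unset, because the pipeline writes its download results into that field.

```python
# settings.py (sketch): point the pipeline at the question's singular field names
IMAGES_URLS_FIELD = 'image_url'   # item field the pipeline reads the URLs to download from
IMAGES_RESULT_FIELD = 'image'     # item field the pipeline fills with the download results
```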


Does urlparse.urljoin(response.url, src) take into account a <base> tag that may be present in the document? - sshine
@SimonShine, I don't think so, but the newer response.urljoin(src) does. See the implementation. - paul trmbrth
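
To illustrate the difference discussed in these comments, here is a rough standard-library sketch of `<base>`-aware URL resolution. It is not Scrapy's implementation. The helper names `BaseHrefFinder` and `resolve_with_base` and the example URLs are hypothetical, and it assumes Python 3.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> tag found in the document."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def resolve_with_base(page_url, html, src):
    """Resolve src against the document's <base href>, falling back to the page URL."""
    finder = BaseHrefFinder()
    finder.feed(html)
    # The base href may itself be relative, so resolve it against the page URL first
    base = urljoin(page_url, finder.base_href) if finder.base_href else page_url
    return urljoin(base, src)


page_url = "http://example.com/a/b.html"
html_with_base = '<html><head><base href="http://cdn.example.com/assets/"></head></html>'
html_without_base = "<html><head></head></html>"

print(resolve_with_base(page_url, html_with_base, "img/x.png"))
print(resolve_with_base(page_url, html_without_base, "img/x.png"))
```

In a spider, calling response.urljoin(src) as suggested above avoids the need for any such helper.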
