Scrapy fails to write data to a file

I created a spider in Scrapy. items.py:
from scrapy.item import Item, Field

class dns_shopItem(Item):
    # Define the fields for your item here like:
    # name = Field()
    id = Field()
    idd = Field()

dns_shop_spider.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector
from dns_shop.items import dns_shopItem

class dns_shopLoader(XPathItemLoader):
    default_output_processor = TakeFirst()

class dns_shopSpider(CrawlSpider):
    name = "dns_shop_spider"
    allowed_domains = ["www.playground.ru"]
    start_urls = ["http://www.playground.ru/files/stalker_clear_sky/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('/files/s_t_a_l_k_e_r_chistoe_nebo')), follow=True),
        Rule(SgmlLinkExtractor(allow=('/files/s_t_a_l_k_e_r_chistoe_nebo')), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = dns_shopLoader(dns_shopItem(), hxs)
        l.add_xpath('id', "/html/body/table[2]/tbody/tr[5]/td[2]/table/tbody/tr/td/div[6]/h1/text()")
        l.add_xpath('idd', "//html/body/table[2]/tbody/tr[5]/td[2]/table/tbody/tr/td/div[6]/h1/text()")
        return l.load_item()

I run the following command:
scrapy crawl dns_shop_spider -o scarped_data_utf8.csv -t csv

The log shows that Scrapy visits all the required URLs, but nothing is ever written to the specified file when the spider runs. What could be wrong?
1 Answer

Assuming you want to follow all of the links on the page http://www.playground.ru/files/stalker_clear_sky/ and get the title, URL, and download link:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector

from scrapy.item import Item, Field


class PlayGroundItem(Item):
    title = Field()
    url = Field()
    download_url = Field()


class PlayGroundLoader(XPathItemLoader):
    default_output_processor = TakeFirst()


class PlayGroundSpider(CrawlSpider):
    name = "playground_spider"
    allowed_domains = ["www.playground.ru"]
    start_urls = ["http://www.playground.ru/files/stalker_clear_sky/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('/files/s_t_a_l_k_e_r_chistoe_nebo')), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = PlayGroundLoader(PlayGroundItem(), hxs)
        l.add_value('url', response.url)
        l.add_xpath('title', "//div[@class='downloads-container clearfix']/h1/text()")
        l.add_xpath('download_url', "//div[@class='files-download-holder']/div/a/@href")

        return l.load_item()

Save it to test_scrapy.py and run it via:

scrapy runspider test_scrapy.py -o output.json
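For reference, scrapy runspider executes a spider from a standalone file, so no Scrapy project is needed; scrapy crawl, as used in the question, must instead be run from inside a project and refers to the spider by its name attribute rather than by file name. Assuming the spider above lives inside a project, an equivalent invocation would be:

scrapy crawl playground_spider -o output.json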

Then check the output.json file.
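A note on why the original spider produced an empty file: a CrawlSpider applies only the first rule whose pattern matches a given link. Both of the original rules used the same allow pattern, so the first rule (follow=True, no callback) consumed every link and parse_item was never called, meaning no items were ever scraped or exported. That is why the answer above merges both behaviors into a single rule. A minimal sketch of the difference, using the extractors from the question:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Original: the callback-less rule matches every link first and shadows
# the second rule, so parse_item never runs and the output stays empty.
rules = (
    Rule(SgmlLinkExtractor(allow=('/files/s_t_a_l_k_e_r_chistoe_nebo')), follow=True),
    Rule(SgmlLinkExtractor(allow=('/files/s_t_a_l_k_e_r_chistoe_nebo')), callback='parse_item'),
)

# Fixed: one rule that both follows links and parses each matched page.
rules = (
    Rule(SgmlLinkExtractor(allow=('/files/s_t_a_l_k_e_r_chistoe_nebo')), follow=True, callback='parse_item'),
)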

Hope that helps.


I can't figure out where I'm supposed to click to give you a reputation boost. - user2420607
I've marked the answer as accepted. I'd still like to ask why my XPath query didn't work while yours does. They are: l.add_xpath('title', "//div[@class='downloads-container clearfix']/h1/text()") and: l.add_xpath('title', ".//*[@id='mainTable']/tbody/tr[5]/td[2]/table/tbody/tr/td/div[6]/h1/text()") Only the first one works. I wrote my XPath queries with Firebug in Mozilla Firefox. How do you write your XPath queries? - user2420607
Yes, I use browser developer tools too. But the XPath they generate can usually be simplified a great deal. - alecxe
Which developer tools do you use to write XPath queries? - user2420607
I use the Chrome developer tools to inspect elements on the page. Usually an element can easily be located by the id or class of the element itself or of its parent. - alecxe
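A likely reason for the difference, worth spelling out: browsers insert tbody elements into tables when building the DOM, but the raw HTML that Scrapy downloads often contains no tbody at all, so a browser-generated absolute path that goes through tbody matches nothing. Anchoring on a stable id or class attribute, as in the accepted answer, is both shorter and more robust. Side by side (both expressions are taken from the comments above, and both assume the loader l from parse_item):

# Browser-generated path: brittle, and the tbody elements may not exist
# in the HTML Scrapy actually receives, so this can match nothing:
l.add_xpath('title', ".//*[@id='mainTable']/tbody/tr[5]/td[2]/table/tbody/tr/td/div[6]/h1/text()")

# Attribute-anchored path: short, readable, independent of table layout:
l.add_xpath('title', "//div[@class='downloads-container clearfix']/h1/text()")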
