Handling Scrapy Div Class

Question

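I am new to both Scrapy and Python. To help with my thesis, I am trying to write a spider that scrapes article titles, links, and article descriptions, much like an RSS feed. I have written the spider below, but when I run it and export to a .txt file, the output is blank. I think I may need to add an Item Loader, but I am not sure.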

Items.py

from scrapy.item import Item, Field

class NorthAfricaItem(Item):
    title = Field()
    link = Field()
    desc = Field()

Spider

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafricatutorial.items import NorthAfricaItem

class NorthAfricaItem(BaseSpider):
    name = "northafrica"
    allowed_domains = ["http://www.north-africa.com/"]
    start_urls = [
        "http://www.north-africa.com/naj_news/news_na/index.1.html",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = NorthAfricaItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

Update

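Thanks to Talvalin's help and some trial and error, I have solved the problem. I had started from a stock script I found online, but once I used the shell I found the correct tags to get the content I needed. The final result is: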

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafrica.items import NorthAfricaItem

class NorthAfricaSpider(BaseSpider):
    name = "northafrica"
    allowed_domains = ["http://www.north-africa.com/"]
    start_urls = [
        "http://www.north-africa.com/naj_news/news_na/index.1.html",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = NorthAfricaItem()
            item['title'] = site.select('//div[@class="short_holder"]/h2/a/text()').extract()
            item['link'] = site.select('//div[@class="short_holder"]/h2/a/@href').extract()
            item['desc'] = site.select('//span[@class="summary"]/text()').extract()
            items.append(item)
        return items

If anyone sees something I am doing wrong, please let me know... but it works.

- Mike

Update: I went back to the shell to find out why it was not working. It turns out I was using the wrong selector. I am now using >>> hxs.select('//div[@class="short_holder"]/h2').extract(), which gives me what I want, but the result is [u'<h2><a href="naj_news/news_na/2janseventeen48.html">Terror Attack on Gas Site: Algeria Faces Greatest Crisis in Decades </a></h2>'] and I am trying to work out how to extract just the text. Is this a nested function? - Mike

Which selector did you change? Could you edit your question and show the updated code? - Talvalin

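@Mike You should not delete a question once it has been answered, because it may help others who run into the same problem. You can add an update instead. I have edited your question; you can roll it back if you like :) - jadkik94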

I am still trying to figure all of this out! Sorry. :) - Mike

1 Answer


- Talvalin · Accepted Answer

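The thing to note about this code is that it raises an error. Try running the spider from the command line and you will see output similar to the following: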

        exceptions.TypeError: 'NorthAfricaItem' object does not support item assignment

2013-01-24 16:43:35+0000 [northafrica] INFO: Closing spider (finished)

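This error occurs because you gave the spider class and the item class the same name: NorthAfricaItem.

In the spider module, the spider class is defined after the item class is imported, so the name NorthAfricaItem ends up referring to the spider. When parse() creates a NorthAfricaItem instance in order to assign the title, link, and description, it therefore gets the spider version rather than the item version. Since the spider version of NorthAfricaItem is not actually an Item, the item assignment fails.

To fix this, rename the spider class to something like NorthAfricaSpider and the problem should go away.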
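
The name clash can be reproduced without Scrapy at all. The following is only a minimal sketch in plain Python: Item and BaseSpider here are hypothetical stand-ins (a dict subclass and an empty class), not the real Scrapy classes, and imported_item_class is a helper name introduced purely for illustration. It shows the later class definition taking over the name, and the assignment working again once the spider has its own name.

```python
# Stand-ins for illustration only; these are NOT the real Scrapy classes.
class Item(dict):
    """Plays the role of an item: supports item['key'] = value."""


class BaseSpider(object):
    """Plays the role of a spider base class: no item assignment."""


# Equivalent of: from northafricatutorial.items import NorthAfricaItem
class NorthAfricaItem(Item):
    pass


# Keep a handle on the item class before the name is reused below.
imported_item_class = NorthAfricaItem


# Defining the spider under the same name rebinds NorthAfricaItem.
class NorthAfricaItem(BaseSpider):
    def parse(self):
        item = NorthAfricaItem()  # now resolves to the spider class
        item['title'] = 'x'       # raises TypeError
        return item


def run_shadowed():
    """Return the error message produced by the shadowed version."""
    try:
        NorthAfricaItem().parse()
    except TypeError as exc:
        return str(exc)
    return None


# The fix: a distinct spider name, so the item class is used for items.
class NorthAfricaSpider(BaseSpider):
    def parse(self):
        item = imported_item_class()
        item['title'] = 'x'
        return item


error_message = run_shadowed()
fixed_item = NorthAfricaSpider().parse()
print(error_message)
print(fixed_item)
```

In the real project no extra handle is needed: once the spider is renamed, the imported NorthAfricaItem is never overwritten, so NorthAfricaItem() inside parse() refers to the item class again.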