Scrapy - Extracting items from a table

Question

10

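I'm trying to get my head around Scrapy, but I've hit a few dead ends.

The page has two tables. I want to extract the data from each one and then move on to the next page.

The tables look like this (the first is called Y1, the second Y2), and both have the same structure.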

<div id="Y1" style="margin-bottom: 0px; margin-top: 15px;">
                                <h2>First information</h2><hr style="margin-top: 5px; margin-bottom: 10px;">                    

                <table class="table table-striped table-hover table-curved">
                    <thead>
                        <tr>
                            <th class="tCol1" style="padding: 10px;">First Col Head</th>
                            <th class="tCol2" style="padding: 10px;">Second Col Head</th>
                            <th class="tCol3" style="padding: 10px;">Third Col Head</th>
                        </tr>
                    </thead>
                    <tbody>

                        <tr>
                            <td>Info 1</td>
                            <td>Monday 5 September, 2016</td>
                            <td>Friday 21 October, 2016</td>
                        </tr>
                        <tr class="vevent">
                            <td class="summary"><b>Info 2</b></td>
                            <td class="dtstart" timestamp="1477094400"><b></b></td>
                            <td class="dtend" timestamp="1477785600">
                            <b>Sunday 30 October, 2016</b></td>
                        </tr>
                        <tr>
                            <td>Info 3</td>
                            <td>Monday 31 October, 2016</td>
                            <td>Tuesday 20 December, 2016</td>
                        </tr>


                    <tr class="vevent">
                        <td class="summary"><b>Info 4</b></td>                      
                        <td class="dtstart" timestamp="1482278400"><b>Wednesday 21 December, 2016</b></td>
                        <td class="dtend" timestamp="1483315200">
                        <b>Monday 2 January, 2017</b></td>
                    </tr>



                </tbody>
            </table>

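As you can see, the structure is a little inconsistent, but as long as I can get each td and output it to CSV, I'll be happy.

I tried using XPath, but that only confused me more.

My last attempt: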

import scrapy

class myScraperSpider(scrapy.Spider):
    name = "myScraper"

    allowed_domains = ["mysite.co.uk"]
    start_urls = (
        'https://mysite.co.uk/page1/',
    )

    def parse_products(self, response):
        products = response.xpath('//*[@id="Y1"]/table')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[1]').extract()[0]
            item['first'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[2]').extract()[0]
            item['last'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[3]').extract()[0]
            yield item

There are no errors, but it just prints a lot of information about the crawl and no actual results.

Update:

  import scrapy

       class SchoolSpider(scrapy.Spider):
name = "school"

allowed_domains = ["termdates.co.uk"]
start_urls =    (
                'https://termdates.co.uk/school-holidays-16-19-abingdon/',
                )

  def parse_products(self, response):
  products = sel.xpath('//*[@id="Year1"]/table//tr')
 for p in products[1:]:
  item = dict()
  item['hol'] = p.xpath('td[1]/text()').extract_first()
  item['first'] = p.xpath('td[1]/text()').extract_first()
  item['last'] = p.xpath('td[1]/text()').extract_first()
  yield item

This gives me an error: IndentationError: unexpected indent

If I run the amended script below (thanks to @Granitosaurus) and output to CSV (-o schoolDates.csv), I get an empty file:

import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item

Here is the log:

2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider opened
2017-03-23 12:04:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-23 12:04:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening...
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/robots.txt> (referer: None)
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
2017-03-23 12:04:08 [scrapy.core.scraper] ERROR: Spider error processing <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\python27\lib\site-packages\scrapy-1.3.3-py2.7.egg\scrapy\spiders\__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-23 12:04:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 467,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 11311,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 845000),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 356000)}
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider closed (finished)

Update 2 (skipped rows): this pushes the results to the CSV file, but skips every other row.

The shell shows {'hol': None, 'last': u'\r\n\t\t\t\t\t\t\t\t', 'first': None}.

import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[2]/text()').extract_first()
            item['last'] = p.xpath('td[3]/text()').extract_first()
            yield item

Solution: thanks to @vold's help, this spider crawls every page in start_urls and copes with the inconsistent table layout.

# -*- coding: utf-8 -*-
import scrapy
from SchoolDates_1.items import Schooldates1Item

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',
                  'https://termdates.co.uk/school-holidays-3-dimensions',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('td[1]//text()').extract_first()
            item['first'] = product.xpath('td[2]//text()').extract_first()
            item['last'] = ''.join(product.xpath('td[3]//text()').extract()).strip()
            item['url'] = response.url
            yield item

- stonk

2

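Please provide more information: what did you try? Which code? Which XPath expression confuses you? Have you read the Scrapy tutorial on selectors? - rfelten

Hi rfelten, I've added my latest code above. Thanks. - stonk

Can you provide a link to the site you want to parse? Also, try not to use tbody in XPath expressions. - vold

@vold Is there a reason not to use tbody? It seems like the obvious way to avoid parsing the header row. - Granitosaurus

I changed parse_products to parse. That raised an error: NameError: global name 'sel' is not defined. Fixed by changing 'sel.xpath' to 'response.xpath'. - stonk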

3 Answers

5

You can use CSS selectors instead of XPath; I find CSS selectors easier.

def parse_products(self, response):
    # select the table rows and skip the header row
    for product in response.css("#Y1 table tr")[1:]:
        item = Schooldates1Item()
        item['hol'] = product.css('td:nth-child(1)::text').extract_first()
        item['first'] = product.css('td:nth-child(2)::text').extract_first()
        item['last'] = product.css('td:nth-child(3)::text').extract_first()
        yield item

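Also, don't use the tbody tag in your selectors. Source:

Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.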
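The quoted point can be checked without running a crawl. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption made so the snippet runs on its own). The two markup strings and the count_rows helper are hypothetical, trimmed-down illustrations, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw server markup: the table has no <tbody>.
RAW = """
<div id="Y1">
  <table>
    <tr><th>First Col Head</th></tr>
    <tr><td>Info 1</td></tr>
    <tr><td>Info 3</td></tr>
  </table>
</div>
"""

# The same table as a browser might display it after inserting <tbody>.
BROWSER = """
<div id="Y1">
  <table>
    <tbody>
      <tr><th>First Col Head</th></tr>
      <tr><td>Info 1</td></tr>
      <tr><td>Info 3</td></tr>
    </tbody>
  </table>
</div>
"""

def count_rows(markup, path):
    """Return how many elements `path` matches below the root <div>."""
    return len(ET.fromstring(markup.strip()).findall(path))

with_tbody = "./table/tbody/tr"  # only matches if <tbody> is really there
any_depth = "./table//tr"        # matches rows at any depth under <table>

raw_strict = count_rows(RAW, with_tbody)
raw_loose = count_rows(RAW, any_depth)
browser_strict = count_rows(BROWSER, with_tbody)
browser_loose = count_rows(BROWSER, any_depth)

print(raw_strict, raw_loose, browser_strict, browser_loose)
```

The path that names tbody depends on whether the element exists in the downloaded source, while table//tr finds the rows in both variants, which is why it is the safer choice in a spider.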

- Umair Ayub

It doesn't matter whether you use CSS or XPath; in this case XPath is arguably even more intuitive, e.g. td[1]. - Granitosaurus

0

I got it working with the HTML source you provided, using these XPaths:

products = sel.xpath('//*[@id="Y1"]/table//tr')
for p in products[1:]:
    item = dict()
    item['hol'] = p.xpath('td[1]/text()').extract_first()
    item['first'] = p.xpath('td[1]/text()').extract_first()
    item['last'] = p.xpath('td[1]/text()').extract_first()
    yield item

The above assumes that each table row contains exactly one item.

- Granitosaurus

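TBODY is added by browsers such as Mozilla and Chrome. It does not exist in the source of the HTML response, so your xpath won't work. - Umair Ayub

@Umair, in the context of the OP's code it should work :P. Besides, you are implying that the OP didn't download the source using a browser or some kind of rendering. So in the context of this question my original answer would work, but I've adjusted the answer to reflect your point anyway. - Granitosaurus

Thanks for the input, everyone. Please see my edit, which includes the site I'm trying to crawl. - stonk

Quick question: why is every xpath td[1]? Does .extract_first() remove the td? - David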

- vold · Accepted Answer

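You need to modify your code slightly. Since you have already selected every row inside the table, there is no need to point at the table again. You can therefore shorten your XPath to something like td[1]//text().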

def parse_products(self, response):
    products = response.xpath('//*[@id="Year1"]/table//tr')
    # ignore the table header row
    for product in products[1:]:
        item = Schooldates1Item()
        item['hol'] = product.xpath('td[1]//text()').extract_first()
        item['first'] = product.xpath('td[2]//text()').extract_first()
        item['last'] = product.xpath('td[3]//text()').extract_first()
        yield item

I edited my answer after @stutray provided a link to the site.
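To see why td[n]//text() picks up the rows that td[n]/text() skipped, here is a small sketch built on two rows from the question's table. It uses the standard library's xml.etree.ElementTree rather than Scrapy (an assumption, so that it runs standalone), and the helpers direct_text and all_text are hypothetical names that only approximate the two XPath forms.

```python
import xml.etree.ElementTree as ET

# Two rows from the question: one plain, one with text wrapped in <b>.
ROWS = """
<table>
  <tr>
    <td>Info 1</td>
    <td>Monday 5 September, 2016</td>
    <td>Friday 21 October, 2016</td>
  </tr>
  <tr class="vevent">
    <td class="summary"><b>Info 2</b></td>
    <td class="dtstart"><b></b></td>
    <td class="dtend">
      <b>Sunday 30 October, 2016</b></td>
  </tr>
</table>
"""

def direct_text(cell):
    """Text sitting directly inside the cell, roughly td[n]/text()."""
    return cell.text

def all_text(cell):
    """All text in the cell and its children, joined and stripped,
    roughly ''.join(td[n]//text()).strip()."""
    return "".join(cell.itertext()).strip()

table = ET.fromstring(ROWS.strip())
plain_row, bold_row = table.findall("tr")

# For the <b>-wrapped row the direct text is missing or whitespace only.
direct = [direct_text(td) for td in bold_row.findall("td")]
# Descending into the children recovers the values.
nested = [all_text(td) for td in bold_row.findall("td")]

print(direct)
print(nested)
```

The direct lookups on the vevent row come back as None, None and a whitespace-only string, the same pattern as the shell output in the question. Joining every text node and stripping the result also removes the leading line break in the last column, which is what the final spider in the question does for item['last'].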