I'm trying to get to grips with Scrapy, but I've hit a few snags.
There are two tables on the page, and I want to extract the data from each one and then move on to the next page.
The tables look like this (the first is called Y1, the second Y2) and the structures are identical.
<div id="Y1" style="margin-bottom: 0px; margin-top: 15px;">
<h2>First information</h2><hr style="margin-top: 5px; margin-bottom: 10px;">
<table class="table table-striped table-hover table-curved">
<thead>
<tr>
<th class="tCol1" style="padding: 10px;">First Col Head</th>
<th class="tCol2" style="padding: 10px;">Second Col Head</th>
<th class="tCol3" style="padding: 10px;">Third Col Head</th>
</tr>
</thead>
<tbody>
<tr>
<td>Info 1</td>
<td>Monday 5 September, 2016</td>
<td>Friday 21 October, 2016</td>
</tr>
<tr class="vevent">
<td class="summary"><b>Info 2</b></td>
<td class="dtstart" timestamp="1477094400"><b></b></td>
<td class="dtend" timestamp="1477785600">
<b>Sunday 30 October, 2016</b></td>
</tr>
<tr>
<td>Info 3</td>
<td>Monday 31 October, 2016</td>
<td>Tuesday 20 December, 2016</td>
</tr>
<tr class="vevent">
<td class="summary"><b>Info 4</b></td>
<td class="dtstart" timestamp="1482278400"><b>Wednesday 21 December, 2016</b></td>
<td class="dtend" timestamp="1483315200">
<b>Monday 2 January, 2017</b></td>
</tr>
</tbody>
</table>
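Both tables share this structure, so in principle the row selection only needs to be parameterised over the two ids. A rough sketch of what I'm aiming for (ids taken from the description above, untested):

# scrapy shell 'https://mysite.co.uk/page1/'
for table_id in ('Y1', 'Y2'):
    # //tr rather than /tbody/tr: browsers add tbody even when the raw HTML lacks it
    rows = response.xpath('//*[@id="%s"]/table//tr' % table_id)
    for row in rows[1:]:  # skip the header row
        print(row.xpath('td[1]//text()').extract_first())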
As you can see, the structure is slightly inconsistent, but as long as I can get every td and output it to a csv I'll be very happy.
I tried using XPath, but that only confused me more.
My last attempt:
import scrapy
from SchoolDates_1.items import Schooldates1Item

class myScraperSpider(scrapy.Spider):
    name = "myScraper"
    allowed_domains = ["mysite.co.uk"]
    start_urls = (
        'https://mysite.co.uk/page1/',
    )

    def parse_products(self, response):
        products = response.xpath('//*[@id="Y1"]/table')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[1]').extract()[0]
            item['first'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[2]').extract()[0]
            item['last'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[3]').extract()[0]
            yield item
No errors, but it just feeds back a lot of information about the crawl and no actual results.
Update:
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = (
        'https://termdates.co.uk/school-holidays-16-19-abingdon/',
    )

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item
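(Looking back, sel is never defined in this version; it's a leftover from older Scrapy examples that build a selector by hand, roughly:

from scrapy.selector import Selector
sel = Selector(response)  # older style; response.xpath(...) does the same job now

which is why the comment at the end suggests swapping sel.xpath for response.xpath.)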
This gives me an error: IndentationError: unexpected indent. If I run the amended script below (thanks @Granitosaurus) and output to a CSV file (-o schoolDates.csv), I get an empty file:
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item
Here is the log:

2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider opened
2017-03-23 12:04:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-23 12:04:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on ...
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/robots.txt> (referer: None)
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
2017-03-23 12:04:08 [scrapy.core.scraper] ERROR: Spider error processing <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\python27\lib\site-packages\scrapy-1.3.3-py2.7.egg\scrapy\spiders\__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-23 12:04:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 467,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 11311,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 845000),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 356000)}
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider closed (finished)
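The NotImplementedError in that traceback turned out to be the telltale sign: Scrapy sends responses from start_urls to a callback named parse by default, so with only parse_products defined it is the base Spider's placeholder parse that runs. A minimal sketch of the rename (the same fix noted in the comments at the end):

import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    # parse() is the default callback for start_urls; a differently named
    # method only runs if a Request sets callback=... explicitly
    def parse(self, response):
        for row in response.xpath('//*[@id="Year1"]/table//tr')[1:]:
            yield {'hol': row.xpath('td[1]/text()').extract_first()}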
Update 2: (skipping rows)
This pushes results to the csv file, but skips every other row.
The shell shows {'hol': None, 'last': u'\r\n\t\t\t\t\t\t\t\t', 'first': None}
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[2]/text()').extract_first()
            item['last'] = p.xpath('td[3]/text()').extract_first()
            yield item
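For anyone following along, the skipped rows made sense once I looked at the vevent rows in the markup above: there the visible text sits inside a <b> child, and the cell itself opens with raw newline/tab whitespace, so td[n]/text() returns either nothing or just that whitespace, while td[n]//text() also descends into <b>. A quick check in scrapy shell (using the same Year1 id as above):

# scrapy shell 'https://termdates.co.uk/school-holidays-16-19-abingdon/'
row = response.xpath('//*[@id="Year1"]/table//tr[@class="vevent"]')[0]
row.xpath('td[3]/text()').extract()   # just the u'\r\n\t\t...' whitespace
row.xpath('td[3]//text()').extract()  # the whitespace plus the text inside <b>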
Solution: Thanks to @vold's help, this crawls all the pages in start_urls and handles the inconsistent table layout.

# -*- coding: utf-8 -*-
import scrapy
from SchoolDates_1.items import Schooldates1Item

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',
                  'https://termdates.co.uk/school-holidays-3-dimensions',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('td[1]//text()').extract_first()
            item['first'] = product.xpath('td[2]//text()').extract_first()
            item['last'] = ''.join(product.xpath('td[3]//text()').extract()).strip()
            item['url'] = response.url
            yield item
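For completeness, this is run with Scrapy's built-in feed export, using the same -o flag as above:

scrapy crawl school -o schoolDates.csv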
Comments:
- Don't rely on tbody in the XPath; browsers insert it into the rendered DOM even when it isn't in the page source. - vold
- Changed parse_products to parse.
- Got an error: NameError: global name 'sel' is not defined - fixed by changing 'sel.xpath' to 'response.xpath'. - stonk