I am using Scrapy to crawl every page under a single domain.
I have seen this question: https://dev59.com/xV3Va4cB1Zd3GeqPFPGv. It has no solution, though, and my problem looks very similar. Here is the output of my crawl command:
scrapy crawl sjsu
2012-02-22 19:41:35-0800 [scrapy] INFO: Scrapy 0.14.1 started (bot: sjsucrawler)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled item pipelines:
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider opened
2012-02-22 19:41:35-0800 [sjsu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-02-22 19:41:35-0800 [sjsu] DEBUG: Crawled (200) <GET http://cs.sjsu.edu/> (referer: None)
2012-02-22 19:41:35-0800 [sjsu] INFO: Closing spider (finished)
2012-02-22 19:41:35-0800 [sjsu] INFO: Dumping spider stats:
{'downloader/request_bytes': 198,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 11000,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 788155),
'scheduler/memory_enqueued': 1,
'start_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 379951)}
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider closed (finished)
2012-02-22 19:41:35-0800 [scrapy] INFO: Dumping global stats:
{'memusage/max': 29663232, 'memusage/startup': 29663232}
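The problem here is that the spider can find the links on the first page but does not visit them. What use is a crawler like that?
Edit:
My spider code is: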
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SjsuSpider(BaseSpider):
    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = [
        "http://cs.sjsu.edu/"
    ]

    def parse(self, response):
        # Save the raw body of the downloaded page to a local file.
        filename = "sjsupages"
        with open(filename, 'wb') as f:
            f.write(response.body)
All my other settings are the defaults.
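To make the "find links but not visit them" step concrete, here is a minimal sketch of only the link-discovery part, written with the Python standard library rather than Scrapy. As far as I can tell, a BaseSpider only requests what is in start_urls unless parse returns further requests, which would match the single request in the stats above. The names `LinkCollector` and `extract_links`, and the sample paths in the demo, are hypothetical and not part of Scrapy or of my project.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, allowed_domain):
    """Return absolute, de-duplicated http(s) links on allowed_domain, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the URL of the page they came from.
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        host = parts.hostname or ""
        # Accept the domain itself and any of its subdomains.
        on_domain = host == allowed_domain or host.endswith("." + allowed_domain)
        if parts.scheme in ("http", "https") and on_domain and absolute not in links:
            links.append(absolute)
    return links


if __name__ == "__main__":
    # Hypothetical sample markup; the real page body would come from response.body.
    sample = '<a href="/about.html">About</a> <a href="http://example.com/">Other</a>'
    for link in extract_links(sample, "http://cs.sjsu.edu/", "sjsu.edu"):
        print(link)
```

In a spider, each URL returned this way would still have to be handed back to Scrapy as a new request from parse. The sketch stops short of that and only shows which URLs would qualify under allowed_domains.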