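I'm currently running into a problem with Scrapy. Whenever I use Scrapy to crawl an HTTPS site whose certificate CN matches the server's domain name, Scrapy works great! However, when I try to crawl a site whose certificate CN does not match the server's domain name, I get the following: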
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 415, in dataReceived
self._write(bytes)
File "/usr/local/lib/python2.7/dist-packages/twisted/protocols/tls.py", line 554, in _write
sent = self._tlsConnection.send(toSend)
File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 1270, in send
result = _lib.SSL_write(self._ssl, buf, len(buf))
File "/usr/local/lib/python2.7/dist-packages/OpenSSL/SSL.py", line 926, in wrapper
callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
return wrapped(connection, where, ret)
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/_sslverify.py", line 1154, in _identityVerifyingInfoCallback
verifyHostname(connection, self._hostnameASCII)
File "/usr/local/lib/python2.7/dist-packages/service_identity/pyopenssl.py", line 30, in verify_hostname
obligatory_ids=[DNS_ID(hostname)],
File "/usr/local/lib/python2.7/dist-packages/service_identity/_common.py", line 235, in __init__
raise ValueError("Invalid DNS-ID.")
exceptions.ValueError: Invalid DNS-ID.
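I've gone through as much documentation as I could find, and as far as I can tell Scrapy has no way to disable SSL certificate verification. Even the documentation for the Scrapy Request object (which is where I'd expect such an option to live) makes no mention of it: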
http://doc.scrapy.org/en/1.0/topics/request-response.html#scrapy.http.Request
https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/init.py
Nor is there a Scrapy setting that addresses this:
http://doc.scrapy.org/en/1.0/topics/settings.html
Short of taking the Scrapy source and modifying it as needed, does anyone have any ideas on how to disable SSL certificate verification?
Thanks!
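You can use the DOWNLOAD_HANDLERS or DOWNLOAD_HANDLERS_BASE setting to change how Scrapy handles https. From there, you would probably need to create your own modified HttpDownloadHandler to work around the error you're seeing. - Kyle Pittman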
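To make the suggestion in the comment above a bit more concrete, here is a minimal sketch of what the override might look like in a project's settings.py. DOWNLOAD_HANDLERS is a real Scrapy setting that maps URL schemes to handler class paths, but the dotted path used here (myproject.handlers.LenientHttpsDownloadHandler) is a hypothetical name for a handler you would have to write yourself; the handler is not shown because its details depend on your Scrapy and Twisted versions.

```python
# settings.py (sketch, not a drop-in fix)

# Hypothetical dotted path to your own modified download handler.
# Neither the module nor the class exists in Scrapy; you would create them.
CUSTOM_HTTPS_HANDLER = "myproject.handlers.LenientHttpsDownloadHandler"

# DOWNLOAD_HANDLERS is merged over DOWNLOAD_HANDLERS_BASE, so listing only
# the "https" scheme here leaves the handlers for other schemes untouched.
DOWNLOAD_HANDLERS = {
    "https": CUSTOM_HTTPS_HANDLER,
}
```

Overriding DOWNLOAD_HANDLERS rather than editing DOWNLOAD_HANDLERS_BASE keeps the change local to the project, which avoids patching the Scrapy source itself.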