How to disable SSL verification in Python Scrapy?

I have been writing data-scraping scripts in PHP for three years.

Here is a simple PHP script:

$url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY';
$fields = array(
    'p_entity_name' => urlencode('AAA'),
    'p_name_type' => urlencode('A'),
    'p_search_type' => urlencode('BEGINS')
);
// url-ify the data for the POST
$fields_string = '';
foreach ($fields as $key => $value) {
    $fields_string .= $key . '=' . $value . '&';
}
$fields_string = rtrim($fields_string, '&');
// open connection
$ch = curl_init();
// set the url, number of POST vars, POST data
curl_setopt($ch, CURLOPT_URL, $url);
// skip SSL host and peer certificate verification
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_POST, count($fields));
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
// return the response body so the final print shows it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// execute post
$result = curl_exec($ch);
print curl_error($ch) . '<br>';
print curl_getinfo($ch, CURLINFO_HTTP_CODE) . '<br>';
print $result;
curl_close($ch);

It only works when CURLOPT_SSL_VERIFYPEER is set to false. If I enable CURLOPT_SSL_VERIFYPEER, or use http instead of https, it returns an empty response.
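
For comparison, here is a minimal Python sketch of the same POST using the third-party requests library (not part of the original question); its verify=False flag skips peer certificate verification the same way CURLOPT_SSL_VERIFYPEER = 0 does:

import requests

url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
payload = {'p_entity_name': 'AAA', 'p_name_type': 'A', 'p_search_type': 'BEGINS'}

# verify=False disables peer certificate verification (the CURLOPT_SSL_VERIFYPEER analogue)
response = requests.post(url, data=payload, verify=False)
print(response.status_code)
print(response.text)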

However, I have to do the same project in Python Scrapy. Here is the equivalent code in Scrapy:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http.request import Request
import urllib
from appext20.items import Appext20Item

class Appext20Spider(CrawlSpider):
    name = "appext20"
    allowed_domains = ["appext20.dos.ny.gov"]
    DOWNLOAD_HANDLERS = {
        'https': 'my.custom.downloader.handler.https.HttpsDownloaderIgnoreCNError',
    }
    def start_requests(self):
        payload = {"p_entity_name": 'AMEB', "p_name_type": 'A', 'p_search_type':'BEGINS'}
        url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
        yield Request(url, self.parse_data, method="POST", body=urllib.urlencode(payload))

    def parse_data(self, response):
        print('here is response')
        print response

It returns an empty response. I need to disable SSL verification.

Please forgive my limited knowledge of Python Scrapy. I have searched a lot but could not find any solution.


Your spider code uses the "http://" scheme. I believe you want url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'. Apart from that, Scrapy 1.1+ skips peer certificate verification by default. Could you share the logs and tell us what problem you are running into? - paul trmbrth
@paultrmbrth Sorry for the confusion, I am already using https in the Scrapy code... but it returns an empty response... Could you tell me where to look for the logs? Do you mean the full terminal output after running the code? If so, here is the full output after I run the Scrapy code... http://www.beetxt.com/printable.php?view=Jnw - Umair Ayub
2 Answers

I have seen the answer to the question you mentioned.. but I am not sure where to write that line of code. Do you know? - Umair Ayub
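
For reference, a minimal sketch of where such a setting lives, assuming Scrapy 1.0+ (DOWNLOADER_CLIENTCONTEXTFACTORY is a real Scrapy setting, and its default, ScrapyClientContextFactory, already skips remote certificate verification). Project-wide settings go in settings.py; per-spider settings go in the spider's custom_settings dict (a plain class attribute like the DOWNLOAD_HANDLERS in the question is ignored by Scrapy):

# Option 1: project-wide, in settings.py.
# Scrapy's default context factory does NOT verify server certificates,
# so setting it explicitly mostly documents that behaviour:
DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'

# Option 2: per spider, via custom_settings (Scrapy 1.0+):
from scrapy.spiders import CrawlSpider

class Appext20Spider(CrawlSpider):
    name = "appext20"
    custom_settings = {
        'DOWNLOADER_CLIENTCONTEXTFACTORY':
            'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory',
    }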

This code works for me:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector

class Appext20Spider(CrawlSpider):
    name = "appext20"
    allowed_domains = ["appext20.dos.ny.gov"]
    payload = {"p_entity_name": 'AME', "p_name_type": 'A', 'p_search_type': 'BEGINS'}

    def start_requests(self):
        url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
        # FormRequest url-encodes the payload and sends it as a POST body
        return [FormRequest(url,
                            formdata=self.payload,
                            callback=self.parse_data)]

    def parse_data(self, response):
        print('here is response')
        # each search result sits in a <td headers="c1"> cell containing one link
        questions = HtmlXPathSelector(response).xpath("//td[@headers='c1']")
        all_links = []
        for tr in questions:
            temp_dict = {}
            temp_dict['link'] = tr.xpath('a/@href').extract()
            temp_dict['title'] = tr.xpath('a/text()').extract()
            all_links.append(temp_dict)
        print(all_links)
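
As a side note, the HtmlXPathSelector import is deprecated; on Scrapy 1.0+ the same parse step can be written with response.xpath() directly. A minimal sketch, reusing the XPath expressions from the answer above:

    def parse_data(self, response):
        # response.xpath() replaces HtmlXPathSelector on Scrapy 1.0+
        all_links = []
        for td in response.xpath("//td[@headers='c1']"):
            all_links.append({
                'link': td.xpath('a/@href').extract_first(),
                'title': td.xpath('a/text()').extract_first(),
            })
        print(all_links)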
