I have been writing data-scraping scripts in PHP for three years.
Here is a simple PHP script.
$url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY';
$fields = array(
    'p_entity_name' => urlencode('AAA'),
    'p_name_type'   => urlencode('A'),
    'p_search_type' => urlencode('BEGINS')
);
// url-ify the data for the POST
$fields_string = ''; // initialize first to avoid an undefined-variable notice
foreach ($fields as $key => $value) {
    $fields_string .= $key . '=' . $value . '&';
}
$fields_string = rtrim($fields_string, '&');
// open connection
$ch = curl_init();
// set the URL, disable SSL checks, enable POST, attach the POST data
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
// return the body from curl_exec() instead of echoing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// execute post
$result = curl_exec($ch);
print curl_error($ch) . '<br>';
print curl_getinfo($ch, CURLINFO_HTTP_CODE) . '<br>';
print $result;
It only works when CURLOPT_SSL_VERIFYPEER is false. If CURLOPT_SSL_VERIFYPEER is enabled, or if http is used instead of https, it returns an empty response.
However, I have to do the same project in Python Scrapy. Here is the same code in Scrapy.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http.request import Request
import urllib
from appext20.items import Appext20Item


class Appext20Spider(CrawlSpider):
    name = "appext20"
    allowed_domains = ["appext20.dos.ny.gov"]
    DOWNLOAD_HANDLERS = {
        'https': 'my.custom.downloader.handler.https.HttpsDownloaderIgnoreCNError',
    }

    def start_requests(self):
        payload = {"p_entity_name": 'AMEB', "p_name_type": 'A', 'p_search_type': 'BEGINS'}
        url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
        yield Request(url, self.parse_data, method="POST", body=urllib.urlencode(payload))

    def parse_data(self, response):
        print('here is repos')
        print(response)
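It returns an empty response. SSL verification needs to be disabled.
Please excuse my limited knowledge of Python Scrapy. I have searched a lot but have not found a solution.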
url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'. Other than that, Scrapy 1.1+ skips peer certificate verification by default. Can you share your logs and tell us what problem you are running into? - paul trmbrth
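One thing that may be worth checking, independent of SSL: PHP's cURL adds a Content-Type: application/x-www-form-urlencoded header on its own when CURLOPT_POSTFIELDS is given a string, whereas a plain Scrapy Request with a body does not set that header. Whether this is what causes the empty response here is only an assumption, since no logs were posted. Below is a minimal sketch of the body and header such a form POST needs. It uses Python 3 and the standard library only (the question's code is Python 2), and build_form_post is a hypothetical helper name, not part of Scrapy.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY"


def build_form_post(fields):
    """Return (body, headers) for an application/x-www-form-urlencoded POST.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # urlencode percent-encodes each key and value and joins the pairs with '&'
    body = urlencode(fields)
    # Header that PHP's cURL sends implicitly for a string CURLOPT_POSTFIELDS
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return body, headers


# Same form fields as in the Scrapy spider from the question
payload = {"p_entity_name": "AMEB", "p_name_type": "A", "p_search_type": "BEGINS"}
body, headers = build_form_post(payload)
print(body)
print(headers)

# Assumed usage inside start_requests (not executed here):
#   yield Request(SEARCH_URL, self.parse_data, method="POST",
#                 body=body, headers=headers)
```

Scrapy's FormRequest, called with formdata=payload, does both steps (encoding the fields and setting the header) for you, so it may be the simpler thing to try first.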