Sending an ASP.NET POST request with Python's Requests

11

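I'm using Python's requests module to scrape an old ASP.NET website.

I've spent more than 5 hours trying to simulate this POST request, with no success. With the approach below, I essentially get a message saying "No item matches this item reference."

Any help would be much appreciated. Here are my request and my code, modified somewhat for brevity and/or privacy:

My code: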

import requests

# Scraping the item number from the website, I have confirmed this is working.

#Then use the newly acquired item number to request the data.
item_url = http://www.example.com/EN/items/Pages/yourrates.aspx?vr= + item_number[0]
viewstate = r'/wEPD...' # Truncated for brevity.

# Create the appropriate request and payload.
payload = {"vr": int(item_number[0])}

item_request_body = {
        "__SPSCEditMenu": "true",
        "MSOWebPartPage_PostbackSource": "",
        "MSOTlPn_SelectedWpId": "",
        "MSOTlPn_View": 0,
        "MSOTlPn_ShowSettings": "False",
        "MSOGallery_SelectedLibrary": "",
        "MSOGallery_FilterString": "",
        "MSOTlPn_Button": "none",
        "__EVENTTARGET": "",
        "__EVENTARGUMENT": "",
        "MSOAuthoringConsole_FormContext": "",
        "MSOAC_EditDuringWorkflow": "",
        "MSOSPWebPartManager_DisplayModeName": "Browse",
        "MSOWebPartPage_Shared": "",
        "MSOLayout_LayoutChanges": "",
        "MSOLayout_InDesignMode": "",
        "MSOSPWebPartManager_OldDisplayModeName": "Browse",
        "MSOSPWebPartManager_StartWebPartEditingName": "false",
        "__VIEWSTATE": viewstate,
        "keywords": "Search our site",
        "__CALLBACKID": "ctl00$SPWebPartManager1$g_dbb9e9c7_fe1d_46df_8789_99a6c9db4b22",
        "__CALLBACKPARAM": "startvr"
    }

# Write the appropriate headers for the property information.
item_request_headers = {
    "Host": home_site,
    "Connection": "keep-alive",
    "Content-Length": len(encoded_valuation_request),
    "Cache-Control": "max-age=0",
    "Origin": home_site,
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie": "__utma=48409910.1174413745.1405662151.1406402487.1406407024.17; __utmb=48409910.7.10.1406407024; __utmc=48409910; __utmz=48409910.1406178827.13.3.utmcsr=ratesandvallandingpage|utmccn=landingpages|utmcmd=button",
    "Accept": "*/*",
    "Referer": valuation_url,
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "en-US,en;q=0.8"
}

response = requests.post(url=item_url, params=payload, data=item_request_body, headers=item_request_headers)
print(response.text)

What Chrome tells me the request looks like:

Remote Address:202.55.96.131:80
Request URL:http://www.example.com/EN/items/Pages/yourrates.aspx?vr=123456789
Request Method:POST
Status Code:200 OK

Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:21501
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:__utma=48409910.1174413745.1405662151.1406402487.1406407024.17; __utmb=48409910.7.10.1406407024; __utmc=48409910; __utmz=48409910.1406178827.13.3.utmcsr=ratesandvallandingpage|utmccn=landingpages|utmcmd=button
Host:www.site.com
Origin:www.site.com
Referer:http://www.example.com/EN/items/Pages/yourrates.aspx?vr=123456789
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36

Query String Parameters
vr:123456789

Form Data
__SPSCEditMenu:true
MSOWebPartPage_PostbackSource:
MSOTlPn_SelectedWpId:
MSOTlPn_View:0
MSOTlPn_ShowSettings:False
MSOGallery_SelectedLibrary:
MSOGallery_FilterString:
MSOTlPn_Button:none
__EVENTTARGET:
__EVENTARGUMENT:
MSOAuthoringConsole_FormContext:
MSOAC_EditDuringWorkflow:
MSOSPWebPartManager_DisplayModeName:Browse
MSOWebPartPage_Shared:
MSOLayout_LayoutChanges:
MSOLayout_InDesignMode:
MSOSPWebPartManager_OldDisplayModeName:Browse
MSOSPWebPartManager_StartWebPartEditingName:false
__VIEWSTATE:/wEPD...(Omitted for length)
keywords:Search our site
__CALLBACKID:ctl00$SPWebPartManager1$g_dbb9e9c7_fe1d_46df_8789_99a6c9db4b22
__CALLBACKPARAM:startvr

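Not sure if this helps, but I think your item_url is currently constructed incorrectly; it isn't a string. - Anshul Goyal
True, I hadn't noticed, but that's not my problem: it happened while I was reformatting things to remove the actual URL :) Thanks for spotting it, though! - David K.
"Event" and "ViewState" validation, in addition to the possible "session" issue mentioned below, are possibilities. - EdSF
2 Answers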

19

Your request sets too much by hand. You should not set the Content-Type, Content-Length, Host, Origin, or Connection headers; leave those to requests.
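To see what requests fills in on its own, you can build a request without sending it and look at the prepared headers. This is only a sketch: the two form fields are a made-up subset of the real payload, and the URL is the placeholder from the question.

```python
import requests

# Hypothetical subset of the form fields; the real page sends many more.
form = {"__CALLBACKPARAM": "startvr", "keywords": "Search our site"}

# Build the request without sending it, so the generated headers can be inspected.
prepared = requests.Request(
    "POST",
    "http://www.example.com/EN/items/Pages/yourrates.aspx",
    data=form,
).prepare()

content_type = prepared.headers["Content-Type"]
content_length = prepared.headers["Content-Length"]

print(content_type)                        # set by requests from the data= argument
print(content_length, len(prepared.body))  # computed from the encoded body
```

Because the length is computed from the body that is actually sent, a hand-written Content-Length can easily disagree with it.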

You are also duplicating the URL parameter: either add the vr parameter to the URL by hand or use params, not both.
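A quick way to check the duplication is to prepare both variants and compare the resulting URLs. The item number below is the sample value from the Chrome capture; nothing is sent over the network.

```python
import requests

base = "http://www.example.com/EN/items/Pages/yourrates.aspx"
item_number = "123456789"  # sample value taken from the captured request

# vr in the URL *and* in params: requests appends params to the existing query.
doubled = requests.Request(
    "GET", base + "?vr=" + item_number, params={"vr": item_number}
).prepare()

# vr only in params: the query string is built once.
clean = requests.Request("GET", base, params={"vr": item_number}).prepare()

print(doubled.url)
print(clean.url)
```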

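Most likely some of the parameters in the POST body are generated by the ASP application and tied to a session. I would use a Session object to issue a GET request for valuation_url, parse the returned page, and extract the __CALLBACKID parameter from the form. The requests session will then store any cookies the server sets and reuse them: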

import requests
from bs4 import BeautifulSoup

item_request_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
    "Accept": "*/*",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "en-US,en;q=0.8"
}
payload = {"vr": int(item_number[0])}

# Session() takes no arguments; set the default headers on the instance instead.
session = requests.Session()
session.headers.update(item_request_headers)

# Get form page
form_response = session.get(valuation_url, params=payload)

# parse form page; BeautifulSoup could do this for example
soup = BeautifulSoup(form_response.content, 'html.parser')
callbackid = soup.select('input[name=__CALLBACKID]')[0]['value']

item_request_body = {
    "__SPSCEditMenu": "true",
    "MSOWebPartPage_PostbackSource": "",
    "MSOTlPn_SelectedWpId": "",
    "MSOTlPn_View": 0,
    "MSOTlPn_ShowSettings": "False",
    "MSOGallery_SelectedLibrary": "",
    "MSOGallery_FilterString": "",
    "MSOTlPn_Button": "none",
    "__EVENTTARGET": "",
    "__EVENTARGUMENT": "",
    "MSOAuthoringConsole_FormContext": "",
    "MSOAC_EditDuringWorkflow": "",
    "MSOSPWebPartManager_DisplayModeName": "Browse",
    "MSOWebPartPage_Shared": "",
    "MSOLayout_LayoutChanges": "",
    "MSOLayout_InDesignMode": "",
    "MSOSPWebPartManager_OldDisplayModeName": "Browse",
    "MSOSPWebPartManager_StartWebPartEditingName": "false",
    "__VIEWSTATE": viewstate,
    "keywords": "Search our site",
    "__CALLBACKID": callbackid,
    "__CALLBACKPARAM": "startvr"
}

item_url = 'http://www.example.com/EN/items/Pages/yourrates.aspx'

response = session.post(url=item_url, params=payload, data=item_request_body,
                        headers={'Referer': form_response.url})

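The session takes care of the headers (the User-Agent and Accept values set above); the Referer header is added only on the POST request made with the session.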
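One thing the code above leaves open is where viewstate comes from. ASP.NET usually regenerates hidden fields such as __VIEWSTATE on each response, so it is probably safer to read every hidden input from the page you just fetched than to paste a captured value. Here is a rough sketch using only the standard library's HTMLParser as a stand-in for BeautifulSoup; the HiddenInputCollector class and the HTML fragment are invented for illustration.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


# Made-up fragment standing in for form_response.text from the GET request.
sample_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="/wEPDexample" />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWexample" />
  <input type="text" name="keywords" value="Search our site" />
</form>
"""

collector = HiddenInputCollector()
collector.feed(sample_html)
hidden_fields = collector.fields
print(hidden_fields)
```

The collected dict could then be used as the starting point for the POST body, with the callback-specific fields added on top.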


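Very helpful, Martijn, thanks! I'm still working through it, but I'll be sure to confirm once I've finished implementing and testing the solution :) - David K.
Also, do you know how I should encode this kind of content? __CALLBACKID=ctl00%24SPWebPartManager1%24g_dbb9e9c7_fe1d_46df_8789_99a6c9db4b22 It gives me an error in the callback, possibly because of the unusual percent signs. - David K.
Include the decoded value and leave the encoding to requests. %24 is an encoded $, for example. - Martijn Pieters
@DavidK.: robobrowser is not a browser layer. It is requests plus BeautifulSoup plus some glue code for handling forms. - Martijn Pieters
Ah, that sounds even better! At the moment I'm just using bs4 and requests, but that could be a handy package. - David K.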
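On the encoding question in the comments above, a small sketch may help show why the captured value should be decoded first. It uses urllib.parse.urlencode as a stand-in for the form encoding that requests applies to a data dict, which is an assumption made to keep the example free of network calls.

```python
from urllib.parse import unquote, urlencode

# Value as copied from the browser capture (already percent-encoded).
captured = "ctl00%24SPWebPartManager1%24g_dbb9e9c7_fe1d_46df_8789_99a6c9db4b22"

# Decode it once; this is the value to put in the payload dict.
decoded = unquote(captured)

# Encoding the decoded value reproduces what the browser sent.
once = urlencode({"__CALLBACKID": decoded})

# Encoding the already-encoded value escapes the percent signs again.
twice = urlencode({"__CALLBACKID": captured})

print(decoded)
print(once)
print(twice)
```

In the double-encoded variant each %24 turns into %2524, so the server would receive a literal "%24" instead of "$".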

1

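This relates to the question title rather than exactly to the poster's situation: I'd like to add a helpful tip to Martijn's answer, which already includes some general advice on handling POST requests with the requests library.

Inspecting the request payload in the browser (for example, the Network tab of Chrome DevTools) can show multiple instances of certain keys/fields in the payload.

Example Chrome request payload: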

...
"ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "AcceptedModified",
"ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "InvoiceFullyDisputed",
"ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "DisputedItemsClosed",
...

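Copying the browser's request and reproducing it exactly in the payload/data argument of your request will not work (or at least will not give the result you expect, even though you may still get a 200 status code in response). A Python dict keeps only one value per key, so only the value of the last occurrence of the key/field is sent.

Request data/payload that will NOT work (or at least will not give the result you expect):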

payload = {
   ...
   "ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "AcceptedModified",
   "ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "InvoiceFullyDisputed",
   "ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": "DisputedItemsClosed",
   ...
}
r = session.post(url, headers=headers, data=payload)

Instead, you have to put the values of such repeated keys/fields in a list:

Request data/payload that will work (or give the expected result):

payload = {
   ...
   "ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus": ["AcceptedModified", "InvoiceFullyDisputed", "DisputedItemsClosed"],
   ...
}
r = session.post(url, headers=headers, data=payload)

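...It took me hours to figure this out, digging deep into how ASP.NET sites work and thinking that was what I needed to understand. It wasn't. So I just wanted to save someone else the time; hope it helps.

Thanks to this Stack Overflow question for helping me realize this.

Note: you can check exactly what payload was sent by looking at r.request.body on the response object (r here). That's how I realized my payload was missing some information (namely the repeated fields/keys).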
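The same check can be done before anything is sent by preparing the request and printing its body. This sketch uses a hypothetical URL and only the one repeated field from the example above.

```python
import requests

url = "http://www.example.com/invoices.aspx"  # hypothetical endpoint
field = "ctl00$cphMain$ctlInvoiceStatuses$lbInvoiceStatus"

# A dict literal with a repeated key silently keeps only the last value.
collapsed = {
    field: "AcceptedModified",
    field: "InvoiceFullyDisputed",
    field: "DisputedItemsClosed",
}

# A list value makes requests emit the key once per item.
repeated = {
    field: ["AcceptedModified", "InvoiceFullyDisputed", "DisputedItemsClosed"],
}

single = requests.Request("POST", url, data=collapsed).prepare()
multi = requests.Request("POST", url, data=repeated).prepare()

print(single.body)
print(multi.body)
```

The data argument also accepts a list of (key, value) tuples, which can be handy when the order of repeated fields matters.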

