Headers missing when using Python's requests library


I am using the requests library (2.12.4) with Python 3.5.2 to send a query to the Primer-BLAST website. Here is the script I wrote for this task:

#!/usr/bin/env python

import requests

# BaseURL being accessed
url = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi'

# Dictionary of query parameters
data = {
    'INPUT_SEQUENCE' : 'TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA',
    'ORGANISM'       : 'Mus musculus'
}

# Make a POST request and read the response
with requests.session() as session:
    poster = session.post(url, data=data)
    for key, value in poster.headers.items():
        print(key, ':', value)

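I need to retrieve the NCBI-RCGI-RetryURL field from the response headers. However, I can only see this field when I use an HTTP trace extension for Google Chrome. Below is the full trace of the POST and the response using Google Chrome: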

POST https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi
Origin: https://www.ncbi.nlm.nih.gov
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=reset
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8
Cookie: sv-userdata-key-www.ncbi.nlm.nih.gov=G5KxXzyQ81U_vs1aHK_7XDWciF1B8AjjDUmDunVbhIZhZ4p4t_SVK4ASpbTT8iDSJVcxBH9oUAB3K2xNWjp3G0koYCloBlYuSxdoIGIkYzl2; ncbi_sid=0751457F9A561D01_0000SID; _ga=GA1.2.567134514.1503994317; _gid=GA1.2.654339584.1503994317; _gat=1; starnext=MYGwlsDWB2CmAeAXAXAbgK7RAewIYBM4lkAmAXgAcAnMAW1ioCMRcBnRAMgBYzm3FWsXFWAALDgEZymHAUkBOMgAYA7AFYpXFQDF5AQTUA2ACIBRFRKVXrN2xI4klygMJcSltQA59R0wGY/S1tg63t3Shp6JhZ2AFI/PQA5AHlE03i9PnZBYTEMlLSHcgB3UoA6aGBGMAqQWgqwUTKAc2wANwceajoGLMR81NMHQwie6P4BtIy+nJFxEhVRqL7J9ISZoTnVh08l3pj+lWcC9KON3NFYo5OHRQkuLiUOPyd5K2eJMnvH5/JPDWefjIADNcCBBM8eIgqOhYM81F83GpniMJH4SPJnosSFxnrtAoZ5Li/IoXp5DCpuE50RIpNxPlIAtxyNBcIgwG04Q8yDI8IQEJwuAiSBw1EDvk81DxPEo/KKEfIRUYyCRDIZRYtJbtaT81HcOIYnE9DAycUA=

HTTP/1.1 200 OK
Date: Tue, 29 Aug 2017 13:38:27 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Referrer-Policy: origin-when-cross-origin
Content-Security-Policy: upgrade-insecure-requests
Cache-Control: no-cache, no-store, max-age=0, private, must-revalidate
Expires: 0
NCBI-PHID: 0C421A7A9A56E5310000000000000001.m_2
NCBI-RCGI-RetryURL: https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi?ctg_time=1504013907&job_key=aWO2H68Wor6FhLSBueGQs8P6gYHu6Zqc7w
NCBI-SID: 0751457F9A561D01_0000SID
Pragma: no-cache
Access-Control-Allow-Methods: POST, GET, PUT, OPTIONS, PATCH, DELETE
Access-Control-Allow-Origin: https://www.ncbi.nlm.nih.gov
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: Origin,X-Accept-Charset,X-Accept,Content-Type,X-Requested-With,NCBI-SID,NCBI-PHID
Content-Type: text/html
Set-Cookie: ncbi_sid=0751457F9A561D01_0000SID; domain=.nih.gov; path=/; expires=Wed, 29 Aug 2018 13:38:27 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
X-UA-Compatible: IE=Edge
X-XSS-Protection: 1; mode=block
Keep-Alive: timeout=1, max=9
Connection: Keep-Alive
Transfer-Encoding: chunked

Here are all the headers I get from my script:

Date : Tue, 29 Aug 2017 14:41:08 GMT
Server : Apache
Strict-Transport-Security : max-age=31536000; includeSubDomains; preload
Referrer-Policy : origin-when-cross-origin
Content-Security-Policy : upgrade-insecure-requests
Accept-Ranges : bytes
Vary : Accept-Encoding
Content-Encoding : gzip
X-UA-Compatible : IE=Edge
X-XSS-Protection : 1; mode=block
Content-Length : 2516
Keep-Alive : timeout=1, max=10
Connection : Keep-Alive
Content-Type : text/html

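The NCBI-RCGI-RetryURL field is important because it contains the URL I need to send a GET request to in order to retrieve the results.

Edit:

Updated script following Maurice Meyer's suggestion: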

#!/usr/bin/env python

import requests

# BaseURL being accessed
url = 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/primertool.cgi'

# Dictionary of query parameters
data = {
    'INPUT_SEQUENCE' : 'TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA',
    'ORGANISM'       : 'Mus musculus'
}

# Extra headers
headers = {
    'Origin' : 'https://www.ncbi.nlm.nih.gov',
    'Upgrade-Insecure-Requests' : '1',
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36',
    'Content-Type' : 'multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9',
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Referer' : 'https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=reset',
    'Accept-Encoding' : 'gzip, deflate, br',
    'Accept-Language' : 'en-US,en;q=0.8',
    'Cookie' : 'sv-userdata-key-www.ncbi.nlm.nih.gov=G5KxXzyQ81U_vs1aHK_7XDWciF1B8AjjDUmDunVbhIZhZ4p4t_SVK4ASpbTT8iDSJVcxBH9oUAB3K2xNWjp3G0koYCloBlYuSxdoIGIkYzl2; ncbi_sid=0751457F9A561D01_0000SID; _ga=GA1.2.567134514.1503994317; _gid=GA1.2.654339584.1503994317; _gat=1; starnext=MYGwlsDWB2CmAeAXAXAbgK7RAewIYBM4lkAmAXgAcAnMAW1ioCMRcBnRAMgBYzm3FWsXFWAALDgEZymHAUkBOMgAYA7AFYpXFQDF5AQTUA2ACIBRFRKVXrN2xI4klygMJcSltQA59R0wGY/S1tg63t3Shp6JhZ2AFI/PQA5AHlE03i9PnZBYTEMlLSHcgB3UoA6aGBGMAqQWgqwUTKAc2wANwceajoGLMR81NMHQwie6P4BtIy+nJFxEhVRqL7J9ISZoTnVh08l3pj+lWcC9KON3NFYo5OHRQkuLiUOPyd5K2eJMnvH5/JPDWefjIADNcCBBM8eIgqOhYM81F83GpniMJH4SPJnosSFxnrtAoZ5Li/IoXp5DCpuE50RIpNxPlIAtxyNBcIgwG04Q8yDI8IQEJwuAiSBw1EDvk81DxPEo/KKEfIRUYyCRDIZRYtJbtaT81HcOIYnE9DAycUA='
}

# Make a POST request and read the response
with requests.session() as session:
    poster = session.post(url, data=data, headers=headers)
    for key, value in poster.headers.items():
        print(key, ':', value)

The updated output still shows no difference:

Date : Tue, 29 Aug 2017 15:05:27 GMT
Server : Apache
Strict-Transport-Security : max-age=31536000; includeSubDomains; preload
Referrer-Policy : origin-when-cross-origin
Content-Security-Policy : upgrade-insecure-requests
Accept-Ranges : bytes
Vary : Accept-Encoding
Content-Encoding : gzip
X-UA-Compatible : IE=Edge
X-XSS-Protection : 1; mode=block
Content-Length : 2517
Keep-Alive : timeout=1, max=10
Connection : Keep-Alive
Content-Type : text/html

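Send the same headers as Chrome (User-Agent, Content-Type, Referer, etc.) and you will receive the same headers as Chrome. - Maurice Meyer
@MauriceMeyer I tried including the headers (see my edit to the original question), but there is still no difference in the header output. - jma1991
@MauriceMeyer For me, this made things worse. With the default settings I was getting more headers, but adding the Chrome headers reduced my output to the same output as the OP's. - illright
Of all the headers I get, only three are prefixed with "NCBI": NCBI-PHID, NCBI-RCGI-JobStatus and NCBI-SID. But no retry URL. - illright
@Leva7 After sending the job request with the browser, the first trace output only contains NCBI-RCGI-RetryURL; the next trace update then includes NCBI-RCGI-JobStatus while the job is actually running. - jma1991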
1 Answer


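The request data is completely different between the two.

Specifically, the request body data. So no headers are missing when you use Python's requests library; what is missing is information in the POST request sent to the server.

You can't simply copy and paste headers such as: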

'Content-Type' : 'multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9',
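
One way to see why copying that header cannot work: when a dict is passed as data=, requests form-encodes it (application/x-www-form-urlencoded), so the body contains none of the boundary markers that the copied header announces. The sketch below approximates that body with the standard library's urlencode; the shortened sequence is a placeholder for illustration, not the real query.

```python
from urllib.parse import urlencode

# Placeholder values: the sequence is shortened for readability.
data = {"INPUT_SEQUENCE": "TCTTCTGAG", "ORGANISM": "Mus musculus"}

# Roughly the body that a dict passed as data= produces.
body = urlencode(data)
print(body)  # INPUT_SEQUENCE=TCTTCTGAG&ORGANISM=Mus+musculus

# The header copied from Chrome promises parts separated by this boundary.
copied_content_type = (
    "multipart/form-data; boundary=----WebKitFormBoundaryBflp51Ny9ReeA5A9"
)
boundary = copied_content_type.split("boundary=", 1)[1]

# Each multipart part would start with "--" + boundary; none is present.
print(("--" + boundary) in body)  # False
```

The header and the body describe two different encodings, so a server that trusts the header has no parts to parse.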

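Nor can you post just a couple of fields such as INPUT_SEQUENCE and ORGANISM. In any case, the value you currently send for ORGANISM is plainly wrong: a quick glance at the trace shows it should be Mus musculus (taxid:10090) rather than Mus musculus.
So you need to look at the whole request, headers and body, and then build a request that contains the data the server expects. Take a look: you are missing a lot of the data the server needs in order to respond.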
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="INPUT_SEQUENCE"

TCTTCTGAGAAAGTCTGAGGCTCCTTAGTACCTTCTCTAGTATGAACTGTTCAGCCTGCCCGCAAGTTGTAACTACGCAGGCGCCAAGACAGCCAACCAAGGAGGCTGCAGA
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEQFILE"; filename=""
Content-Type: application/octet-stream


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER5_START"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER5_END"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER3_START"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER3_END"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_LEFT_INPUT"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_RIGHT_INPUT"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_PRODUCT_MIN"

70
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_PRODUCT_MAX"

1000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_NUM_RETURN"

10
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MIN_TM"

57.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_OPT_TM"

60.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MAX_TM"

63.0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_MAX_DIFF_TM"

3
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_ON_SPLICE_SITE"

0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SPLICE_SITE_OVERLAP_5END"

7
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SPLICE_SITE_OVERLAP_3END"

4
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="MIN_INTRON_SIZE"

1000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="MAX_INTRON_SIZE"

1000000
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEARCH_SPECIFIC_PRIMER"

on
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="SEARCHMODE"

0
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="PRIMER_SPECIFICITY_DATABASE"

refseq_mrna
------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="CUSTOM_DB"


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="CUSTOMSEQFILE"; filename=""
Content-Type: application/octet-stream


------WebKitFormBoundaryJVAJqDi2cI4BTfmc
Content-Disposition: form-data; name="ORGANISM"

Mus musculus (taxid:10090)
------WebKitFormBoundaryJVAJqDi2cI4BTfmc

etc...
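
To avoid retyping every field, the captured body can be turned into a dict of form fields. The following is only a sketch under stated assumptions: parse_multipart_trace and as_requests_files are hypothetical helper names, the sample is a short excerpt of the trace above, and file parts such as SEQFILE are simply recorded with an empty value.

```python
import re


def parse_multipart_trace(body, boundary):
    """Map each form field name in a captured multipart body to its value."""
    delimiter = "--" + boundary  # body lines carry two extra dashes
    fields = {}
    for part in body.split(delimiter):
        part = part.replace("\r\n", "\n").strip("\n")
        if not part or part == "--":  # preamble or closing marker
            continue
        # Part headers and the value are separated by the first blank line.
        head, _, value = part.partition("\n\n")
        match = re.search(r'\bname="([^"]*)"', head)
        if match:
            fields[match.group(1)] = value
    return fields


def as_requests_files(fields):
    """Build (None, value) tuples, which requests sends as plain form parts."""
    return {name: (None, value) for name, value in fields.items()}


# Short excerpt of the trace above, used as sample input.
sample = (
    "------WebKitFormBoundaryJVAJqDi2cI4BTfmc\n"
    'Content-Disposition: form-data; name="PRIMER5_START"\n'
    "\n"
    "\n"
    "------WebKitFormBoundaryJVAJqDi2cI4BTfmc\n"
    'Content-Disposition: form-data; name="PRIMER_PRODUCT_MIN"\n'
    "\n"
    "70\n"
    "------WebKitFormBoundaryJVAJqDi2cI4BTfmc\n"
    'Content-Disposition: form-data; name="ORGANISM"\n'
    "\n"
    "Mus musculus (taxid:10090)\n"
    "------WebKitFormBoundaryJVAJqDi2cI4BTfmc--\n"
)

fields = parse_multipart_trace(sample, "----WebKitFormBoundaryJVAJqDi2cI4BTfmc")
files = as_requests_files(fields)
print(fields)
```

The resulting mapping could then be passed as session.post(url, files=files) without setting Content-Type by hand, so that requests generates a boundary that matches the body it sends. Whether Primer-BLAST then returns NCBI-RCGI-RetryURL has not been verified here.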
