HTTP Basic authentication doesn't seem to work with urllib2 in Python

9

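I'm trying to download a page protected with Basic authentication using urllib2. I'm on Python 2.7, but I also tried it on another computer running Python 2.5 and hit exactly the same problem. I followed the example given in this guide as closely as I could, and ended up with the following code: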

import urllib2

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, "http://authenticationsite.com/", "protected", "password")
authhandler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(authhandler)

f = opener.open("http://authenticationsite.com/content.html")
print f.read()
f.close()

Unfortunately the server isn't mine, so I can't share the details; I've swapped them out above and below. When I run the code I get the following traceback:

  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 401: Authorization Required

Now, the interesting part is what I see when I monitor the TCP traffic on the computer with ngrep:

ngrep host 74.125.224.49
interface: wlan0 (192.168.1.0/255.255.255.0)
filter: (ip) and ( host 74.125.224.49 )
####
T 192.168.1.74:34366 -> 74.125.224.49:80 [AP]
  GET /content.html HTTP/1.1..
  Accept-Encoding: identity..
  Host: authenticationsite.com..
  Connection: close..
  User-Agent: Python-urllib/2.7....

##
T 74.125.224.49:80 -> 192.168.1.74:34366 [AP]
  HTTP/1.1 401 Authorization Required..
  Date: Sun, 27 Feb 2011 03:39:31 GMT..
  Server: Apache/2.2.3 (Red Hat)..
  WWW-Authenticate: Digest realm="protected", nonce="6NSgTzudBAA=ac585d1f7ae0632c4b90324aff5e39e0f1fc2505", algorithm=MD5, qop="auth"..
  Content-Length: 486..
  Connection: close..
  Content-Type: text/html; charset=iso-8859-1....
  <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">.
  <html><head>.
  <title>401 Authorization Required</title>.
  </head><body>.
  <h1>Authorization Required</h1>.
  <p>This server could not verify that you.
  are authorized to access the document.
  requested.  Either you supplied the wrong.
  credentials (e.g., bad password), or your.
  browser doesn't understand how to supply.
  the credentials required.</p>.
  <hr>.
  <address>Apache/2.2.3 (Red Hat) Server at authenticationsite.com Port 80</address>.
  </body></html>.

####

It looks as though urllib2 throws the exception after the initial 401 response without even attempting to supply the credentials.
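The same check can be made from Python instead of a packet capture by catching the 401 and reading the scheme the server asked for. The following is only a sketch and not part of the original post: it assumes Python 3, where the urllib2 classes live in urllib.request and urllib.error, and the helper names challenge_schemes and schemes_for are hypothetical. It takes the first word of each WWW-Authenticate header line, so several challenges packed into one header line would be reported as one.

```python
import urllib.error
import urllib.request


def challenge_schemes(headers):
    """Return the auth scheme named by each WWW-Authenticate header line."""
    values = headers.get_all("WWW-Authenticate") or []
    # The scheme is the first whitespace-delimited token, e.g. "Digest".
    return [value.split(None, 1)[0] for value in values if value.strip()]


def schemes_for(url):
    """Request url without credentials and report the schemes it demands."""
    try:
        urllib.request.urlopen(url).close()
    except urllib.error.HTTPError as err:
        if err.code == 401:
            return challenge_schemes(err.headers)
        raise
    return []  # no 401 challenge was issued for this URL
```

For the capture above, the header line starts with "Digest", so a Basic-only handler has nothing it can answer.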

For comparison, here is the ngrep output when I authenticate in a web browser:

ngrep host 74.125.224.49
interface: wlan0 (192.168.1.0/255.255.255.0)
filter: (ip) and ( host 74.125.224.49 )
####
T 192.168.1.74:36102 -> 74.125.224.49:80 [AP]
  GET /content.html HTTP/1.1..
  Host: authenticationsite.com..
  User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.12) Gecko/20101027 Firefox/3.6.12..
  Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8..
  Accept-Language: en-us,en;q=0.5..
  Accept-Encoding: gzip,deflate..
  Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7..
  Keep-Alive: 115..
  Connection: keep-alive....

##
T 74.125.224.49:80 -> 192.168.1.74:36102 [AP]
  HTTP/1.1 401 Authorization Required..
  Date: Sun, 27 Feb 2011 03:43:42 GMT..
  Server: Apache/2.2.3 (Red Hat)..
  WWW-Authenticate: Digest realm="protected", nonce="rKCfXjudBAA=0c1111321169e30f689520321dbcce37a1876bbe", algorithm=MD5, qop="auth"..
  Content-Length: 486..
  Connection: close..
  Content-Type: text/html; charset=iso-8859-1....
  <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">.
  <html><head>.
  <title>401 Authorization Required</title>.
  </head><body>.
  <h1>Authorization Required</h1>.
  <p>This server could not verify that you.
  are authorized to access the document.
  requested.  Either you supplied the wrong.
  credentials (e.g., bad password), or your.
  browser doesn't understand how to supply.
  the credentials required.</p>.
  <hr>.
  <address>Apache/2.2.3 (Red Hat) Server at authenticationsite.com Port 80</address>.
  </body></html>.

#########
T 192.168.1.74:36103 -> 74.125.224.49:80 [AP]
  GET /content.html HTTP/1.1..
  Host: authenticationsite.com..
  User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.12) Gecko/20101027 Firefox/3.6.12..
  Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8..
  Accept-Language: en-us,en;q=0.5..
  Accept-Encoding: gzip,deflate..
  Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7..
  Keep-Alive: 115..
  Connection: keep-alive..
  Authorization: Digest username="protected", realm="protected", nonce="rKCfXjudBAA=0c1111199162342689520550dbcce37a1876bbe", uri="/content.html", algorithm=MD5, response="3b65dadaa00e1d6a1892ffff49f9f325", qop=auth, nc=00000001, cnonce="7636125b7fde3d1b"....

##

This is followed by the content of the site.
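The browser's second request never sends the password itself. Its Authorization header carries a response value, which is a hash built from the credentials and the server's nonce. Below is a minimal sketch of that calculation for algorithm=MD5 with qop=auth, following RFC 2617. The function name digest_response is hypothetical, and in normal use the library's digest handler does this work for you.

```python
import hashlib


def _md5_hex(text):
    """Hex MD5 digest of a text string, as used by Digest authentication."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(user, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """Compute the RFC 2617 response value for algorithm=MD5, qop=auth."""
    ha1 = _md5_hex("%s:%s:%s" % (user, realm, password))  # who is asking
    ha2 = _md5_hex("%s:%s" % (method, uri))               # what is requested
    # The server's nonce and the client's nc/cnonce tie the hash to this exchange.
    return _md5_hex(":".join([ha1, nonce, nc, cnonce, qop, ha2]))
```

Because the nonce changes with each challenge, a client cannot build this header until it has seen the server's 401 response, which is why the browser capture shows two requests.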

I've been at this for a while and can't figure out what I'm doing wrong. I'd be very grateful if anyone could help me out!

3 Answers

9

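I think it's because of this:

WWW-Authenticate: Digest

It looks like the resource is protected with Digest authentication rather than Basic. That means you should use urllib2.HTTPDigestAuthHandler instead.

The code might look like this: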

import urllib2

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, "http://authenticationsite.com/", "protected", "password")

# use HTTPDigestAuthHandler instead here
authhandler = urllib2.HTTPDigestAuthHandler(passman)
opener = urllib2.build_opener(authhandler)

res = opener.open("http://authenticationsite.com/content.html")
print res.read()
res.close()
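If you don't know in advance which scheme a server uses, one option is to register both handlers against the same password manager. The sketch below is an illustration rather than part of the original answer: it assumes Python 3, where these urllib2 classes live in urllib.request, and build_auth_opener is a hypothetical helper name.

```python
import urllib.request


def build_auth_opener(base_url, user, password):
    """Build an opener that can answer either Basic or Digest challenges."""
    passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None makes the credentials apply to any realm under base_url.
    passman.add_password(None, base_url, user, password)
    return urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(passman),
        urllib.request.HTTPDigestAuthHandler(passman),
    )
```

Usage would be build_auth_opener("http://authenticationsite.com/", "protected", "password").open(url). Neither handler covers other schemes such as Negotiate or NTLM.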

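Thanks, you're exactly right! I really appreciate the help! - foob
I'm having trouble with a Python script that scrapes a website for URLs; the script is meant to pull out every page that contains a PDF. I'm behind a proxy, and the proxy asks for a username and password when I first open the browser. I can view the site and download PDFs from it through the browser, but I can't do the same from Python code. The error I get is "urllib.error.HTTPError: HTTP Error 401: Authorization Required", and I also get "AbstractDigestAuthHandler does not support the following scheme: 'Negotiate'". Am I missing something? - Bonson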

0

You need to use the Python NTLM module for this:

from ntlm import HTTPNtlmAuthHandler
import urllib2

# Placeholders: substitute your own credentials and site.
user = "Your_username"
password = "Your_password"

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, "http://Your_home_location/", user, password)

# NTLM handler from the third-party ntlm module
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)
opener = urllib2.build_opener(auth_NTLM)
# Install the opener globally so urllib2.urlopen() uses it
urllib2.install_opener(opener)

url = "http://Your_home_location/sub_locations"
response = urllib2.urlopen(url)

headers = response.info()
print("headers: {}".format(headers))

body = response.read()
print("response: " + body)

In the last part of the code, url holds the address to fetch; "Your_home_location/sub_locations" is a placeholder to replace with the real path. urllib2.urlopen() opens that URL and returns a response object. response.info() returns the response headers, which are stored in headers and printed. response.read() then reads the response body into body, which is printed last.


-1
import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')

-- http://docs.python.org/library/urllib2.html#examples


That's basically what I was already doing. As Victor Lin pointed out in another answer, the problem is that the server actually uses Digest authentication, not Basic authentication. - foob
