Downloading a file with Python urllib2: how do I check the file size?

Question

9

And if the file is too large, how do I stop the download? I don't want to download files larger than 12 MB.

import random
import urllib2

request = urllib2.Request(ep_url)
request.add_header('User-Agent', random.choice(agents))
thefile = urllib2.urlopen(request).read()

- TIMEX

4 Answers

7

You could say:

maxlength = 12 * 1024 * 1024
# Read one byte more than allowed so an oversized body can be detected.
thefile = urllib2.urlopen(request).read(maxlength + 1)
if len(thefile) == maxlength + 1:
    raise ThrowToysOutOfPramException()

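But of course you have still read 12 MB of unwanted data. If you want to minimise the risk of that happening, you can check the HTTP Content-Length header, if it is present (it might not be). To do that, though, you need to drop down from the more general urllib2 to the less general httplib: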

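import httplib
import urlparse

u = urlparse.urlparse(ep_url)
cn = httplib.HTTPConnection(u.netloc)
# u.path carries no query string; fall back to '/' for a bare host URL.
cn.request('GET', u.path or '/', headers={'User-Agent': ua})
r = cn.getresponse()

try:
    # A missing or malformed Content-Length is treated as 0 (unknown).
    l = int(r.getheader('Content-Length', '0'))
except ValueError:
    l = 0
if l > maxlength:
    raise IAmCrossException()

# The header may be absent or wrong, so the read is still capped.
thefile = r.read(maxlength + 1)
if len(thefile) == maxlength + 1:
    raise IAmStillCrossException()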

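You can also check the length before asking for the file, if you like. This is basically the same as above, except that it uses the 'HEAD' method instead of 'GET'.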
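
A rough sketch of the HEAD variant described above, kept close to the httplib code in this answer. The function name `head_content_length`, the default User-Agent string, and the try/except import (so the sketch also runs on Python 3) are assumptions for illustration, not part of the original answer.

```python
try:
    # Python 2 module names, as used in this answer.
    import httplib
    import urlparse
except ImportError:
    # Python 3 equivalents, aliased so the rest of the sketch is unchanged.
    import http.client as httplib
    import urllib.parse as urlparse


def head_content_length(url, user_agent='size-check-sketch'):
    """Return the Content-Length announced for url, or None if unknown."""
    u = urlparse.urlparse(url)
    if u.scheme == 'https':
        conn = httplib.HTTPSConnection(u.netloc)
    else:
        conn = httplib.HTTPConnection(u.netloc)
    # Rebuild the request target, keeping any query string.
    path = u.path or '/'
    if u.query:
        path += '?' + u.query
    try:
        conn.request('HEAD', path, headers={'User-Agent': user_agent})
        response = conn.getresponse()
        value = response.getheader('Content-Length')
    finally:
        conn.close()
    try:
        return int(value)
    except (TypeError, ValueError):
        # Header missing (None) or not a number.
        return None
```

Like the GET version, this does not follow redirects, and a server may omit or misreport the header, so a capped read is still needed when the file is actually fetched.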

- bobince

1

This is a better solution, because Content-Length is unreliable (someone may set it incorrectly). - Taha Jahangir
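
Since the header can be wrong, the limit has to be enforced on the bytes actually received. A minimal sketch of that idea, reading in chunks from any file-like object and giving up once the cap is crossed; `read_capped`, `FileTooLargeError`, and the 64 KiB chunk size are hypothetical choices, not something from the answers here.

```python
class FileTooLargeError(Exception):
    """Hypothetical exception used only in this sketch."""


def read_capped(fileobj, maxlength, chunk_size=64 * 1024):
    """Read fileobj to the end, raising once more than maxlength bytes arrive."""
    chunks = []
    total = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            # End of stream reached without exceeding the cap.
            break
        total += len(chunk)
        if total > maxlength:
            raise FileTooLargeError('more than %d bytes received' % maxlength)
        chunks.append(chunk)
    return b''.join(chunks)
```

With the question's code this would be called as `read_capped(urllib2.urlopen(request), 12 * 1024 * 1024)`. It still transfers up to the limit before failing, but it never holds much more than the limit in memory, and the loop is a natural place to write chunks to disk instead.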

1

This approach works if the Content-Length header is set.

import urllib2          
req = urllib2.urlopen("http://example.com/file.zip")
total_size = int(req.info().getheader('Content-Length'))

- Gourneau

You don't need .strip(): 1. getheader() already returns a stripped value 2. int() doesn't care about leading/trailing whitespace. - jfs

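Also, there is no point in using int(info().getheader()) if you don't set a default value: the TypeError that int() raises on None is less appropriate than the KeyError you would get from req.headers (note: req.info() is req.headers). - jfs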

@Gourneau - Does this still work if the given URL is an ftp:// URL? - Pankaj Parashar

@PankajParashar No, "Content-Length" is taken from the HTTP headers, so it only applies to HTTP. This might be what you need, though: https://dev59.com/iXA75IYBdhLWcg3wi5wr#5241914 - Gourneau
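
The points raised in these comments (a missing header, a malformed value, no need for .strip()) can be folded into one small helper. This is only a sketch; `parse_content_length` is a hypothetical name, and treating a negative value as unknown is an assumption.

```python
def parse_content_length(value):
    """Turn a raw Content-Length header value into an int, or None if unusable."""
    if value is None:
        # Header not sent at all.
        return None
    try:
        # int() ignores surrounding whitespace, so no .strip() is needed.
        length = int(value)
    except ValueError:
        return None
    # A negative length makes no sense; treat it as unknown.
    return length if length >= 0 else None
```

On Python 2 it would be used as `parse_content_length(req.info().getheader('Content-Length'))`, with a result of None meaning the size cannot be known up front.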

1

You could check the content-length in a HEAD request first, but be aware that this header isn't necessarily set. See "How do you send a HEAD HTTP request in Python 2?"

- SeriousCallersOnly

How do I check the content length in a HEAD request? Does that count as downloading the headers? - TIMEX

A HEAD request is a moot point if you want to use urllib/urllib2. Those modules only support GET and POST requests. - Andrew Dalke
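
urllib2 indeed has no dedicated HEAD call, but `Request.get_method()` decides which verb is sent, and overriding it is a commonly used workaround. A hedged sketch, where `HeadRequest` and `announced_length` are hypothetical names and the try/except import exists only so the snippet also runs on Python 3:

```python
try:
    import urllib2  # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent, aliased


class HeadRequest(urllib2.Request):
    """Request whose HTTP verb is HEAD instead of the default GET/POST."""

    def get_method(self):
        return 'HEAD'


def announced_length(url):
    """Return the raw Content-Length header value for url, or None if absent."""
    response = urllib2.urlopen(HeadRequest(url))
    try:
        return response.headers.get('Content-Length')
    finally:
        response.close()
```

Redirects are still handled by urllib2's own redirect handler, which may reissue the redirected request as a plain GET, so this is best treated as a hint about the size rather than a guarantee that nothing is downloaded.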

- Andrew Dalke · Accepted Answer

There's no need to do as bobince did and drop down to httplib. You can do all of that with urllib2 directly:

>>> import urllib2
>>> f = urllib2.urlopen("http://dalkescientific.com")
>>> f.headers.items()
[('content-length', '7535'), ('accept-ranges', 'bytes'), ('server', 'Apache/2.2.14'),
 ('last-modified', 'Sun, 09 Mar 2008 00:27:43 GMT'), ('connection', 'close'),
 ('etag', '"19fa87-1d6f-447f627da7dc0"'), ('date', 'Wed, 28 Oct 2009 19:59:10 GMT'),
 ('content-type', 'text/html')]
>>> f.headers["Content-Length"]
'7535'
>>>

If you use httplib, you may have to implement redirect handling, proxy support, and the other nice things that urllib2 does for you.
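
Putting the pieces of this thread together with urllib2 alone might look like the sketch below: check the announced length first, then cap the read anyway in case the header is missing or wrong. `fetch_limited` and `FileTooLargeError` are hypothetical names, and the import fallback is only there so the snippet also runs on Python 3.

```python
try:
    import urllib2  # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2  # Python 3 equivalent, aliased


class FileTooLargeError(Exception):
    """Hypothetical exception used only in this sketch."""


def fetch_limited(url, maxlength=12 * 1024 * 1024, user_agent=None):
    """Download url, refusing bodies larger than maxlength bytes."""
    request = urllib2.Request(url)
    if user_agent is not None:
        request.add_header('User-Agent', user_agent)
    response = urllib2.urlopen(request)
    try:
        try:
            declared = int(response.headers.get('Content-Length'))
        except (TypeError, ValueError):
            declared = None  # header missing or malformed
        if declared is not None and declared > maxlength:
            # Give up before reading any of the body.
            raise FileTooLargeError('server announced %d bytes' % declared)
        # Read one extra byte so an oversized body can be detected.
        data = response.read(maxlength + 1)
        if len(data) > maxlength:
            raise FileTooLargeError('body exceeds %d bytes' % maxlength)
        return data
    finally:
        response.close()
```

In the question's setting the call would be `fetch_limited(ep_url, user_agent=random.choice(agents))`, with redirects and proxies still handled by urllib2.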