Handling IncompleteRead and URLError

Question

This is part of a web-mining script.

def printer(q,missing):
    while 1:
        tmpurl=q.get()
        try:
            image=urllib2.urlopen(tmpurl).read()
        except httplib.HTTPException:
            missing.put(tmpurl)
            continue
        wf=open(tmpurl[-35:]+".jpg","wb")
        wf.write(image)
        wf.close()

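q is a queue of URLs, and missing is an initially empty queue that collects the URLs that raised errors.

It runs in parallel across 10 threads.

Every time I run it, I get the following: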

  File "C:\Python27\lib\socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "C:\Python27\lib\httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "C:\Python27\lib\httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "C:\Python27\lib\httplib.py", line 649, in _safe_read
    raise IncompleteRead(''.join(s), amt)
IncompleteRead: IncompleteRead(5274 bytes read, 2918 more expected)

But I do have an except clause... I tried a few other things, such as

httplib.IncompleteRead
urllib2.URLError

and even

image=urllib2.urlopen(tmpurl,timeout=999999).read()

but none of these worked...

How can I catch IncompleteRead and URLError?

- from __future__

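A bit late, but this is the first result on Google. So, https://dev59.com/gmYq5IYBdhLWcg3w2EFi#14206036 should solve your problem. By the way, if you want to catch multiple exceptions, you usually put them in a tuple: except (httplib.IncompleteRead, urllib2.URLError). - Vincent Ketelaars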

1 Answer

- Michael Leonard · Answer 1

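I think the right answer to this question depends on what you consider a "URL that raises an error".

Ways to catch multiple exceptions

If you think any URL that raises an exception should be added to the missing queue, you can do this: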

try:
    image=urllib2.urlopen(tmpurl).read()
except (httplib.HTTPException, httplib.IncompleteRead, urllib2.URLError):
    missing.put(tmpurl)
    continue

This catches any of these three exceptions and adds the URL to the missing queue. More simply, you could do:

try:
    image=urllib2.urlopen(tmpurl).read()
except:
    missing.put(tmpurl)
    continue

This catches any exception at all, but it is not considered Pythonic and can hide other errors in your code.
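
One possible middle ground between listing specific exceptions and a bare except is to catch Exception, which leaves KeyboardInterrupt and SystemExit alone. The sketch below is only an illustration: the `fetch` callable, the function names, and the use of a plain list for `missing` are hypothetical stand-ins for `urllib2.urlopen(tmpurl).read()` and the queue, chosen so the example runs without network access.

```python
def try_fetch_bare(fetch, url, missing):
    """Bare except: swallows everything, including KeyboardInterrupt."""
    try:
        return fetch(url)
    except:  # deliberately bare, for comparison only
        missing.append(url)
        return None


def try_fetch_exception(fetch, url, missing):
    """except Exception: KeyboardInterrupt and SystemExit still propagate."""
    try:
        return fetch(url)
    except Exception:
        missing.append(url)
        return None


def interrupted(url):
    # Hypothetical fetcher that simulates Ctrl+C arriving mid-download.
    raise KeyboardInterrupt
```

With the bare version, an interrupt is silently recorded as a "missing" URL and the loop keeps going; with the second version, the interrupt reaches the caller and the script can actually be stopped.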

If by "URL that raises an error" you mean only a URL that raises httplib.HTTPException, but you still want to keep processing when other errors occur, you can do this:

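try:
    image=urllib2.urlopen(tmpurl).read()
except (httplib.IncompleteRead, urllib2.URLError):
    # Listed first: IncompleteRead is a subclass of HTTPException,
    # so the clause below would otherwise catch it.
    continue
except httplib.HTTPException:
    missing.put(tmpurl)
    continue

This adds the URL to the missing queue only when some other httplib.HTTPException is raised. httplib.IncompleteRead and urllib2.URLError are still caught, so they won't crash your script, but those URLs are skipped without being recorded.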
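
The order of the except clauses matters here, because Python uses the first clause that matches and httplib.IncompleteRead derives from httplib.HTTPException. A small check using only the standard library (the helper name `classify` is made up for illustration, and the import fallback is there only so the snippet also runs on Python 3, where the module is called http.client):

```python
try:
    import httplib                  # Python 2, as in the question
except ImportError:
    import http.client as httplib   # Python 3 name for the same module


def classify(exc):
    """Report which except clause handles exc when HTTPException comes first."""
    try:
        raise exc
    except httplib.HTTPException:
        return "HTTPException clause"
    except httplib.IncompleteRead:
        # Never reached: the broader clause above always matches first.
        return "IncompleteRead clause"


print(issubclass(httplib.IncompleteRead, httplib.HTTPException))  # True
print(classify(httplib.IncompleteRead(b"partial")))  # HTTPException clause
```

So a handler for IncompleteRead only takes effect if it appears before the handler for HTTPException.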

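Iterating over the queue

As an aside, while 1 loops always make me a little nervous. You could loop over the queue's contents with the following pattern instead, though you're free to keep doing it your way: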

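# iter() needs a callable here: it calls q.get() until it returns "STOP",
# so the producer must put "STOP" on the queue when the URLs run out.
for tmpurl in iter(q.get, "STOP"):
    # rest of your code goes here
    pass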
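
With several worker threads, each thread needs to see the sentinel once, otherwise some of them would block forever on q.get(). A minimal sketch of that arrangement, assuming "STOP" is never a real URL; the `run` helper and the `seen` queue are hypothetical and merely stand in for the download-and-save step:

```python
import threading

try:
    import Queue as queue   # Python 2
except ImportError:
    import queue            # Python 3

STOP = "STOP"  # sentinel value; assumed never to be a real URL


def worker(q, seen):
    # iter(callable, sentinel) calls q.get() repeatedly until it returns STOP.
    for tmpurl in iter(q.get, STOP):
        seen.put(tmpurl)    # placeholder for the download-and-save step


def run(urls, num_threads=3):
    q, seen = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(q, seen))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for url in urls:
        q.put(url)
    for _ in threads:
        q.put(STOP)         # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    # All workers have finished, so qsize() is stable at this point.
    return sorted(seen.get() for _ in range(seen.qsize()))
```

Because every thread consumes exactly one sentinel before leaving its loop, join() returns once all URLs have been handled.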

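Handling files safely

Also, unless there is a compelling reason not to, you should use a context manager to open and modify files. Your three file-handling lines would become: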

with open(tmpurl[-35:]+".jpg","wb") as wf:
    wf.write(image)

The context manager takes care of closing the file, and it does so even if an exception occurs while writing to it.
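
To see that behavior in isolation, here is a small sketch in which the write fails halfway through. The `failing_chunks` generator, the temporary path, and the `handle` variable are all made up for the demonstration and are not part of the original script:

```python
import os
import tempfile


def failing_chunks():
    # Hypothetical data source that breaks after the first chunk.
    yield b"first part"
    raise IOError("simulated failure in the middle of a write")


path = os.path.join(tempfile.mkdtemp(), "example.jpg")
handle = None
try:
    with open(path, "wb") as wf:
        handle = wf                 # keep a reference so we can inspect it later
        for chunk in failing_chunks():
            wf.write(chunk)
except IOError:
    pass                            # the error still propagates out of the with block

print(handle.closed)                # True: the file was closed despite the error
```

Note that the with statement does not suppress the exception; it only guarantees the cleanup, so the surrounding try/except is still needed if the script should carry on.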