urllib2 urlopen read timeout/blocking

Question

3

Recently I have been working on a small crawler that downloads the images found at a given URL.

I use urlopen() from urllib2 together with open() and f.write():

Here is the code snippet:

import re
import urllib2

# the list of image URLs (regImg and pageHtml come from earlier code not shown)
imglist = re.findall(regImg, pageHtml)

# iterate to download images
for index, url in enumerate(imglist, 1):
    img = urllib2.urlopen(url)
    f = open(r'E:\OK\%s.jpg' % index, 'wb')
    print('To Read...')

    # potential timeout, may block for a long time
    # so I wonder whether there is any mechanism to enable retry when time exceeds a certain threshold
    f.write(img.read())
    f.close()
    print('Image %d is ready !' % index)

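In the code above, img.read() can block for a long time. When that happens, I would like to retry or reopen the image URL.

I am also concerned about the efficiency of the code above. If there are many images to download, a thread pool seems like a better approach.

Any suggestions? Thanks in advance.

P.S. I found that the read() method on the img object can block, so just adding a timeout parameter to urlopen() seems useless. But I found that file objects have no version of read() with a timeout. Any suggestions on this? Thanks very much.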

- destiny1020

4 Answers

2

An ugly-looking hack, but it seems to work.

import os, socket, threading, errno

def timeout_http_body_read(response, timeout = 60):
    def murha(resp):
        os.close(resp.fileno())
        resp.close()

    # set a timer to yank the carpet underneath the blocking read() by closing the os file descriptor
    t = threading.Timer(timeout, murha, (response,))
    try:
        t.start()
        body = response.read()
    except socket.error as se:
        if se.errno == errno.EBADF: # murha happened
            return (False, None)
        raise
    finally:
        # always stop the timer, so it cannot fire after read() returned or failed
        t.cancel()
    return (True, body)
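
A less invasive alternative to closing the file descriptor is to read the body in chunks and check an overall deadline between chunks. The following is only a sketch of that idea, not the approach used above. It is written for Python 3, and the names read_with_deadline, deadline_seconds and clock are made up for illustration. It assumes the connection was opened with a socket timeout, because the deadline is only checked between reads, so the total time is bounded only approximately.

```python
import time


def read_with_deadline(response, deadline_seconds, chunk_size=8192, clock=time.monotonic):
    """Read a file-like response in chunks, giving up once the deadline has passed.

    Returns (True, body) on success and (False, None) if the deadline was hit.
    """
    start = clock()
    chunks = []
    while True:
        # The deadline is checked only between chunks, so a single read()
        # that never returns would still hang without a socket timeout.
        if clock() - start > deadline_seconds:
            return (False, None)
        chunk = response.read(chunk_size)
        if not chunk:  # an empty read means the body is complete
            return (True, b"".join(chunks))
        chunks.append(chunk)
```

The return value mirrors the (success, body) tuple of the hack above, so the caller can decide whether to retry when the first element is False.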

- Anppa

1

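When you create the connection with urllib2.urlopen(), you can pass a timeout parameter.

As the documentation says:

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

That way you can control the maximum waiting time and catch the exception that is raised.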
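
As a rough illustration of "catch the exception and try again", here is a minimal retry sketch. It is written for Python 3, where urllib2.urlopen lives in urllib.request. The function name fetch_with_retry and the opener parameter are made up for this example. The timeout passed to urlopen() is also applied to the underlying socket, so a read() that receives no data for that long should raise socket.timeout instead of blocking forever.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_retry(url, timeout=10, retries=3, opener=urllib.request.urlopen):
    """Return the body of `url`, retrying when connecting or reading times out."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                response.close()
        except (socket.timeout, urllib.error.URLError) as exc:
            # Connect failures usually arrive wrapped in URLError,
            # read timeouts as a bare socket.timeout.
            last_error = exc
            print("Attempt %d/%d for %s failed: %s" % (attempt, retries, url, exc))
    raise last_error
```

The opener argument exists only so the function can be exercised without network access. In normal use it can be left at its default.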

- Cédric Julien

1

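I found that the read() method on the img object can block, so just adding a timeout parameter to urlopen() seems useless. But I found that file objects have no version of read() with a timeout. Any suggestions on this? Thanks very much. - destiny1020

@destiny1020 Did you ever find a good solution to this? I am running into read() blocking, which makes my script hang. - CatShoes

If you found this comment before finding an answer to the "read is blocking" problem, see: https://dev59.com/80fRa4cB1Zd3GeqP82pK - CatShoes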

1

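My approach to crawling a large number of documents is a batch processor that crawls and dumps chunks of a constant size.

Suppose you have to crawl a batch of documents that is known in advance, say 100K documents. You can write some logic that generates chunks of a constant size, say 1000 documents, which are downloaded by a thread pool. Once a whole chunk has been crawled, you can do a bulk insert into your database. Then you move on to the next 1000 documents, and so on.

The advantages of this approach are:

You can use a thread pool to speed up crawling.
It is fault tolerant: you can continue from the chunk where it last failed.
You can generate chunks by priority, for example crawling important documents first. If you cannot complete the whole batch, the important documents are processed first and the less important ones can be picked up on the next run.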
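
The chunking scheme described above could look roughly like the sketch below. It is written for Python 3, and all names (chunked, crawl_in_batches, fetch, store_batch, start_batch) are hypothetical. fetch(url) and store_batch(results) stand in for the real download and bulk-insert code, which this answer does not show.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def crawl_in_batches(urls, fetch, store_batch, batch_size=1000, workers=8, start_batch=0):
    """Download `urls` batch by batch and return the number of batches completed.

    `start_batch` lets a later run skip batches that were already stored.
    """
    completed = start_batch
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for number, batch in enumerate(chunked(urls, batch_size)):
            if number < start_batch:
                continue  # stored during an earlier run
            results = list(pool.map(fetch, batch))  # map() preserves input order
            store_batch(results)  # e.g. one bulk insert per batch
            completed = number + 1
    return completed
```

To crawl important documents first, sort urls by priority before passing it in. An exception raised by fetch aborts the current batch, so a caller that wants to resume would need to record the last batch that store_batch finished.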

- Sushant Gupta

- Constantinius · Accepted Answer

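The urllib2.urlopen function has a timeout parameter that is used for all blocking operations (connection setup and so on).

The following snippet is taken from one of my projects. I use a thread pool to download several files at the same time. It uses urllib.urlretrieve, but the logic is the same. url_and_path_list is a list of (url, path) tuples, num_concurrent is the number of threads to spawn, and skip_existing skips downloading files that already exist on the file system.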

import Queue
import threading
import urllib
from os.path import exists

def download_urls(url_and_path_list, num_concurrent, skip_existing):
    # prepare the queue
    queue = Queue.Queue()
    for url_and_path in url_and_path_list:
        queue.put(url_and_path)

    # start the requested number of download threads to download the files
    threads = []
    for _ in range(num_concurrent):
        t = DownloadThread(queue, skip_existing)
        t.daemon = True
        t.start()

    queue.join()

class DownloadThread(threading.Thread):
    def __init__(self, queue, skip_existing):
        super(DownloadThread, self).__init__()
        self.queue = queue
        self.skip_existing = skip_existing

    def run(self):
        while True:
            #grabs url from queue
            url, path = self.queue.get()

            if self.skip_existing and exists(path):
                # skip if requested
                self.queue.task_done()
                continue

            try:
                urllib.urlretrieve(url, path)
            except IOError:
                print "Error downloading url '%s'." % url
            finally:
                #signals to queue job is done, even if an unexpected error was raised
                self.queue.task_done()
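
One caveat: unlike urlopen, urlretrieve takes no timeout argument, so the downloads above rely on the global default socket timeout. A possible way to set it temporarily is sketched below. The helper name default_socket_timeout is made up for this example. Note that the setting is process-wide and applies to every socket created while it is active, in all threads.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the process-wide default timeout for newly created sockets."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # restore whatever was configured before, including None (no timeout)
        socket.setdefaulttimeout(previous)
```

Under that assumption, the call would be wrapped as `with default_socket_timeout(30): download_urls(url_and_path_list, 4, True)`, where 30 and 4 are arbitrary example values.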