使用gevent和requests异步模块时出现ImportError错误

Question

使用gevent和requests异步模块时出现ImportError错误

4

我正在编写一个简单的脚本，它会执行以下操作：

加载一个大量的URL列表
使用requests' async模块并发进行HTTP请求来获取每个URL的内容
使用lxml解析页面内容以检查页面上是否存在链接
如果页面上存在该链接，则将页面的一些信息保存到ZODB数据库中

当我使用4或5个URL测试脚本时，它可以正常工作。当脚本结束时，只会出现以下消息：

 Exception KeyError: KeyError(45989520,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored

当我尝试检查大约24000个URL时，到列表末尾时（还剩下大约400个URL需要检查），会出现以下错误：

Traceback (most recent call last):
  File "check.py", line 95, in <module>
  File "/home/alex/code/.virtualenvs/linka/local/lib/python2.7/site-packages/requests/async.py", line 83, in map
  File "/home/alex/code/.virtualenvs/linka/local/lib/python2.7/site-packages/gevent-1.0b2-py2.7-linux-x86_64.egg/gevent/greenlet.py", line 405, in joinall
ImportError: No module named queue
Exception KeyError: KeyError(45989520,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored

我尝试了可在pypi上获得的gevent版本，以及从gevent repository下载并安装最新版本（1.0b2）。

我不明白为什么会发生这种情况，而且只有在检查一堆URL时才会发生。有什么建议吗？

以下是整个脚本：

from requests import async, defaults
from lxml import html
from urlparse import urlsplit
from gevent import monkey
from BeautifulSoup import UnicodeDammit
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction
import persistent
import random

storage = FileStorage('Data.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
monkey.patch_all()
defaults.defaults['base_headers']['User-Agent'] = "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
defaults.defaults['max_retries'] = 10


def save_data(source, target, anchor):
    root[source] = persistent.mapping.PersistentMapping(dict(target=target, anchor=anchor))
    transaction.commit()


def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode


def find_link(html_doc, url):
    decoded = decode_html(html_doc)
    doc = html.document_fromstring(decoded.encode('utf-8'))
    for element, attribute, link, pos in doc.iterlinks():
        if attribute == "href" and link.startswith('http'):
            netloc = urlsplit(link).netloc
            if "example.org" in netloc:
                return (url, link, element.text_content().strip())
    else:
        return False


def check(response):
    if response.status_code == 200:
        html_doc = response.content
        result = find_link(html_doc, response.url)
        if result:
            source, target, anchor = result
            # print "Source: %s" % source
            # print "Target: %s" % target
            # print "Anchor: %s" % anchor
            # print
            save_data(source, target, anchor)
    global todo
    todo = todo -1
    print todo

def load_urls(fname):
    with open(fname) as fh:
        urls = set([url.strip() for url in fh.readlines()])
        urls = list(urls)
        random.shuffle(urls)
        return urls

if __name__ == "__main__":

    urls = load_urls('urls.txt')
    rs = []
    todo = len(urls)
    print "Ready to analyze %s pages" % len(urls)
    for url in urls:
        rs.append(async.get(url, hooks=dict(response=check), timeout=10.0))
    responses = async.map(rs, size=100)
    print "DONE."

- raben

你尝试过进行调试以获取更多有关脚本失败时状态的信息吗？它总是相同的URL吗？（捕获异常并记录URL）这是内存问题吗？（查看执行期间的内存使用情况）？ - Jasper van den Bosch

3个回答

0

虽然我是个大菜鸟，但是无论如何，我还是可以尝试一下...！我猜你可以尝试使用这个导入列表：

from requests import async, defaults
import requests
from lxml import html
from urlparse import urlsplit
from gevent import monkey
import gevent
from BeautifulSoup import UnicodeDammit
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction
import persistent
import random

试一下这个，告诉我是否有效.. 我猜这可以解决你的问题 :)

- Carto_

如果尝试了我的解决方案问题仍然存在，也许这个链接可以帮助：http://www.daniweb.com/software-development/python/threads/251918/import-queue-dont-exist - Carto_

谢谢，我会试一试。但是您为什么认为这会解决我的问题呢？ - raben

因为我曾经遇到过同样的问题，这个方法对我有效。虽然我仍然不明白为什么……但无论如何，测试是免费的 :-) - Carto_

0

你好。我认为这是一个带有编号 Issue1596321 的开放式 Python Bug http://bugs.python.org/issue1596321

- Dmitry Zagorulkin

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Denis · Accepted Answer

我不确定你的问题源于何处，但为什么你把 monkey.patch_all() 放在文件顶部之外呢？

你可以尝试把它放在文件顶部。

from gevent import monkey; monkey.patch_all()

在你的主程序顶部尝试添加这个代码，看是否有改善？