Python urllib.urlopen IOError

Question


3

I have the following lines of code in a function:

sock = urllib.urlopen(url)
html = sock.read()
sock.close()

They work fine when I call the function by hand. However, when I call the function in a loop (with the same URLs as before), I get the following error:

Traceback (most recent call last):
  File "./headlines.py", line 256, in <module>
    main(argv[1:])
  File "./headlines.py", line 37, in main
    write_articles(headline, output_folder + "articles_" + term +"/")
  File "./headlines.py", line 232, in write_articles
    print get_blogs(headline, 5)
  File "/Users/michaelnussbaum08/Documents/College/Sophmore_Year/Quarter_2/Innovation/Headlines/_code/get_content.py", line 41, in get_blogs
    sock = urllib.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 203, in open
    return getattr(self, name)(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib.py", line 314, in open_http
    if not host: raise IOError, ('http error', 'no host given')
IOError: [Errno http error] no host given

Any ideas?

Edit: more code:

def get_blogs(term, num_results):
    search_term = term.replace(" ", "+")
    print "search_term: " + search_term
    url = 'http://blogsearch.google.com/blogsearch_feeds?hl=en&q='+search_term+'&ie=utf-8&num=10&output=rss'
    print "url: " +url  

    #error occurs on line below

    sock = urllib.urlopen(url)
    html = sock.read()
    sock.close()

def write_articles(headline, output_folder, num_articles=5):

    #calls get_blogs

    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    output_file = output_folder+headline.strip("\n")+".txt"
    f = open(output_file, 'a')
    articles = get_articles(headline, num_articles)
    blogs = get_blogs(headline, num_articles)


#NEW FUNCTION
#the loop that calls write_articles
for term in trend_list:
    if do_find_max == True:
        fill_search_term(term, output_folder)
    headlines = headline_process(term, output_folder, max_headlines, do_find_max)
    for headline in headlines:
        try:
            write_articles(headline, output_folder + "articles_" + term +"/")
        except UnicodeEncodeError:
            pass

- Michael

3 Answers

1

In the loop that calls your function, try adding a print statement just before the call to urlopen:

print(url)
sock = urllib.urlopen(url)

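That way, when you run your script and get the IOError, you will see the url that is causing the problem. The "no host given" error can be reproduced if url equals something like 'http://'...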
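
One refinement that may help here: a plain print can hide trailing whitespace, so printing repr(url) tends to be more revealing. The following is only a sketch. build_url is a hypothetical helper that mirrors the URL construction in get_blogs, and the trailing newline in the sample term is an assumption made for illustration, not something confirmed by the traceback.

```python
def build_url(term):
    # Hypothetical helper: same concatenation as in get_blogs.
    search_term = term.replace(" ", "+")
    return ('http://blogsearch.google.com/blogsearch_feeds?hl=en&q='
            + search_term + '&ie=utf-8&num=10&output=rss')

# Assumed input: a headline that still carries the newline from a file read.
url = build_url("Iceland Pictures\n")

# repr() shows the newline as a literal \n instead of an invisible line break.
print(repr(url))

# A quick check that flags control whitespace anywhere in the URL.
has_hidden_whitespace = any(ch in url for ch in "\r\n\t")
print(has_hidden_whitespace)
```

If the check reports True for a URL that looks fine when printed normally, the input string is the likely culprit rather than urlopen itself.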

- unutbu

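Yes, I tried that; for example, one url is "http://blogsearch.google.com/blogsearch_feeds?hl=en&q=Iceland+Pictures+Lightning+Adds+Flash&ie=utf-8&num=10&output=rss". They are all Google Blog Search requests with different queries. If I call urlopen directly in the interpreter, or call the whole function that builds the url, it works, but it fails when I call it in the loop. - Michael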

Does it always crash on the same query, or on different ones? Are you behind a web proxy? - MK.

1

If you don't need to handle reading the data block by block yourself, you could use urllib2. It will probably do what you expect.

import urllib2
req = urllib2.Request(url='http://stackoverflow.com/')
f = urllib2.urlopen(req)
print f.read()

- Eddy Pronk

5

Good idea, but no luck: I got "urllib2.URLError: <urlopen error no host given>". Both errors say "no host given", but I don't know why... - Michael


- user1994702 · Accepted Answer

I ran into this problem when concatenating a variable into the URL, which in your case is search_term:

url = 'http://blogsearch.google.com/blogsearch_feeds?hl=en&q='+search_term+'&ie=utf-8&num=10&output=rss'

In my case the variable had a newline character at the end, so make sure you strip it:

search_term = search_term.strip()

You may also want to do

search_term = urllib2.quote(search_term)

to make sure your string is safe to use in a URL.
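
The two steps can be combined into one small helper. This is a sketch under assumptions, not the asker's actual code: build_search_url is a hypothetical name, and it swaps in quote_plus (rather than the quote call shown above) because quote_plus also turns spaces into "+", which replaces the manual replace(" ", "+"). The import fallback is there only so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib import quote_plus          # Python 2
except ImportError:
    from urllib.parse import quote_plus    # Python 3

def build_search_url(term):
    # Hypothetical helper: strip surrounding whitespace (including a
    # trailing newline), then percent-encode the term for a query string.
    search_term = quote_plus(term.strip())
    return ('http://blogsearch.google.com/blogsearch_feeds?hl=en&q='
            + search_term + '&ie=utf-8&num=10&output=rss')

# Assumed input: a headline read from a file, newline still attached.
url = build_search_url("Iceland Pictures Lightning Adds Flash\n")
print(url)
```

Note that write_articles in the question strips the newline only when building the file name; the headline passed on to get_blogs is unchanged, so cleaning the term where the URL is built covers every caller.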