UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when using urllib.request in Python 3

5
I'm writing a script that visits a list of links and parses information from them. It works fine for most sites, but on some it fails with "UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)". It stops in client.py, inside Python 3's urllib package. The specific link is: http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html. There are many similar posts here, but none of the answers worked for me. My code is:
import socket
from urllib import request
from urllib.error import HTTPError, URLError

def __request(link, debug=0):

    try:
        # long timeout because I was getting lots of timeouts
        html = request.urlopen(link, timeout=35).read()
        unicode_html = html.decode('utf-8', 'ignore')

    # NOTE: the except HTTPError must come first, otherwise
    # except URLError will also catch an HTTPError.
    except HTTPError as e:
        if debug:
            print("The server couldn't fulfill the request for " + link)
            print('Error code: ', e.code)
        return ''
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('timeout')
            return ''
    else:
        return unicode_html

It is called like this:

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
page = __request(link)

The traceback is:

Traceback (most recent call last):
  File "<string>", line 250, in run_nodebug
  File "C:\reader\get_news.py", line 276, in <module>
    main()
  File "C:\reader\get_news.py", line 255, in main
    body = get_article_body(item['link'],debug=0)
  File "C:\reader\get_news.py", line 155, in get_article_body
    page = __request('na',url)
  File "C:\reader\get_news.py", line 50, in __request
    html = request.urlopen(link, timeout=35).read()
  File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\Lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "C:\Python33\Lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "C:\Python33\Lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\Lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python33\Lib\urllib\request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python33\Lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\Lib\http\client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python33\Lib\http\client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)

Any help would be much appreciated; this has been bugging me. I think I've tried every combination of x.decode and the like.

(I could just ignore the offending characters, if that is possible.)


2
Use Kenneth Reitz's requests library. I strongly recommend it. It will make all of this code much simpler and will almost certainly fix this problem. - JackGibbs
@JackGibbs: requests does indeed handle URLs with non-ASCII characters in them, by explicitly re-quoting the URL. - Martijn Pieters
2 Answers

5

Use the percent-encoded URL:

link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'

I found the percent-encoded URL above by pointing my browser at
http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

going to the page, then copying and pasting the encoded URL the browser produced back into a text editor. However, you can generate the percent-encoded URL programmatically with:

from urllib import parse

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'

scheme, netloc, path, query, fragment = parse.urlsplit(link)
path = parse.quote(path)
link = parse.urlunsplit((scheme, netloc, path, query, fragment))

This produces:

http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html

If it is part of the URL path, then use parse.quote() rather than parse.quote_plus() (which is for x-www-form-urlencoded data). - jfs
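The distinction jfs draws can be seen directly in the interpreter; the path string below is just an illustration, not from the question:

```python
from urllib import parse

path = '/news/cafés growing'

# quote() is meant for the path component: '/' is kept as a safe
# character and spaces become %20
print(parse.quote(path))       # /news/caf%C3%A9s%20growing

# quote_plus() is meant for x-www-form-urlencoded query values:
# '/' is escaped too, and spaces become '+'
print(parse.quote_plus(path))  # %2Fnews%2Fcaf%C3%A9s+growing
```

In both cases the non-ASCII 'é' is first encoded to its UTF-8 bytes and then percent-escaped as %C3%A9.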
Thanks, that worked. I wasn't sure whether it would affect other parts of the URL, so I split it up and then rebuilt it: url_tuple = parse.urlsplit(link); encoded_link = "%s://%s%s?%s%s" % (url_tuple[0], url_tuple[1], parse.quote(url_tuple[2]), url_tuple[3], parse.quote(url_tuple[4])) - kender99
1
Glad you got it working. But use parse.urlunsplit to rebuild the URL; that's what it is for. - unutbu

3

Your URL contains characters that cannot be represented in ASCII.

You need to make sure all characters are properly URL-encoded first; for example with urllib.parse.quote_plus, which will represent any non-ASCII characters with UTF-8 URL-encoding escapes.
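A minimal sketch of what that encoding does to the offending character from the question's URL:

```python
from urllib import parse

# 'é' (U+00E9) is encoded to its two UTF-8 bytes, 0xC3 0xA9, and each
# byte is then percent-escaped, giving an ASCII-only string that
# urllib.request can send over the wire.
print(parse.quote_plus('cafés'))  # caf%C3%A9s
```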

