I am trying to download an HTML file like this:
import urllib
req = urllib.urlopen("http://www.stream-urls.de/webradio")
html = req.read()
print html
html = html.decode('utf-16')
print html
Since the output after req.read() looks like Unicode-encoded text, I tried to convert it, but got the following error:
Traceback (most recent call last):
  File "e:\Documents\Python\main.py", line 8, in <module>
    html = html.decode('utf-16')
  File "E:\Software\Python2.7\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 38-39: illegal UTF-16 surrogate
What do I need to do to get the correct encoding?
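urllib and HTML have nothing to do with this. It is purely a character-encoding problem, so you may want to rephrase and minimize the question to focus on that issue alone. - barak manos
charset=utf-8. Why are you decoding with utf-16? - Alex K.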
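Following up on the charset=utf-8 hint, here is a minimal sketch of decoding with the charset the response declares rather than guessing utf-16. The helper names charset_from_content_type and decode_body, the utf-8 fallback, and the sample bytes and header are all assumptions made for illustration; they are not taken from the actual response of the URL in the question.

```python
# -*- coding: utf-8 -*-
import re


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or a default.

    Hypothetical helper for illustration; the utf-8 fallback is an assumption.
    """
    match = re.search(r'charset=\s*["\']?([\w.:-]+)', content_type or "", re.I)
    return match.group(1).lower() if match else default


def decode_body(raw, content_type):
    """Decode the byte string returned by read() into Unicode text."""
    return raw.decode(charset_from_content_type(content_type))


# Stand-ins for req.read() and the Content-Type header (not real server output).
# In Python 2 the header value would likely come from
# req.info().getheader('Content-Type').
sample_raw = u"<title>Webradio \u00fcbersicht</title>".encode("utf-8")
sample_header = "text/html; charset=utf-8"

text = decode_body(sample_raw, sample_header)
```

With the code from the question, this would amount to replacing html.decode('utf-16') with a decode that uses the declared charset. If the header carries no charset, the sketch falls back to utf-8, which may not be right for every page.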