UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

Question

15

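I'm trying to write a "scraper", but I've run into an encoding problem. When I try to copy the string I'm looking for into my text file, Python 2.7 tells me it doesn't recognize the encoding, even though there are no special characters. I don't know whether that detail is useful.

My code looks like this: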

from urllib import FancyURLopener
import os

class MyOpener(FancyURLopener): #spoofs a real browser on Window
   version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

print "What is the webaddress?"
webaddress = raw_input("8::>")

print "Folder Name?"
foldername = raw_input("8::>")

if not os.path.exists(foldername):
    os.makedirs(foldername)

def urlpuller(start, page):
   while page[start]!= '"':
      start += 1
   close = start
   while page[close]!='"':
      close += 1
   return page[start:close]

myopener = MyOpener()

response = myopener.open(webaddress)
site = response.read()

nexturl = ''
counter = 0

while(nexturl!=webaddress):
   counter += 1
   start = 0
   
   for i in range(len(site)-35):
       if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
         start = i + 40
         break
   else:
      print "Something's broken, chief. Error = 1"
   
   next = 0
   
   for i in range(start, 8, -1):
      if site[i:i+8] == u'<a href=':
         next = i
         break
   else:
      print "Something's broken, chief. Error = 2"
   
   nexturl = urlpuller(next, site)
   
   myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')

print("Retrieval of "+foldername+" completed.")

When I run it against the site I'm using, it returns this error:

Traceback (most recent call last):
  File "yada/yadayada/Python/scraper.py", line 37, in <module>
    if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

When pointed at http://google.com, it works fine.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

However, when I try to decode using utf-8, as you can see, it doesn't work. Do you have any suggestions?

- user3701032

Use an HTML parser such as Beautiful Soup. Reading and decoding are already built in. - Daniel

@Daniel, I've read the documentation, but once I "open" the site, it isn't clear to me how to decode it. - user3701032

4 Answers

3

Open the CSV file in Sublime, then choose "Save with Encoding" -> UTF-8.

- ssareen

1

site[i:i+35].decode('utf-8', errors='ignore')

- Xiaobing Mi

Is ignoring the errors really better? Could you explain? - General Grievance
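
To make the trade-off concrete, here is a minimal sketch of what errors='ignore' does to a slice that ends in the middle of a character. It is written in Python 3 syntax (an assumption; the question uses Python 2.7, where str.decode behaves the same way on byte strings), and the sample bytes are made up for illustration.

```python
# Made-up sample: 34 ASCII bytes followed by "é", which UTF-8 encodes as 0xC3 0xA9.
data = ("a" * 34 + "\u00e9").encode("utf-8")   # 36 bytes in total

# A 35-byte window ends right after 0xC3, i.e. halfway through "é".
window = data[0:35]

# errors="ignore" suppresses the exception by silently dropping the dangling byte.
ignored = window.decode("utf-8", errors="ignore")

print(len(window), len(ignored))
```

The exception goes away, but so does the partial character: the decoded string is one character shorter than the byte window. For a pure equality test against an ASCII marker that may be acceptable, yet the same call would silently lose data if the decoded text were kept.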

0

Instead of using a for loop, try the following:

start = site.decode('utf-8').find('<img id="imgSized" class="slideImg"') + 40

- Daniel

Does this solve the encoding problem? - user3701032
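
A hedged sketch of why decoding once can help: the decoder sees the whole document, so no character is ever cut in half. The helper name find_marker and the sample bytes are hypothetical, the code uses Python 3 syntax, and it returns the position just past the marker (rather than the fixed + 40 offset above) while also checking for the -1 that find returns when nothing matches.

```python
MARKER = '<img id="imgSized" class="slideImg"'

def find_marker(raw_bytes, encoding="utf-8"):
    """Return the character index just past MARKER, or None if it is absent."""
    text = raw_bytes.decode(encoding)      # decode the whole page exactly once
    index = text.find(MARKER)
    if index == -1:                        # find() signals "not found" with -1
        return None
    return index + len(MARKER)

# Hypothetical page fragment containing a non-ASCII character before the marker.
sample = ("<p>caf\u00e9</p>" + MARKER + ' src="http://example.com/1.jpg">').encode("utf-8")

position = find_marker(sample)
missing = find_marker(b"<html></html>")
print(position, missing)
```

Note that the returned index counts characters in the decoded text, not bytes in the raw response, so it should be used to slice the decoded string.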

- Martin Konecny · Accepted Answer

site[i:i+35].decode('utf-8')

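You can't split the received bytes at arbitrary positions and then ask UTF-8 to decode each piece. UTF-8 is a multi-byte encoding: a single character takes 1 to 4 bytes (the original design allowed up to 6). If a slice ends partway through a character, as it does here where the 35-byte window ends on 0xc3, the lead byte of a two-byte sequence, Python raises the unexpected end of data error.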
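
The failure can be reproduced in isolation. This is a minimal sketch in Python 3 syntax with made-up data (34 ASCII bytes followed by "é"), not the asker's actual page:

```python
# "é" (U+00E9) is encoded in UTF-8 as two bytes: 0xC3 0xA9.
data = ("a" * 34 + "\u00e9" + "tail").encode("utf-8")

message = None
error_position = None
try:
    # Bytes 0..34: the slice stops after 0xC3 and before 0xA9.
    data[0:35].decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
    error_position = exc.start   # index of the byte that could not be decoded

# One byte more and the character is complete, so decoding succeeds.
complete = data[0:36].decode("utf-8")

print(message)
print(error_position)
```

The only difference between the failing and the succeeding call is where the slice ends relative to the two-byte character.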

Look for a tool that has this built in. BeautifulSoup and lxml are two options.
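
As a rough illustration of the parser-based approach, here is a sketch that swaps in the standard library's html.parser (Python 3) for BeautifulSoup or lxml so that it runs without extra installs. The class name SlideImageFinder and the sample markup are hypothetical; only the id "imgSized" comes from the question.

```python
from html.parser import HTMLParser

class SlideImageFinder(HTMLParser):
    """Collect the src of every <img> tag whose id is "imgSized"."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)           # attrs is a list of (name, value) pairs
        if attributes.get("id") == "imgSized":
            self.sources.append(attributes.get("src"))

# Hypothetical response body, as bytes, with a non-ASCII character in the URL.
raw = ('<a href="next.html"><img id="imgSized" class="slideImg" '
       'src="http://example.com/caf\u00e9.jpg"></a>').encode("utf-8")

finder = SlideImageFinder()
finder.feed(raw.decode("utf-8"))           # decode the whole document once, then parse
print(finder.sources)
```

The parser works on tags and attributes instead of fixed byte offsets, so it does not depend on where multi-byte characters fall in the response.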