How do I open an HTML file that contains Unicode characters?

Question

49

I have an HTML file named test.html that contains only a single word, בדיקה.

I open test.html and print its contents with the following code:

file = open("test.html", "r")
print file.read()

But it prints ??????. Why does this happen, and how can I fix it?

By the way, it works fine when I open a text file.

Edit: I have already tried this:

>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????

- david

3

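Read up on Unicode and UTF-8. Unicode is a standard that assigns a number (a code point) to the characters and symbols of almost every language. UTF-8 is one way of encoding those code points as byte sequences; it covers every Unicode character and is well suited to transmitting and storing text on the Internet. - vks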
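
To make the distinction in the comment above concrete, here is a minimal Python 3 sketch. It takes the word from the question and compares its bytes under UTF-8 and under Windows-1255 (Python codec name cp1255). The variable names are illustrative only and do not come from the thread.

```python
# The word from the question, written with escapes so the source stays ASCII.
word = "\u05d1\u05d3\u05d9\u05e7\u05d4"  # בדיקה

# The same five code points become different byte sequences per encoding.
utf8_bytes = word.encode("utf-8")      # two bytes per Hebrew letter
cp1255_bytes = word.encode("cp1255")   # one byte per Hebrew letter

print(len(word), len(utf8_bytes), len(cp1255_bytes))

# Decoding only round-trips when the matching codec is used.
assert utf8_bytes.decode("utf-8") == word
assert cp1255_bytes.decode("cp1255") == word
```

The text is the same in both cases; only the bytes on disk differ, which is why a reader has to know which encoding a file was saved in.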

If it still doesn't work, please post the page you are trying to process. - wenzul

8 Answers

26

I ran into this problem today as well. I am on Windows and the system default language is Chinese, so other people may hit this Unicode error too. Just add encoding='utf-8':

with open("test.html", "r", encoding='utf-8') as f:
    text= f.read()

- Chen Mier
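
A self-contained way to try the approach from this answer is sketched below in Python 3. It assumes a throwaway temporary file standing in for test.html; the path and variable names are hypothetical.

```python
import os
import tempfile

word = "\u05d1\u05d3\u05d9\u05e7\u05d4"  # בדיקה

# Create a throwaway UTF-8 file standing in for test.html (hypothetical path).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "test.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(word)

# Passing encoding explicitly avoids depending on the platform default,
# which differs between systems and locales.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(text == word)
```

This only helps when the file really is UTF-8; a file saved in another encoding needs that encoding's name instead.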

16

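You can use the following code:

from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup

# Decode the file as UTF-8 while reading, then strip the markup.
f = codecs.open("test.html", 'r', 'utf-8')
document = BeautifulSoup(f.read(), "html.parser").get_text()
print(document)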

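If you also want to drop all empty lines and collect the words into a single string (while skipping special characters and digits), add the following as well: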

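import re
import nltk
from nltk.tokenize import word_tokenize

# `stop` is not defined in the original answer; it is assumed here to be a
# collection of stop words, e.g. set(nltk.corpus.stopwords.words('english')).
st = ""
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    if line:
        # Keep purely alphabetic ASCII tokens only.
        if re.match("^[A-Za-z]*$", line):
            if line not in stop and len(line) > 1:
                st = st + " " + line
print(st)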

Here st is a string that must be initialized to the empty string first, e.g. st = "".

- Dibin Joseph

8

You can use 'urllib' to read the HTML page.

# Python 2.x
import urllib

page = urllib.urlopen("your path ").read()
print page

- Benjamin

How can I operate on "page"? For example, reading specific words from it. Can I use "page" like a string? - Sooraj
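
On the question in the comment above, a minimal Python 3 sketch follows. It assumes the page bytes are UTF-8 and uses a hard-coded bytes value as a stand-in for what a urlopen(...).read() call returns, so no network access is needed. The sample markup and names are made up for illustration.

```python
word = "\u05d1\u05d3\u05d9\u05e7\u05d4"  # בדיקה

# Stand-in for the bytes returned by reading a page in Python 3.
page = ("<html><body>" + word + " test</body></html>").encode("utf-8")

# Decode once (UTF-8 is an assumption here), then use ordinary str methods.
html = page.decode("utf-8")
has_word = word in html

# Slice out the text between the body tags and split it into words.
start = html.find("<body>") + len("<body>")
end = html.find("</body>")
body_words = html[start:end].split()

print(has_word, len(body_words))
```

For anything beyond simple searches, an HTML parser such as the BeautifulSoup approach shown in another answer is usually more robust than string slicing.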

6

Use codecs.open with the encoding argument.

import codecs
f = codecs.open("test.html", 'r', 'utf-8')

- wenzul

1

Code:

import codecs

path="D:\\Users\\html\\abc.html" 
file=codecs.open(path,"rb")
file1=file.read()
file1=str(file1)

- SHUBHAM SINGH

0

You can simply use this:

import requests

requests.get(url)

- Ayemun Hossain Ashik

-2

In Python 3 you can use 'urllib' just as in https://dev59.com/P14d5IYBdhLWcg3wDe71#27243244, with only minor changes.

#python3

import urllib

page = urllib.request.urlopen("/path/").read()
print(page)

- Suresh2692

AttributeError: 'module' object has no attribute 'request' - tommy.carstensen

@tommy.carstensen Maybe you should take a look at this: urllib python3. - Suresh2692

1

Thanks. I am very familiar with that documentation. The indentation is wrong, and it should be import urllib.request. - tommy.carstensen

- vks · Accepted Answer

62

import codecs
f=codecs.open("test.html", 'r')
print f.read()

Try it like this.

- vks

2

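I also tried codecs.open("test.html",'r','utf-8'), but when I print f.read() I get a Unicode decode error! - david

I am using the terminal!! - david

I got this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 0: invalid continuation byte. - david

import sys; print(sys.stdout.encoding) prints UTF-8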
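
The error in the comment above can be reproduced without the original file. The Python 3 sketch below assumes the file holds the word from the question saved as Windows-1255, which is what the asker reports further down in this thread. The variable names are illustrative.

```python
word = "\u05d1\u05d3\u05d9\u05e7\u05d4"  # בדיקה

# Assumed on-disk bytes of test.html: the word encoded as Windows-1255.
raw = word.encode("cp1255")

# Decoding those bytes as UTF-8 fails on the very first byte, because 0xe1
# announces a multi-byte UTF-8 sequence that the following byte does not continue.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The first byte is 0xe1 because that is how Windows-1255 stores the letter ב, so the message points at the encoding mismatch and not at damaged data.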

- david

The file's encoding is not UTF-8 but Windows-1255! - david
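
Given that finding, one way to read such a file in Python 3 is sketched below. It assumes the file really is Windows-1255 (Python codec name cp1255) and uses a temporary stand-in for test.html; the path and names are hypothetical.

```python
import os
import tempfile

word = "\u05d1\u05d3\u05d9\u05e7\u05d4"  # בדיקה

# Write a stand-in test.html encoded as Windows-1255 (hypothetical temp path).
path = os.path.join(tempfile.mkdtemp(), "test.html")
with open(path, "wb") as f:
    f.write(word.encode("cp1255"))

# Name the file's actual encoding when opening it.
with open(path, "r", encoding="cp1255") as f:
    text = f.read()

print(text == word)
```

Whether the decoded text then displays correctly still depends on the terminal being able to show Hebrew characters.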