Reading an HTML file from a folder in Python

Question

6

I want to be able to read an HTML file in Python 3.4.3.

I tried the following:

    import urllib.request
    fname = r"C:\Python34\html.htm"
    HtmlFile = open(fname,'w')
    print (HtmlFile)

This prints:

    <_io.TextIOWrapper name='C:\\Python34\\html.htm' mode='w' encoding='cp1252'>

I want to get the HTML source so that I can parse it with Beautiful Soup.

- BLACKMAMBA

2

If you want to read the file, you shouldn't open it in write mode ;) open(fname, 'w') => open(fname, 'r'). - m02ph3u5
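To illustrate the comment above: opening an existing file with 'w' does more than return a write-only handle, it also empties the file on the spot. The sketch below shows this on a throwaway file; the temporary directory and the name `demo.htm` are made up for the demonstration and are not from the question.

```python
import os
import tempfile

# Create a throwaway file so the demonstration does not touch real data.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.htm")  # hypothetical file name
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("<html><body>hello</body></html>")

# Opening in 'w' mode truncates the file immediately, even if nothing is written.
with open(demo_path, "w", encoding="utf-8"):
    pass
size_after_w = os.path.getsize(demo_path)  # the previous content is gone

# Restore the content, then open in 'r' mode: the file is left intact.
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("<html><body>hello</body></html>")
with open(demo_path, "r", encoding="utf-8") as f:
    content = f.read()
```

So if the original html.htm was opened with 'w' as in the question, it is probably empty now and may need to be saved again before reading.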

2 Answers

1

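I wanted to read an HTML file saved in a folder. I tried the code that Vikas posted, but got an error. So I changed the code and tried again, and this worked for me. The code is as follows: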

    fname = 'page_source.html'  # this HTML file is stored in the same folder as the script
    with open(fname, 'r') as html_file:  # the with block closes the file afterwards
        source_code = html_file.read()

Print the HTML page using:

    print(source_code)

This prints the content read from the page_source.html file.

- Yogesh Awdhut Gadade
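Since the title asks about reading from a folder, here is a minimal sketch that reads every HTML file in a directory rather than a single one. The helper name `read_html_files` and the choice of `.html`/`.htm` extensions are assumptions for illustration; it uses `pathlib` from the standard library, and UTF-8 is assumed unless another encoding is passed in.

```python
from pathlib import Path


def read_html_files(folder, encoding="utf-8"):
    """Return a dict mapping file name to HTML source for each .html/.htm file in folder."""
    sources = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip subdirectories and anything that is not an HTML file.
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            with path.open("r", encoding=encoding) as html_file:
                sources[path.name] = html_file.read()
    return sources
```

For example, `read_html_files(".")` would return the source of each HTML file next to the script, keyed by file name.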

- Vikas Ojha · Accepted Answer

13

You need to read the contents of the file.

    HtmlFile = open(fname, 'r', encoding='utf-8')
    source_code = HtmlFile.read()

- Vikas Ojha

I got the following error on the line above: File "C:/Python34/pretty.py", line 4, in <module> source_code = HtmlFile.read() File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4411: character maps to <undefined> - BLACKMAMBA

1

Specify the encoding when opening the file: HtmlFile = open(fname, 'r', encoding='utf-8') - Vikas Ojha
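Some background on the error in this comment thread: the traceback goes through cp1252.py, so the file was being decoded with the Windows default code page, in which byte 0x81 has no mapping. Passing `encoding='utf-8'` helps when the file really is UTF-8. If the encoding is unknown, one possible approach is to try a few candidates in order. The helper `read_html_text` and its list of candidate encodings below are assumptions for illustration, not something from the answer.

```python
def read_html_text(fname, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn; as a last resort, decode as UTF-8 with replacement characters."""
    for enc in encodings:
        try:
            with open(fname, "r", encoding=enc) as html_file:
                return html_file.read()
        except UnicodeDecodeError:
            # This encoding cannot decode the file; try the next candidate.
            continue
    # Nothing decoded cleanly: keep what is readable and mark undecodable bytes with U+FFFD.
    with open(fname, "r", encoding="utf-8", errors="replace") as html_file:
        return html_file.read()
```

Keep in mind that a fallback like cp1252 accepts almost any byte sequence, so a successful decode does not prove the guess was right. If the page declares its charset in a meta tag, using that is likely more reliable.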

2

Remember to close the file when you're done: HtmlFile.close() - dKen
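Following up on the point about closing: a `with` block closes the file automatically, even if `read()` raises, so the explicit `close()` call is no longer needed. A minimal sketch, assuming the file is UTF-8 (the function name `read_html` is made up for illustration):

```python
def read_html(fname, encoding="utf-8"):
    """Return the full HTML source of fname as a string."""
    with open(fname, "r", encoding=encoding) as html_file:
        source_code = html_file.read()
    # The with block has already closed the file at this point.
    return source_code
```

The returned string can then be handed to a parser. With Beautiful Soup installed, that would be something like `BeautifulSoup(read_html(fname), "html.parser")`.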