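I have downloaded a web page as an HTML file. What is the simplest way to get the content of that page? By content, I mean the strings that the browser displays.
To be clear:
Input: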
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
Putting it together:
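# BeautifulSoup 3 import; with the newer bs4 package this is `from bs4 import BeautifulSoup`.
from BeautifulSoup import BeautifulSoup
import re

def removeHtmlTags(page):
    # Regex approach: strip anything that looks like a tag, allowing for quoted attribute values.
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parser approach: collect every text node in the document and concatenate them.
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))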
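
If installing BeautifulSoup is not an option, the parser approach above can probably be approximated with only the Python 3 standard library. The following is a minimal sketch, not a drop-in replacement: `TextExtractor` and `extract_visible_text` are names made up for this illustration, and skipping `<script>` and `<style>` contents and collapsing whitespace are assumptions about what "what the browser displays" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, ignoring <script>/<style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks as-is (so "<b>one</b>." stays "one."),
        # then collapse every run of whitespace into a single space.
        return " ".join("".join(self._chunks).split())


def extract_visible_text(page):
    """Return the text of an HTML string with tags removed (assumed behaviour)."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return parser.text()
```

Called on the input from the question, `extract_visible_text` should give the single line shown under "Output". One trade-off: because chunks are joined without a separator, adjacent block elements with no whitespace between them (for example `<p>a</p><p>b</p>`) run together as `ab`.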