<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>
预期输出:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
我尝试使用html2text
模块,但并没有取得太大的成效:
#!/usr/bin/env python
import urllib2
import html2text
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())
txt = soup.find('div', {'class' : 'body'})
print(html2text.html2text(txt))
我想将txt
对象转换为文本,并在屏幕上打印出来。
lynx -dump filename.html
来完成此操作。http://lynx.browser.org/ 另外,您还可以使用XPath表达式和http://www.w3.org/Tools/HTML-XML-utils/。 - Dave Jarvis