BeautifulSoup (bs4) parsing error

Question

I'm parsing this sample document with bs4 under Python 2.7.6:

<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>

using:

from bs4 import BeautifulSoup as BS
...
tree = BS(fh)

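HTML has, for ages, allowed omitting the end-tags of various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it doesn't end any of those paragraphs until it reaches </body>: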

<html>
 <body>
  <p>
   HTML allows omitting P end-tags.
   <p>
    Like that and this.
    <p>
     And this, too.
     <p>
      What happened?
     </p>
     <p>
      And can we
      <p>
       nest a paragraph, too?
      </p>
     </p>
    </p>
   </p>
  </p>
 </body>

It's not just prettify(), because walking the tree manually gives the same structure:

<[document]>
    <html>
        ␊
        <body>
            ␊
            <p>
                HTML allows omitting P end-tags.␊␊
                <p>
                    Like that and this.␊␊
                    <p>
                        And this, too.␊␊
                        <p>
                            What happened?
                        </p>
                        ␊
                        <p>
                            And can we 
                            <p>
                                nest a paragraph, too?
                            </p>
                        </p>
                        ␊
                    </p>
                </p>
            </p>
        </body>
        ␊
    </html>
    ␊
</[document]>

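Now, this would be the right result for XML (at least up to </body>, at which point it should report a well-formedness error). But this isn't XML. What gives?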

- TextGeek

1 Answer

- TextGeek · Accepted Answer

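The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser explains how to get BS4 to use a different parser. Apparently the default is html.parser, which the BS4 doc says is broken before Python 2.7.3, but it apparently still has the problem described above in 2.7.6.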
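As a rough illustration of where the nesting may come from: Python's built-in HTMLParser only reports the tags that literally appear in the source and does not infer omitted end-tags, so a tree builder on top of it has to close paragraphs itself. The sketch below uses only the standard library; TagEventLogger and tag_events are hypothetical names made up for this example, not part of bs4.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class TagEventLogger(HTMLParser):
    """Hypothetical helper: records tag events exactly as the parser reports them."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

def tag_events(html):
    """Feed html to the logger and return the list of (kind, tag) events."""
    logger = TagEventLogger()
    logger.feed(html)
    logger.close()
    return logger.events

# Three paragraphs, only the last one has an explicit end-tag.
events = tag_events("<p>one\n<p>two\n<p>three</p>")
print(events)
```

Only one ("end", "p") event shows up for the three start events, which is consistent with the nested tree in the question: nothing at this level says the first two paragraphs have ended.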

Switching to "lxml" didn't work for me, but switching to "html5lib" produces the correct result:

tree = BS(htmSource, "html5lib")
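To check which parsers behave this way on a given install, a small comparison along these lines might help. It is only a sketch: SAMPLE and max_p_depth are names invented here, lxml and html5lib are separate packages that have to be installed, and the numbers you get may differ between bs4 and Python versions. A depth of 1 means no paragraph ended up inside another one.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: paragraphs with omitted end-tags, as in the question.
SAMPLE = "<html><body><p>one\n<p>two\n<p>three</p></body></html>"

def max_p_depth(html, parser):
    """Return the deepest <p>-inside-<p> nesting the given parser produces."""
    soup = BeautifulSoup(html, parser)
    # For each <p>, count itself plus every <p> ancestor above it.
    depths = [1 + len(p.find_parents("p")) for p in soup.find_all("p")]
    return max(depths) if depths else 0

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        print(parser, max_p_depth(SAMPLE, parser))
    except FeatureNotFound:
        # bs4 raises this when the requested parser library is missing.
        print(parser, "not installed")
```

Passing the parser name explicitly, as in the line above, also keeps the result from depending on whichever parser bs4 happens to pick on a particular machine.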