What is the difference between encode_contents and encode("utf-8") in Python BeautifulSoup?

Question

python beautifulsoup encode

4

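OK, so as a beginner web scraper, I feel like I have seen both used, seemingly interchangeably, when converting HTML text from its default Unicode form. I know .contents is a list object, but beyond that, what is the difference?

I have noticed that .encode("utf-8") seems to work more universally.

Thanks,

-confused soup.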

- SpicyClubSauce

2 Answers

1

The method signature of encode_contents() shows that, in addition to encoding the contents, it can also format the output:

encode_contents(self, indent_level=None, encoding='utf-8', formatter='minimal') method of bs4.BeautifulSoup instance
    Renders the contents of this tag as a bytestring.

For example:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><p>Caf\xe9</p></body></html>')
>>> soup.encode_contents()
'<html><body><p>Caf\xc3\xa9</p></body></html>'
>>> soup.encode_contents(indent_level=1)
'<html>\n <body>\n  <p>\n   Caf\xc3\xa9\n  </p>\n </body>\n</html>'
>>> soup.encode_contents(indent_level=1, encoding='iso-8859-1')
'<html>\n <body>\n  <p>\n   Caf\xe9\n  </p>\n </body>\n</html>'

str.encode('utf-8') only performs the encoding step; it does no formatting.
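A minimal Python 3 sketch of that point, assuming the built-in "html.parser" backend (the variable names are only illustrative). It shows encode_contents() taking formatting and encoding options, while str.encode() just turns an already-rendered string into bytes:

```python
from bs4 import BeautifulSoup

# Assumption: Python 3 and the built-in "html.parser" backend.
soup = BeautifulSoup("<html><body><p>Caf\u00e9</p></body></html>", "html.parser")

compact = soup.encode_contents()                      # UTF-8 bytes, no reformatting
pretty = soup.encode_contents(indent_level=1)         # pretty-printed bytes
latin1 = soup.encode_contents(encoding="iso-8859-1")  # a different target encoding

# Plain str.encode() only converts the rendered string to bytes.
plain = str(soup).encode("utf-8")

print(compact)
print(pretty)
print(latin1)
```

Under Python 3 the results are bytes objects, so they print with a b'...' prefix rather than as the plain strings shown in the Python 2 session above.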

- mhawke

Hmm, so is .encode_contents() with no arguments the same as .encode('utf-8')? - SpicyClubSauce

@SpicyClubSauce: Yes. - mhawke

@SpicyClubSauce: Actually, I was wrong. I thought you meant str.encode(), not soup.encode(). - mhawke
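To illustrate the str.encode() versus soup.encode() distinction raised in that last comment, here is a small Python 3 sketch, assuming the "html.parser" backend and illustrative variable names. For UTF-8 both routes yield the same bytes. They differ when the target encoding cannot represent a character, because the encode() method on a tag defaults to errors='xmlcharrefreplace':

```python
from bs4 import BeautifulSoup

# Assumption: Python 3 and the built-in "html.parser" backend.
soup = BeautifulSoup("<p>Caf\u00e9</p>", "html.parser")
p = soup.p

utf8_via_tag = p.encode("utf-8")       # Beautiful Soup's encode() on a tag
utf8_via_str = str(p).encode("utf-8")  # render to str, then plain str.encode()

# ASCII cannot represent the accented character: the tag method substitutes a
# numeric character reference, while plain str.encode() raises by default.
ascii_via_tag = p.encode("ascii")
try:
    ascii_via_str = str(p).encode("ascii")
except UnicodeEncodeError:
    ascii_via_str = None

print(utf8_via_tag == utf8_via_str)
print(ascii_via_tag)
print(ascii_via_str)
```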

- salmanwahed · Accepted Answer

The documentation for encode_contents:

encode_contents(self, indent_level=None, encoding='utf-8', formatter='minimal') method of bs4.BeautifulSoup instance
    Renders the contents of this tag as a bytestring.

The documentation for the encode method:

encode(self, encoding='utf-8', indent_level=None, formatter='minimal', errors='xmlcharrefreplace')

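The encode method operates on the bs4.BeautifulSoup (or tag) object itself, so its output includes the tag's own markup. The encode_contents method operates only on the contents of that object, that is, its children.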

>>> from bs4 import BeautifulSoup
>>> html = "<div>div content <p> a paragraph </p></div>"
>>> soup = BeautifulSoup(html, "html.parser")
>>> soup.div.encode()
'<div>div content <p> a paragraph </p></div>'
>>> soup.div.contents
[u'div content ', <p> a paragraph </p>]
>>> soup.div.encode_contents()
'div content <p> a paragraph </p>'
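The same comparison as a standalone Python 3 script, as a rough sketch assuming the "html.parser" backend (the names whole_tag and inner_only are only illustrative). Under Python 3 both methods return bytes:

```python
from bs4 import BeautifulSoup

# Assumption: Python 3 and the built-in "html.parser" backend.
html = "<div>div content <p> a paragraph </p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

whole_tag = div.encode()            # includes the surrounding <div>...</div>
inner_only = div.encode_contents()  # only the children of the <div>

print(whole_tag)
print(inner_only)
```

So encode() is the choice when the tag's own markup is wanted, and encode_contents() when only what is inside the tag is wanted.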