BeautifulSoup innerHTML?

Question

python html beautifulsoup innerhtml

78

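Let's say I have a page with a div. I can easily get that div with soup.find().

Now that I have the result, I'd like to print the whole innerhtml of that div: I need a string with all the HTML tags and text together, exactly like the string I would get in JavaScript with obj.innerHTML. Is this possible?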

- Matteo Monti

8 Answers

17

One option is to use something like this:

 innerhtml = "".join([str(x) for x in div_element.contents])

- peewhy

2

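There are a couple of other problems with this. First, it doesn't escape HTML entities (such as greater-than and less-than signs) in string elements. Second, it writes out the content of comments but not the comment tags themselves. - ChrisD

Another reason not to use this, in addition to @ChrisD's comments: it throws a UnicodeDecodeError on content that contains non-ASCII characters. - Anthon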
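
The escaping and comment caveats raised in the comments above can be reproduced with a small sketch. The markup and variable names here are made up for illustration, and it assumes Beautiful Soup 4 on Python 3 with the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an escaped entity, a comment, and a child tag.
html = '<div id="a">1 &lt; 2<!-- note --><b>x</b></div>'
div = BeautifulSoup(html, "html.parser").find("div", id="a")

# Joining str() of each child loses entity escaping and comment delimiters.
naive = "".join(str(child) for child in div.contents)

# decode_contents() serializes the children the way the tag itself would.
proper = div.decode_contents()

print(naive)
print(proper)
```

The joined version should no longer contain `&lt;` or the `<!-- -->` markers, while the decode_contents() result keeps both.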

16

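Given a BS4 soup element such as <div id="outer"><div id="inner">foobar</div></div>, here are various methods and attributes for retrieving its HTML and text in different ways, along with an example of what each returns.

Inner HTML (encode_contents() returns a bytestring; use decode_contents() for a str):

inner_html = element.encode_contents()

b'<div id="inner">foobar</div>'

Outer HTML: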

outer_html = str(element)

'<div id="outer"><div id="inner">foobar</div></div>'

Outer HTML (prettified):

pretty_outer_html = element.prettify()

'''<div id="outer">
 <div id="inner">
  foobar
 </div>
</div>'''

Text only (using .text):

element_text = element.text

'foobar'

Text only (using .string):

element_string = element.string

'foobar'
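
To try these side by side, here is a minimal sketch that builds the sample element with the built-in html.parser (an assumption; any installed parser should behave the same for this markup) and collects each form:

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><div id="inner">foobar</div></div>'
element = BeautifulSoup(html, "html.parser").find(id="outer")

inner_bytes = element.encode_contents()  # bytes, UTF-8 by default
inner_text = element.decode_contents()   # str
outer_html = str(element)                # str, includes the outer tag

print(inner_bytes)
print(inner_text)
print(outer_html)
print(element.text, element.string)
```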

- Pikamander2

3

str(element) gives you the outerHTML; you can then strip the outer tag from that outer HTML string.

- Amir Saniyan

How do you remove the outer tag from the outer HTML string? - Oleg Yablokov

1

The easiest way is to use the children property.

inner_html = soup.find('body').children

It returns an iterator over the child nodes (not a list), so you can get the complete markup with a simple for loop.

for html in inner_html:
    print(html)

- Praveen Kumar

To get the content as a single string, use: "".join(map(str, soup.find('body').children)).strip() - Setop

1

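I think simply using unicode(x) is enough (on Python 3, that would be str(x)). It works for me. Edit: this gives you the outer HTML, not the inner HTML.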

- Michael Litvin

1

This returns the div including the outer element, not just its contents. - Arany

You're right. I'll leave this here for now in case it helps someone else. - Michael Litvin

1

If I'm not mistaken, you mean an example like this:

<div class="test">
    text in body
    <p>Hello World!</p>
</div>

and the output should look like this:

text in body
    <p>Hello World!</p>

Here is your answer:

''.join(map(str,tag.contents))

- BSimjoo

-4

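Text only: Beautiful Soup 4 `get_text()`

If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in the document or beneath the tag, as a single Unicode string: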

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
# '\nI linked to example.com\n'
soup.i.get_text()
# 'example.com'

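You can specify a string to be used to join the bits of text together:

soup.get_text("|")
# '\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
# 'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself: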

[text for text in soup.stripped_strings]
# ['I linked to', 'example.com']

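As of Beautiful Soup 4.9.0, when lxml or html.parser is in use, the contents of <script>, <style>, and <template> tags are not considered to be 'text', since those tags are not part of the human-visible content of the page.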
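
A small sketch of that behavior, assuming Beautiful Soup 4.9.0 or later with html.parser. The markup is made up for illustration; on older versions the script source would be expected to show up in the text:

```python
from bs4 import BeautifulSoup

html = "<div><p>Hello</p><script>var x = 1;</script></div>"
soup = BeautifulSoup(html, "html.parser")

# get_text() on the document skips the script body on 4.9.0+.
visible_text = soup.get_text()

# The script source is still reachable through the tag itself.
script_source = soup.script.string

print(repr(visible_text))
print(repr(script_source))
```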

See the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text

- Y Y

1

Did this question use to be a different question? - Driftr95

@Driftr95 It's been a long time; honestly, I've forgotten. - Y Y


- ChrisD · Accepted Answer

TL;DR

With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, the DOM's innerHTML method might look something like this:

def innerHTML(element):
    """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
    return element.encode_contents()

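These functions aren't currently in the online documentation, so I'll quote the current function definitions and the docstrings from the code.

`encode_contents` - since 4.0.4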

def encode_contents(
    self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
    formatter="minimal"):
    """Renders the contents of this tag as a bytestring.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param encoding: The bytestring will be in this encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

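See also the documentation on formatters. You'll most likely use either formatter="minimal" (the default) or formatter="html" (for HTML entities), unless you want to process the text manually in some way.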
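
To illustrate the difference between the two formatters, here is a minimal sketch. The sample text is made up, and html.parser is assumed as the parser:

```python
from bs4 import BeautifulSoup

# Hypothetical content with non-ASCII characters and an ampersand.
soup = BeautifulSoup("<p>café &amp; crème</p>", "html.parser")
p = soup.p

# "minimal" only escapes what is needed for valid markup (&, <, >).
minimal = p.encode_contents(formatter="minimal")

# "html" also replaces characters that have named HTML entities.
as_entities = p.encode_contents(formatter="html")

print(minimal)
print(as_entities)
```

With formatter="html" the accented letters should come back as named entities, so the resulting bytestring is plain ASCII.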

encode_contents returns an encoded bytestring. If you want a Python Unicode string, use decode_contents instead.

`decode_contents` - since 4.0.1

decode_contents does the same thing as encode_contents, but returns a Python Unicode string instead of an encoded bytestring.

def decode_contents(self, indent_level=None,
                   eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                   formatter="minimal"):
    """Renders the contents of this tag as a Unicode string.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param eventual_encoding: The tag is destined to be
       encoded into this encoding. This method is _not_
       responsible for performing that encoding. This information
       is passed in so that it can be substituted in if the
       document contains a <META> tag that mentions the document's
       encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

BeautifulSoup 3

BeautifulSoup 3 does not have the functions above; instead, it has renderContents.

def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                   prettyPrint=False, indentLevel=0):
    """Renders the contents of this tag as a string in the given
    encoding. If encoding is None, returns a Unicode string.."""

This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.
