Python Beautiful Soup .content Property

Question


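What is BeautifulSoup's .content? I am working through crummy.com's tutorial and I don't really understand what .content does. I have looked at the forums and have not found an answer. Looking at the code below...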

from BeautifulSoup import BeautifulSoup
import re



doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
        '</html>']

soup = BeautifulSoup(''.join(doc))
print soup.contents[0].contents[0].contents[0].contents[0].name

I expected the last line of the code to print "body", but instead I get...

  File "pe_ratio.py", line 29, in <module>
    print soup.contents[0].contents[0].contents[0].contents[0].name
  File "C:\Python27\lib\BeautifulSoup.py", line 473, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'name'

Is .content only concerned with html, head and title? If so, why is that?

Thanks in advance for the help.

- Robert Birch

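I suspect the code above doesn't work because .content initially only covers html, title and head, but not body, since body sits in a different class within the HTML hierarchy. Later in the tutorial, crummy prints the body using the code below, which makes me suspect that body is in a different hierarchy. If anyone else comes across this post, it is important to understand the HTML structure. See http://www.w3.org/TR/REC-html40/struct/global.html#h-7.5.1 - Robert Birch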

1 Answer


- Games Brainiac · Accepted Answer

It just gives you what is inside the tag. Let me demonstrate with an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly
head = soup.head

print(head.contents)

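The code above gives me a list, [<title>The Dormouse's story</title>], because that is what sits inside the head tag. Indexing it with [0] therefore gives you the first item in the list.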
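As a minimal sketch of the same idea (assuming bs4 is installed and Python's built-in "html.parser" is used; the variable names here are illustrative), the snippet below shows that a contents list can hold both tags and plain strings, using a shortened copy of the document above:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the example document from the answer.
html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# head has a single direct child: the title tag.
head_children = soup.head.contents
first_child = head_children[0]

# The paragraph mixes text and <a> tags, so its contents list
# interleaves string objects and Tag objects.
story = soup.find("p", class_="story")
kinds = [type(child).__name__ for child in story.contents]
link_ids = [child["id"] for child in story.contents if isinstance(child, Tag)]

print(first_child.name)
print(kinds)
print(link_ids)
```

Checking the type of each item before treating it as a tag avoids surprises when an index lands on a piece of text.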

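The reason you get the error is that soup.contents[0].contents[0].contents[0].contents[0] returns something that contains no further tags (and therefore has no tag attributes). With your code it returns Page title: the first contents[0] gives you the html tag, the second gives you the head tag, the third leads to the title tag, and the fourth gives you the actual text inside it, which is a NavigableString rather than a tag. So when you ask for its name, there is no tag name to give you.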
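The walk described above can be traced step by step. This is a sketch only: it assumes bs4 with the built-in "html.parser" rather than the older BeautifulSoup 3 import from the question, and the steps list is an illustrative helper, not part of the library.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']

soup = BeautifulSoup(''.join(doc), "html.parser")

# Follow contents[0] four times, recording what kind of object each step yields.
steps = []
node = soup
for _ in range(4):
    node = node.contents[0]
    if isinstance(node, NavigableString):
        # A string has no tag name, so record its text instead.
        steps.append((type(node).__name__, str(node)))
    else:
        steps.append((type(node).__name__, node.name))

for kind, label in steps:
    print(kind, label)
```

The last step lands on the text inside title, not on a tag, which is why asking it for a tag name failed in the question.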

If you want to print the body, you can do the following:

soup = BeautifulSoup(''.join(doc))
print(soup.body)

If you want to reach body using contents only, use the following:

soup = BeautifulSoup(''.join(doc))
print(soup.contents[0].contents[1].name)

You won't get it with [0] as the index, because body is the second element, after head.
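To see why the index is 1, it can help to list the direct children of html before indexing. The sketch below assumes bs4 with the built-in "html.parser"; tag_children and names are illustrative variable names. It filters on Tag so that any text nodes between the tags would not shift the positions.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']

soup = BeautifulSoup(''.join(doc), "html.parser")
html = soup.contents[0]

# Keep only the direct children that are tags, then read their names.
tag_children = [child for child in html.contents if isinstance(child, Tag)]
names = [child.name for child in tag_children]
print(names)

# body is the second tag child, so index 1 reaches it.
body = tag_children[1]

# Looking the tag up by name avoids counting positions at all.
body_by_name = html.find("body", recursive=False)
print(body is body_by_name)
```

Looking a tag up by name, as with soup.body in the answer, is usually less fragile than chaining positional indexes.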