Using Beautiful Soup to get data from a section that has no class

Question

python parsing python-2.7 html-parsing beautifulsoup

3

I'm still very new to Python and Beautiful Soup, and I'm having trouble getting text out of an HTML element that has no class.

Here is the HTML snippet I'm working with:

<section class="userbody">
    <script type="text/javascript"></script>
    <figure class="iw">
        <div id="ci">
            <img id="iwi" title="image 2" alt="" src="http://images.craigslist.org/00C0C_daJm4U9yU5B_600x450.jpg" style="min-width: inherit; min-height: 450px;"></img>
        </div>
        <div id="thumbs"></div>
    </figure>
    <div class="mapAndAttrs">
        <div class="mapbox">
            <div id="map" class="leaflet-container leaflet-fade-anim" data-longitude="-84.072447" data-latitude="33.908534" tabindex="0">
                <div class="leaflet-map-pane" style="transform: translate(0px, 0px);"></div>
                <div class="leaflet-control-container">
                    <div class="leaflet-top leaflet-left"></div>
                    <div class="leaflet-top leaflet-right"></div>
                    <div class="leaflet-bottom leaflet-left"></div>
                    <div class="leaflet-bottom leaflet-right">
                        <div class="leaflet-control-attribution leaflet-control"></div>
                    </div>
                </div>
            </div>
            <div class="mapaddress">

                Some Address

            </div>
        </div>
        <div class="attributes"></div>
    </div>
    <section id="postingbody">
            some posting info
            <br></br>
             more posting info
             <br></br>
    </section>
    <section class="cltags"></section>
    <div class="postinginfos"></div>
</section>

I can already extract the address:

     for address in soup.findAll("div", { "class" : "mapaddress" }):
       addressText = ''.join(address.findAll(text=True))

It looks like findAll() doesn't work for tags that have no class, as in this attempt:

     for post in soup.findall("section", { "id" : "postingbody" }):
       postText = ''.join(post.findAll(text=True))

How do I get the text inside the section with id "postingbody"?

- Jared

Thanks, everyone. This community is amazing!! - Jared

3 Answers

1

In addition to Games Brainiac's answer: to get the text, just append .text to the result.

So:

print soup.find(attrs={'id' : 'postingbody'}).text
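
Note that .text keeps the indentation and newlines from the source HTML. If that gets in the way, something along these lines may help. This is only a sketch: the `html` string is a trimmed copy of the question's markup, the "html.parser" backend is my assumption, and `raw_text` / `clean_text` are names made up for the example.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the HTML from the question (assumed input).
html = """
<section class="userbody">
    <section id="postingbody">
            some posting info
            <br></br>
             more posting info
             <br></br>
    </section>
</section>
"""

# "html.parser" is an assumed choice; any installed parser should do.
soup = BeautifulSoup(html, "html.parser")
body = soup.find(attrs={"id": "postingbody"})

# .text returns the text with the original indentation and newlines intact.
raw_text = body.text

# stripped_strings yields each text fragment with surrounding whitespace
# removed and skips fragments that are whitespace only.
clean_text = " ".join(body.stripped_strings)

print(clean_text)
```

Whether to join the fragments with a space or a newline depends on what you plan to do with the posting text.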

- SergioP

Thanks @SergioP. You know what I'm going to say :) - Jared

1

If you are using BeautifulSoup4, you can do this:

element = soup.find(id="postingbody")
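
One thing to watch for: find() returns None when nothing matches, so it can be worth checking before reading the text. A small sketch, assuming a one-line stand-in for the page and a made-up id ("no-such-id") that is deliberately absent:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the page in the question (assumed input).
html = '<div><section id="postingbody">some posting info</section></div>'
soup = BeautifulSoup(html, "html.parser")

element = soup.find(id="postingbody")
missing = soup.find(id="no-such-id")  # hypothetical id, not present in the HTML

# find() returns None when there is no match, so guard before using the result.
post_text = element.get_text(strip=True) if element is not None else ""
missing_text = missing.get_text(strip=True) if missing is not None else ""

print(post_text)
```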

- LuRsT

- Games Brainiac · Accepted Answer

OK, you can do it like this, assuming s is the HTML string:

from bs4 import BeautifulSoup

soup = BeautifulSoup(s)
print soup.find(attrs={'id' : 'postingbody'})

Output:

<section id="postingbody">
            some posting info
            <br/>
             more posting info
             <br/>
</section>
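
For what it's worth, the loop from the question can probably be made to work as well. Its second snippet calls `soup.findall` in lowercase, which is not a Beautiful Soup method, so that, rather than the missing class, is likely why it failed. A sketch under those assumptions, using a shortened version of the HTML and bs4's `find_all` (the older `findAll` spelling is kept as an alias); `post_texts` is just a name for the example:

```python
from bs4 import BeautifulSoup

# Shortened version of the question's HTML (assumed input).
html = """
<div class="mapaddress"> Some Address </div>
<section id="postingbody"> some posting info <br> more posting info <br></section>
"""
soup = BeautifulSoup(html, "html.parser")

# Same loop shape as in the question, but calling find_all.
# Filtering on an id works the same way as filtering on a class.
post_texts = []
for post in soup.find_all("section", {"id": "postingbody"}):
    # Collapse runs of whitespace so the text ends up on a single line.
    post_texts.append(" ".join(post.get_text().split()))

print(post_texts)
```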