BeautifulSoup: recursively getting text from nested divs

4

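I need to extract the data from a set of nested divs, but I can't get it to work.

The divs are nested several levels deep, and I need the data formatted properly.

I wrote some code using the bs4 module, but it fails with this error:

BeautifulSoup: AttributeError: 'NavigableString' object has no attribute 'name'

Please help!

My HTML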

<div id="new">
    <div id="newDat">
        <div class="Data">
            <div class="DataNew">
                <div class="DataNew new">
                    <div class="Data Left">
                        <div class="name"><a class="name" href="">Jack Daniels</a></div>
                        <div class="details"><span class="loc">Barcelona</span></div>
                        <div class="header"><a class="looking"> Looking for meeting new people</a></div>
                        <div class="ideas"><a class="ideas">I have new ideas</a></div>
                        <div class="profile"> <em class="profilss"></em>MS in cs<br></div>

                    </div>
                    <div class="Data Right">
                        <a class="phone"><span class="txt">+123123123123123231</span></a>
                    </div>
                </div>

            </div>
        </div>
        <div class="DataOne">
            <div class="DataNew">
                <div class="DataNew new">
                    <div class="Data Left">
                        <div class="name"><a class="name" href="">Jack Daniels</a></div>
                        <div class="details"><span class="loc">Barcelona</span></div>
                        <div class="header"><a class="looking"> Looking for meeting new people</a></div>
                        <div class="ideas"><a class="ideas">I have new ideas</a></div>
                        <div class="profile"> <em class="profilss"></em>MS in cs<br></div>

                    </div>
                    <div class="Data Right">
                        <a class="phone"><span class="txt">+123123123123123231</span></a>
                    </div>
                </div>

            </div>
        </div>
        <div class="DataTwo">
            <div class="DataNew">
                <div class="DataNew new">
                    <div class="Data Left">
                        <div class="name"><a class="name" href="">Jack Daniels</a></div>
                        <div class="details"><span class="loc">Barcelona</span></div>
                        <div class="header"><a class="looking"> Looking for meeting new people</a></div>
                        <div class="ideas"><a class="ideas">I have new ideas</a></div>
                        <div class="profile"> <em class="profilss"></em>MS in cs<br></div>

                    </div>
                    <div class="Data Right">
                        <a class="phone"><span class="txt">+123123123123123231</span></a>
                    </div>
                </div>  
            </div>
        </div>
        <div class="DataThree">
            <div class="DataNew">
                <div class="DataNew new">
                    <div class="Data Left">
                        <div class="name"><a class="name" href="">Jack Daniels</a></div>
                        <div class="details"><span class="loc">Barcelona</span></div>
                        <div class="header"><a class="looking"> Looking for meeting new people</a></div>
                        <div class="ideas"><a class="ideas">I have new ideas</a></div>
                        <div class="profile"> <em class="profilss"></em>MS in cs<br></div>

                    </div>
                    <div class="Data Right">
                        <a class="phone"><span class="txt">+123123123123123231</span></a>
                    </div>
                </div>

            </div>
        </div>
    </div>
</div>

My BeautifulSoup code

    li = page.find('div', {'id': 'new'})
    for tag in li:
        for i in tag.find_all("div", {"class": "name"}):
            print i.getText()
            break

        for i in tag.find_all("div", {"class": "details"}):
            print i.getText()
            break

        for i in tag.find_all("div", {"class": "header"}):
            print i.getText()
            break


        for i in tag.find_all("div", {"class": "ideas"}):
            print i.getText()
            break


        for i in tag.find_all("div", {"class": "profile"}):
            print i.getText()
            break

        for i in tag.find_all("div", {"class": "phone"}):
            print i.getText()
            break
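
A note on the likely cause of the error: iterating directly over a Tag (`for tag in li`) walks its direct children, and those include the whitespace between elements as `NavigableString` objects, which cannot be searched like a Tag. The sketch below illustrates this on a shortened, hypothetical version of the HTML above and filters the children with `isinstance`. The exact wording of the AttributeError can vary between bs4 versions, so it is not reproduced here.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the HTML in the question (an assumption for brevity).
html = """
<div id="new">
    <div id="newDat">
        <div class="name"><a class="name" href="">Jack Daniels</a></div>
    </div>
</div>
"""

page = BeautifulSoup(html, "html.parser")
li = page.find("div", {"id": "new"})

# Direct children of the Tag: whitespace text nodes sit around the inner div.
child_types = [type(child).__name__ for child in li]
print(child_types)

# Only Tag children are searched; the whitespace strings are skipped.
names = []
for child in li:
    if isinstance(child, Tag):
        for div in child.find_all("div", {"class": "name"}):
            names.append(div.get_text(strip=True))
print(names)
```

Calling `li.find_all(...)` directly on the container, instead of looping over its children, avoids the text nodes altogether.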

I would like the output to look like this:
Div one 
Name : Jack Daniels
Details : Barcelona
header : Looking for meeting new people
ideas : I have new ideas
profile: MS in cs
tel : +123123123123123231

Div two 
Name : Jack Daniels
Details : Barcelona
header : Looking for meeting new people
ideas : I have new ideas
profile: MS in cs
tel : +123123123123123231

and so on.

If there are 100 of these divs inside <div id="new">, I need output like the above for each of them.


Why do you use loops that break after the first iteration? You can just use find, e.g. tag.find("div", {"class": "name"}).text - t.m.adam
Thanks @t.m.adam, I already tried that, but I need to get the content div by div. - user8856212
1 Answer

1
You can do it like this. It returns the data for each div.
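from bs4 import BeautifulSoup

soup = BeautifulSoup(b, 'html.parser')  # b is the HTML string
# Match only the inner "DataNew new" divs. Searching for 'DataNew' alone also
# returns their outer wrappers, so every record would be printed twice.
rows = soup.find_all('div', {'class': 'new'})
for tag in rows:
    # Each find_all/break pair prints only the first match inside this record.
    for i in tag.find_all("div", {"class": "name"}):
        print(i.getText())
        break

    for i in tag.find_all("div", {"class": "details"}):
        print(i.getText())
        break

    for i in tag.find_all("div", {"class": "header"}):
        print(i.getText())
        break

    for i in tag.find_all("div", {"class": "ideas"}):
        print(i.getText())
        break

    for i in tag.find_all("div", {"class": "profile"}):
        print(i.getText())
        break

    # The phone number is in the "Data Right" div, not in a div with class "phone".
    for i in tag.find_all("div", {"class": "Data Right"}):
        print(i.getText())
        break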
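
To get the labelled layout the question asks for, one option is to pair each label with a CSS selector and look the fields up once per record. This is only a sketch under a few assumptions: `FIELDS`, `RECORD`, and `extract_records` are hypothetical names, the sample is a two-record reduction of the question's HTML, and records are numbered "Div 1", "Div 2" rather than spelled out.

```python
from bs4 import BeautifulSoup

# One record, copied from the question's HTML (inner "DataNew new" div only).
RECORD = """
<div class="DataNew"><div class="DataNew new">
  <div class="Data Left">
    <div class="name"><a class="name" href="">Jack Daniels</a></div>
    <div class="details"><span class="loc">Barcelona</span></div>
    <div class="header"><a class="looking"> Looking for meeting new people</a></div>
    <div class="ideas"><a class="ideas">I have new ideas</a></div>
    <div class="profile"> <em class="profilss"></em>MS in cs<br></div>
  </div>
  <div class="Data Right">
    <a class="phone"><span class="txt">+123123123123123231</span></a>
  </div>
</div></div>
"""

# Two wrappers are enough to show the pattern; the real page has more.
html = '<div id="new"><div id="newDat">%s</div></div>' % "".join(
    '<div class="%s">%s</div>' % (cls, RECORD) for cls in ("Data", "DataOne")
)

# (label, CSS selector relative to one record); labels follow the question.
FIELDS = [
    ("Name", "div.name"),
    ("Details", "div.details"),
    ("header", "div.header"),
    ("ideas", "div.ideas"),
    ("profile", "div.profile"),
    ("tel", "a.phone"),
]


def extract_records(markup):
    """Return one list of (label, text) pairs per "DataNew new" div."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    # div.DataNew.new requires both classes, so outer wrappers are skipped.
    for row in soup.select("#new div.DataNew.new"):
        record = []
        for label, selector in FIELDS:
            node = row.select_one(selector)
            # A missing field becomes an empty string instead of raising.
            record.append((label, node.get_text(strip=True) if node else ""))
        records.append(record)
    return records


records = extract_records(html)
for number, record in enumerate(records, start=1):
    print("Div %d" % number)
    for label, value in record:
        print("%s : %s" % (label, value))
    print("")
```

Because the selector list is data rather than repeated loops, adding or renaming a field means changing one line, and the same function handles any number of records inside `<div id="new">`.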
