Cannot scrape certain href links from a web page using Python and BeautifulSoup

3
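I am currently using Python 3.4 and bs4 to scrape a web page in order to collect the results of the matches Serbia played at Rio2016. This link contains the links to the results of every match she played, for example this one.
Next, I found that the link is located in the HTML source like this: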
<a href="/en/volleyball/women/7168-serbia-italy/post" ng-href="/en/volleyball/women/7168-serbia-italy/post">
    <span class="score ng-binding">3 - 0</span>
</a>

After many attempts, this href="/en/volleyball/women/7168-serbia-italy/post" never showed up. I then tried running the following code to get every href from the URL:

from bs4 import BeautifulSoup
import requests

Countryr = requests.get('http://rio2016.fivb.com/en/volleyball/women/teams/srb-serbia#wcbody_0_wcgridpadgridpad1_1_wcmenucontent_3_Schedule')
countrySoup = BeautifulSoup(Countryr.text)

for link in countrySoup.find_all('a'):
    print(link.get('href'))

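Then something strange happened: the output did not contain href="/en/volleyball/women/7168-serbia-italy/post" at all.
I found that this href sits inside a tab of the page, href="#scheduldedOver", which is controlled by the following HTML: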
<nav class="tabnav">
    <a href="#schedulded" ng-class="{selected: chosenStatus == 'Pre' }" ng-click="setStatus('Pre')" ng-href="#schedulded">Scheduled</a>
    <a href="#scheduldedLive" ng-class="{selected: chosenStatus == 'Live' }" ng-click="setStatus('Live')" ng-href="#scheduldedLive">Live</a>
    <a href="#scheduldedOver" class="selected" ng-class="{selected: chosenStatus == 'Over' }" ng-click="setStatus('Over')" ng-href="#scheduldedOver">Complete</a>
</nav>

So, how do I get an href that sits inside a tab using BeautifulSoup?

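You can't find that URL in the source because the data comes from a different URL: http://rio2016.fivb.com/en/api/volley/matches/WOG2016/en/user/team/3017. To build that URL, look for `data-serviceteammatches=` in the source. - akash karothiya
Yes. That is because your HTML does not contain this information; it only has <a ng-href="{{match.Url}}">. You can check it with print(Countryr.text). The link in the comment above is the way to get the URLs. - giaosudau
Thank you very much! Now that I have the right URL, it looks like it is just a plain text file. Does that mean BeautifulSoup has done its job and I need other string-search functions to get the information inside? I tried the following, and matchSoup seems to be just one long string without any class separators: `Matchr = requests.get('http://rio2016.fivb.com' + linkUrl); matchSoup = BeautifulSoup(Matchr.text); print(matchSoup.text)` - Benson
There is no need for BeautifulSoup here; I suggest you use the yaml or json module. - akash karothiya
Thanks @akashkarothiya! I can get the links I want now. - Benson
@Benson You're welcome, and good luck. - akash karothiya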
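
To illustrate the suggestion about the json module: the API response is JSON text rather than HTML, so there are no tags for BeautifulSoup to find. The following is a minimal sketch, assuming the response body is a JSON array of match records; `sample_body` is a hypothetical, heavily trimmed stand-in for `Matchr.text`, not the real response.

```python
import json

# Hypothetical stand-in for Matchr.text: a trimmed JSON array with one match record.
sample_body = (
    '[{"Id": 7168, '
    '"Url": "/en/volleyball/women/7168-serbia-italy/post", '
    '"Status": "Over"}]'
)

# json.loads turns the JSON text into ordinary Python lists and dicts.
matches = json.loads(sample_body)

for match in matches:
    print(match["Url"])
```

When the body comes from requests, calling `.json()` on the response performs the same decoding in one step.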
2 Answers

1
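The data is created dynamically; if you look at the actual source, you can see the AngularJS templates. You can still get all the information as JSON by mimicking the ajax call. In the source you can also see a div like this: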
<div id="AngularPanel" class="main-wrapper" ng-app="fivb"
data-servicematchcenterbar="/en/api/volley/matches/341/en/user/lives"
data-serviceteammatches="/en/api/volley/matches/WOG2016/en/user/team/3017"
data-servicelabels="/en/api/labels/Volley/en" 
data-servicelive="/en/api/volley/matches/WOG2016/en/user/live/">

Using the data-serviceteammatches href will give you all the information for the team's matches:
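from urllib.parse import urljoin  # Python 3; on Python 2 this lived in the urlparse module

import requests
from bs4 import BeautifulSoup

r = requests.get('http://rio2016.fivb.com/en/volleyball/women/teams/srb-serbia#wcbody_0_wcgridpadgridpad1_1_wcmenucontent_3_Schedule')
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(r.content, 'html.parser')

base = "http://rio2016.fivb.com/"

# The #AngularPanel div carries the API path in its data-serviceteammatches attribute.
api_path = soup.select_one("#AngularPanel")["data-serviceteammatches"]

# Fetch the endpoint and decode the JSON body into a list of match dicts.
matches = requests.get(urljoin(base, api_path)).json()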

In the JSON you will see output like the following:

{"Id": 7168, "MatchNumber": "006", "TournamentCode": "WOG2016", "TournamentName": "Women's Olympic Games 2016",
        "TournamentGroupName": "", "Gender": "", "LocalDateTime": "2016-08-06T22:35:00",
        "UtcDateTime": "2016-08-07T01:35:00+00:00", "CalculatedMatchDate": "2016-08-07T03:35:00+02:00",
        "CalculatedMatchDateType": "user", "LocalDateTimeText": "August 06 2016",
        "Pool": {"Code": "B", "Name": "Pool B", "Url": "/en/volleyball/women/results and ranking/round1#anchorB"},
        "Round": 68,
        "Location": {"Arena": "Maracanãzinho", "City": "Maracanãzinho", "CityUrl": "", "Country": "Brazil"},
        "TeamA": {"Code": "SRB", "Name": "Serbia", "Url": "/en/volleyball/women/teams/srb-serbia",
                  "FlagUrl": "/~/media/flags/flag_SRB.png?h=60&w=60"},
        "TeamB": {"Code": "ITA", "Name": "Italy", "Url": "/en/volleyball/women/teams/ita-italy",
                  "FlagUrl": "/~/media/flags/flag_ITA.png?h=60&w=60"},
        "Url": "/en/volleyball/women/7168-serbia-italy/post", "TicketUrl": "", "Status": "Over", "MatchPointsA": 3,
        "MatchPointsB": 0, "Sets": [{"Number": 1, "PointsA": 27, "PointsB": 25, "Hours": 0, "Minutes": "28"},
                                    {"Number": 2, "PointsA": 25, "PointsB": 20, "Hours": 0, "Minutes": "25"},
                                    {"Number": 3, "PointsA": 25, "PointsB": 23, "Hours": 0, "Minutes": "27"}],
        "PoolRoundName": "Preliminary Round", "DayInfo": "Weekend Day",
        "WeekInfo": {"Number": 31, "Start": 7, "End": 13}, "LiveStreamUri": ""},

You can parse whatever you need out of it.
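
As one possible illustration of pulling fields out of such a record, here is a minimal sketch. It uses a trimmed copy of the match record shown above (only the fields it reads), and `summarize` is a hypothetical helper name, not part of any library.

```python
# Trimmed copy of the match record shown above; only the fields used here are kept.
match = {
    "TeamA": {"Code": "SRB", "Name": "Serbia"},
    "TeamB": {"Code": "ITA", "Name": "Italy"},
    "Url": "/en/volleyball/women/7168-serbia-italy/post",
    "Status": "Over",
    "MatchPointsA": 3,
    "MatchPointsB": 0,
    "Sets": [
        {"Number": 1, "PointsA": 27, "PointsB": 25},
        {"Number": 2, "PointsA": 25, "PointsB": 20},
        {"Number": 3, "PointsA": 25, "PointsB": 23},
    ],
}


def summarize(match):
    """Hypothetical helper: build a one-line summary of a single match record."""
    # Join the per-set scores, e.g. "27-25, 25-20, 25-23".
    sets = ", ".join(
        "{}-{}".format(s["PointsA"], s["PointsB"]) for s in match["Sets"]
    )
    return "{} {} - {} {} ({})".format(
        match["TeamA"]["Name"],
        match["MatchPointsA"],
        match["MatchPointsB"],
        match["TeamB"]["Name"],
        sets,
    )


summary = summarize(match)
print(summary)
```

With the full response, the same function could be applied to each item in the decoded list, assuming every record carries these keys.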

Thanks @padraic! Very clear, and now I can get the links. - Benson

1
Thanks for your help; I can now get the correct URLs. This was a good learning experience for me. Many thanks :)
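from bs4 import BeautifulSoup
import requests

Countryr = requests.get('http://rio2016.fivb.com/en/volleyball/women/teams/srb-serbia#wcbody_0_wcgridpadgridpad1_1_wcmenucontent_3_Schedule')
# Name the parser explicitly so bs4 does not have to guess one.
countrySoup = BeautifulSoup(Countryr.text, 'html.parser')

# The API path for the team's matches is stored on the #AngularPanel div.
for link in countrySoup.find_all('div', {'id': 'AngularPanel'}):
    linkUrl = link.get('data-serviceteammatches')

# Named `matches` rather than `json` to avoid shadowing the standard json module.
matches = requests.get('http://rio2016.fivb.com' + linkUrl).json()

for item in matches:
    print(item.get('Url'))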

Output:

/en/volleyball/women/7168-serbia-italy/post
/en/volleyball/women/7172-serbia-puerto rico/post
/en/volleyball/women/7177-usa-serbia/post
/en/volleyball/women/7181-china-serbia/post
/en/volleyball/women/7187-serbia-netherlands/post
/en/volleyball/women/7195-russia-serbia/post
/en/volleyball/women/7198-serbia-usa/post
/en/volleyball/women/7200-china-serbia/post
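
The paths above are relative, and one of them contains a space. A rough sketch of turning them into absolute URLs follows, assuming the site root is http://rio2016.fivb.com and that percent-encoding the space is acceptable to the server; `absolute_url` is a hypothetical helper name.

```python
from urllib.parse import quote, urljoin

# Assumed site root, taken from the URLs used in the question.
BASE = "http://rio2016.fivb.com"


def absolute_url(path):
    """Hypothetical helper: percent-encode the path, then join it to the site root."""
    # quote() leaves "/" untouched by default and encodes the space as %20.
    return urljoin(BASE, quote(path))


full = absolute_url("/en/volleyball/women/7172-serbia-puerto rico/post")
print(full)
```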
