How do I get links from href and class using Beautiful Soup?

Question

3

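I'm writing a script to download multiple FLAC audio files from a website. I use Beautiful Soup to get the FLAC links and urlopen to download them.

I'd like Beautiful Soup to search for links that end in .flac (I don't know the file names, only the extension; for example, one file is XXX.flac and another is YYY.flac).

Here is the HTML for a FLAC file: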

<b><a class=location href="/soundtracks/index.php">Soundtracks</a><font class=location> &raquo </font><a href="/soundtracks/highquality/index.php">High Quality Game 
Soundtracks [FLAC]</a><font class=location> &raquo </font><a href="/soundtracks/highquality/Metal_Gear_20th_Anniversary/72">Metal Gear 20th Anniversary</a><font class=location> &raquo 01 Metal Gear 20 Years History -Past, Present, Future- Download</font></b><h1>Metal Gear 20th Anniversary Download Links:</h1><a style="font-size: 16px; font-weight:bold;" href="http://50.7.161.234/bks/94/245/Music/[029] MG 20th Anniversary [FLAC]/01 Metal Gear 20 Years History -Past, Present, Future-.flac">Metal Gear 20th Anniversary - 01 Metal Gear 20 Years History -Past, Present, Future-</a> <font face="Verdana" style="font-size: 16px;">Format: FLAC, Size: 76M</font><br> <font face="Verdana" style="font-size: 10px;"><b>Note: If the file starts playing in your browser window, try right-clicking and "Save Target As"</b></font><br>

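I've tried searching by id with t = soup.find(id="flac"), but I didn't get any relevant results. I'm lost and don't know how to solve this.

How do I get BS to search for and find the file link, and then assign that link to a variable?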

import mechanize
import urllib, urllib2, re
from bs4 import BeautifulSoup
####MECHANIZE####
br = mechanize.Browser()
res = br.open("http://www.emuparadise.me/soundtracks/highquality/Metal_Gear_20th_Anniversary/72")
a = 2 #COUNTER FOR LOOP
br.follow_link(text_regex='Download', nr=a)
b = br.geturl() #GETS THE URL
print b


page = urllib2.urlopen(b).read()
soup = BeautifulSoup(page)
soup.prettify()
t = soup.find(id="")
print t

- RN_

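I'm very new to Beautiful Soup and don't know of any way to solve this. I spent an hour researching but couldn't find anything relevant. Sorry about that. - RN_

You do realize that the HTML you've shown doesn't contain a single element with id="flac"? - Marcin

Like I said, I'm a newbie. How do I make it search for the "location" class? - RN_

Following up on @Marcin's comment: "If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id' attribute:" (http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments) - Kev

See: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class - Kev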
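
For readers following the link above, here is a minimal sketch of searching by CSS class. It assumes a shortened copy of the breadcrumb markup from the question; the variable names (html, location_tags, location_links) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the breadcrumb markup from the question (an assumption,
# not the full page).
html = (
    '<b><a class=location href="/soundtracks/index.php">Soundtracks</a>'
    '<font class=location> &raquo; </font>'
    '<a href="/soundtracks/highquality/index.php">'
    'High Quality Game Soundtracks [FLAC]</a></b>'
)

soup = BeautifulSoup(html, "html.parser")

# "class" is a reserved word in Python, so Beautiful Soup takes class_ instead.
location_tags = soup.find_all(class_="location")

# Restrict the search to <a> tags and pull out their href values.
location_links = [a["href"] for a in soup.find_all("a", class_="location")]

print([tag.name for tag in location_tags])
print(location_links)
```

Note that in the question's HTML the "location" class sits on the breadcrumb elements, not on the FLAC download link, so a class search alone would not find the file.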

1 Answer

- Kev · Accepted Answer

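Your code tries to match an id attribute, which doesn't exist on the anchors that link to the FLAC files.

Instead, use a regular expression to match hrefs that end in .flac: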

t = soup.find_all(href=re.compile(r"\.flac$"))
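
To cover the second half of the question (assigning the link to a variable), here is a self-contained sketch of the same idea. It assumes a trimmed copy of the HTML from the question rather than the live page, and the names flac_pattern, flac_urls and first_flac_url are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the download-page markup from the question (an assumption;
# the real script would pass in the page fetched with urlopen instead).
html = (
    '<h1>Metal Gear 20th Anniversary Download Links:</h1>'
    '<a style="font-size: 16px; font-weight:bold;" '
    'href="http://50.7.161.234/bks/94/245/Music/[029] MG 20th Anniversary [FLAC]/'
    '01 Metal Gear 20 Years History -Past, Present, Future-.flac">'
    'Metal Gear 20th Anniversary - 01</a> '
    '<a href="/soundtracks/highquality/index.php">'
    'High Quality Game Soundtracks [FLAC]</a>'
)

soup = BeautifulSoup(html, "html.parser")

# The escaped dot matches a literal "."; "$" anchors the match at the end of
# the href, so "[FLAC]" elsewhere in the URL or link text does not count.
flac_pattern = re.compile(r"\.flac$")

# find_all returns Tag objects; index each one with ["href"] to get the URL.
flac_urls = [tag["href"] for tag in soup.find_all("a", href=flac_pattern)]

# Keep the first match in a variable, or None if the page has no FLAC link.
first_flac_url = flac_urls[0] if flac_urls else None
print(first_flac_url)
```

The extracted URL contains spaces and brackets, so it may need percent-encoding before it is handed to urlopen.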