BeautifulSoup getting href

Question

340

I have the following soup:

<a href="some_url">next</a>
<span class="class">...</span>

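I want to extract the href attribute, whose value is "some_url".

I can do this when there is only one tag, but here there are two. I can also get the text 'next', but that is not what I want.

Also, is there a good description of the API somewhere, with examples? I am using the standard documentation, but I am looking for something better organized.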

- dkgirl

1

Please post a code sample showing how you tried to do this. - seb

4

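OK, I figured it out: soup.find('a')['href']. What confused me was that I was viewing the output through Django (as HTML), which strips the href before rendering, so soup.find('a') appeared to contain only 'next'. - dkgirl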

1

True, this question is a duplicate. But the beauty of @MarkLongair's answer makes it valuable, even years later. - Giampaolo Ferradini

1 Answer

- Mark Longair · Accepted Answer

You can use find_all as follows to find every a tag that has an href attribute, and print each one:

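# Python 2 (import from bs4: find_all is the BeautifulSoup 4 name;
# the old BeautifulSoup 3 package only provides findAll)
from bs4 import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only the <a> tags that actually carry an href attribute
for a in soup.find_all('a', href=True):
    print "Found the URL:", a['href']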

# The output would be:
# Found the URL: some_url
# Found the URL: another_url

# Python3
from bs4 import BeautifulSoup

html = '''<a href="https://some_url.com">next</a>
<span class="class">
<a href="https://some_other_url.com">another_url</a></span>'''

# Name the parser explicitly; omitting it makes bs4 guess and emit a warning
soup = BeautifulSoup(html, 'html.parser')

for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])

# The output would be:
# Found the URL: https://some_url.com
# Found the URL: https://some_other_url.com

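Note that if you are using an older version of BeautifulSoup (before version 4), this method is called findAll. In version 4, BeautifulSoup's method names were changed to comply with PEP 8, so you should use find_all instead.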
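As a rough illustration of the renaming, the sketch below calls both spellings on a bs4 soup and compares the results. It assumes a BeautifulSoup 4 release that still accepts the legacy camelCase name as a backward-compatible alias; newer releases may emit a deprecation warning for it, which is silenced here. The sample markup is the one used above.

```python
import warnings

from bs4 import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

soup = BeautifulSoup(html, "html.parser")

# Preferred PEP 8 style name in BeautifulSoup 4.
new_style = [a["href"] for a in soup.find_all("a", href=True)]

# Legacy BeautifulSoup 3 spelling. Assumption: the installed bs4 still keeps
# it as an alias; any deprecation warning it raises is ignored here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    old_style = [a["href"] for a in soup.findAll("a", href=True)]

print(new_style)
print(new_style == old_style)
```

New code should stick to find_all; the old name is shown only to make the correspondence concrete.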

If you want all tags that have an href attribute, regardless of tag name, you can omit the name parameter:

href_tags = soup.find_all(href=True)
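A minimal sketch of what that call returns, together with the single-tag case from the question. The markup is a made-up sample (the link tag and the anchor without an href are assumptions added for illustration), and html.parser is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a <link>, an <a> with href, and an <a> without href.
html = '''<link rel="stylesheet" href="style.css">
<a href="some_url">next</a>
<a name="anchor_only">no link here</a>'''

soup = BeautifulSoup(html, "html.parser")

# With no tag name given, href=True matches any tag that has an href attribute.
href_tags = soup.find_all(href=True)
hrefs = [(tag.name, tag["href"]) for tag in href_tags]
print(hrefs)

# For a single link, find() returns the first match or None, and .get()
# returns None instead of raising KeyError when the attribute is missing.
first_a = soup.find("a")
first_href = first_a.get("href") if first_a is not None else None
print(first_href)
```

Using tag.get('href') rather than tag['href'] is a defensive choice for cases where the search was not already filtered with href=True.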