Extracting the "src" attribute from an "img" tag using Beautiful Soup

Question

70

Consider:

<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>

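I want to extract the source (i.e. src) attribute from an image (i.e. img) tag using Beautiful Soup. I am using Beautiful Soup 4, and I cannot use a.attrs['src'] to get the src, although I can get the href. What should I do?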

- iDelusion

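Why would you expect a.attrs['src'] to work? There is no <a> tag with a src attribute in the snippet you showed. - jwodder

2

This is a completely different question from the one before, and the current title no longer makes sense. - patrick

@patrick I got the src using a regular expression. Are there any other problems? - iDelusion

@jwodder I saw that, but I also got an error when I used img.attrs['src']. I later got what I wanted with a regular expression. - iDelusion

Possible duplicate of "Python Beautifulsoup img tag parsing". - Abu Shoeb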

4 Answers

20

The link does not have a src attribute. You have to target the actual img tag.

import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# This returns the src attribute of the img tag nested inside the first 'a' tag
soup.a.img['src']
# 'some'

# If you have more than one 'a' tag
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])
# some
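
To see why a.attrs['src'] fails while a.attrs['href'] works, it can help to print the attribute dictionary of each tag. The following is a minimal sketch that reuses the HTML from the question; the variable names a_attrs and img_attrs are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# .attrs holds only the attributes written on that particular tag,
# not those of its children.
a_attrs = soup.a.attrs        # attributes of the <a> tag
img_attrs = soup.a.img.attrs  # attributes of the nested <img> tag

print(a_attrs)    # {'href': 'href'}
print(img_attrs)  # {'alt': 'some', 'src': 'some'}
```

Since 'src' is a key of the img tag's dictionary only, looking it up on the a tag raises a KeyError.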

- mx0

8

Here is a solution that does not raise a KeyError exception even when an img tag has no src attribute:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')

images = bs.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
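
As a possible variation on the has_attr check above, Beautiful Soup's Tag.get() returns None for a missing attribute, and find_all accepts src=True to match only tags that have the attribute. The sketch below uses a small hypothetical HTML string rather than a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <img> has no src attribute.
html = '<img src="a.png" alt="first"/><img alt="second"/><img src="b.png"/>'
soup = BeautifulSoup(html, "html.parser")

# Tag.get() returns None instead of raising KeyError when the attribute is absent.
sources_via_get = [img.get("src") for img in soup.find_all("img")]

# src=True restricts the search to <img> tags that carry a src attribute.
sources_filtered = [img["src"] for img in soup.find_all("img", src=True)]

print(sources_via_get)   # ['a.png', None, 'b.png']
print(sources_filtered)  # ['a.png', 'b.png']
```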

- blastoise

"KeyError": is that an exception? - Peter Mortensen
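
For reference, KeyError is a built-in Python exception (a subclass of LookupError), and indexing a tag with an attribute it does not have raises it. A minimal sketch, using a hypothetical img tag without a src attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical tag that has an alt attribute but no src attribute.
soup = BeautifulSoup('<img alt="no source here"/>', "html.parser")
img = soup.img

try:
    src = img["src"]  # dictionary-style lookup on a missing attribute
except KeyError as exc:
    src = None
    print("caught", type(exc).__name__, "for missing attribute", exc)
```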

6

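You can use Beautiful Soup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tag itself, but this can also be used for a URL together with urllib2. The solution provided in Abu Shoeb's answer no longer works on Python 3. Here is the correct implementation:

For URLs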

from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'

response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.find_all('img')

for image in images:
    # Print image source (None if the tag has no src attribute)
    print(image.get('src'))
    # Print alternate text (None if the tag has no alt attribute)
    print(image.get('alt'))

For text containing "img" tags

from bs4 import BeautifulSoup as BSHTML
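# Each img tag is closed so that both are parsed as separate tags
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BSHTML(htmlText, "html.parser")
images = soup.find_all('img')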
for image in images:
    print(image['src'])

- Gray

- Abu Shoeb · Accepted Answer

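You can use Beautiful Soup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tag itself, but this can also be used for a URL together with urllib2.

For URLs: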

from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # Print image source
    print(image['src'])
    # Print alternate text
    print(image['alt'])

For text containing img tags:

from BeautifulSoup import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    print(image['src'])

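Python 3:

from bs4 import BeautifulSoup as BSHTML
import urllib.request  # "import urllib" alone does not load the request submodule

page = urllib.request.urlopen('https://github.com/abushoeb/emotag')
soup = BSHTML(page, "html.parser")
images = soup.find_all('img')

for image in images:
    # Print image source (None if the tag has no src attribute)
    print(image.get('src'))
    # Print alternate text (None if the tag has no alt attribute)
    print(image.get('alt'))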

Install the modules if needed:

# Python 3
pip install beautifulsoup4
pip install urllib3