Extracting images from an HTML page with Python

Question

4

Below is my code. It tries to get the src of each image from the image tags in the HTML.

import re
for text in open('site.html'):
  matches = re.findall(r'\ssrc="([^"]+)"', text)
  matches = ' '.join(matches)
print(matches)

The problem is that when I give it something like this:

<img src="asdfasdf">

it works, but when I feed it an entire HTML page, it returns nothing. Why is that, and how can I fix it?

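site.html is just the HTML source of a website in standard format. I want the script to ignore everything else and print only the image sources. If you want to see what is in site.html, go to any basic HTML web page and copy all of its source code.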

- NoviceProgrammer

2 Answers

0

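You can do this with Beautiful Soup and the base64 module. Note that this applies when the images are embedded in the page as base64 data URIs.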

    import base64
    from bs4 import BeautifulSoup as BS

    with open('site.html') as html_wr:
        html_data = html_wr.read()

    # Name the parser explicitly so bs4 does not have to guess one
    soup = BS(html_data, 'html.parser')

    # Assumes every <img> src is a data URI such as "data:image/png;base64,...."
    for ind, imagetag in enumerate(soup.find_all('img')):
        image_data_base64 = imagetag['src'].split(',')[1]
        decoded_img_data = base64.b64decode(image_data_base64)
        with open(f'site_{ind}.png', 'wb') as img_wr:
            img_wr.write(decoded_img_data)

    ##############################################################
    # If you want particular images, you can use XPath

    import base64
    from lxml import etree
    from bs4 import BeautifulSoup as BS

    with open('site.html') as html_wr:
        html_data = html_wr.read()

    soup = BS(html_data, 'html.parser')
    dom = etree.HTML(str(soup))
    # The empty string is a placeholder and will not run; insert your XPath
    img_links = dom.xpath('')

    for ind, imagetag in enumerate(img_links):
        # Look up src by name instead of relying on attribute order
        image_data_base64 = imagetag.get('src').split(',')[1]
        decoded_img_data = base64.b64decode(image_data_base64)
        with open(f'site_{ind}.png', 'wb') as img_wr:
            img_wr.write(decoded_img_data)
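
Both snippets above assume that every src is a base64 data URI. For an ordinary URL or file path, `split(',')[1]` raises an IndexError. Here is a minimal sketch of a guard, using only the standard library. The helper name `decode_data_uri` is hypothetical and is not part of bs4 or lxml.

```python
import base64
import binascii


def decode_data_uri(src):
    """Return (media_type, raw_bytes) for a base64 data URI, else None.

    Hypothetical helper for illustration only.
    """
    if not src.startswith('data:'):
        return None  # ordinary URL or file path, nothing to decode
    header, sep, payload = src.partition(',')
    if not sep or not header.endswith(';base64'):
        return None  # no payload separator, or payload is not base64
    # Strip the leading "data:" and the trailing ";base64"
    media_type = header[len('data:'):-len(';base64')]
    try:
        return media_type, base64.b64decode(payload)
    except binascii.Error:
        return None  # payload has invalid base64 padding


sample = 'data:image/png;base64,' + base64.b64encode(b'fake-bytes').decode('ascii')
print(decode_data_uri(sample))
print(decode_data_uri('images/logo.png'))
```

In the loops above, you could call this on each src and skip the tag whenever it returns None.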

- itto shura

- TerryA · Accepted Answer

Why parse HTML with regular expressions when you can do it easily with a tool like BeautifulSoup:

>>> from bs4 import BeautifulSoup as BS
>>> html = """This is some text
... <img src="asdasdasd">
... <i> More HTML <b> foo </b> bar </i>
... """
>>> soup = BS(html, 'html.parser')
>>> for imgtag in soup.find_all('img'):
...     print(imgtag['src'])
... 
asdasdasd
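
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's `html.parser` module. This is a swapped-in alternative, not the approach shown above. The names `ImgSrcCollector` and `extract_img_srcs` are hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


def extract_img_srcs(html_text):
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


html = """This is some text
<img src="asdasdasd">
<i> More HTML <b> foo </b> bar </i>
<script src="app.js"></script>
<IMG alt="logo" SRC='logo.png' />
"""
print(extract_img_srcs(html))
```

Unlike a bare `src="..."` regex, this only looks at img tags, so the script source in the sample is skipped.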

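Your code fails because text is a single line of the file, so each iteration only finds the matches on that line. That could still work, but matches is overwritten on every line and is only printed after the loop ends. Consider the case where the last line contains no image tag: matches is then an empty list, join turns it into '', and that empty string is what gets printed.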
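
To make the overwriting concrete, here is a small sketch that runs the same pattern line by line. The `io.StringIO` object is a stand-in for `open('site.html')`, and its contents are made up for the demonstration.

```python
import io
import re

# Stand-in for open('site.html'); the last line has no src attribute
fake_file = io.StringIO(
    '<html>\n'
    '<img src="a.png">\n'
    '<img src="b.png">\n'
    '</html>\n'
)

pattern = re.compile(r'\ssrc="([^"]+)"')

overwritten = None
collected = []
for text in fake_file:
    found = pattern.findall(text)
    overwritten = ' '.join(found)  # replaced on every line, as in the question
    collected.extend(found)        # kept across lines

print(repr(overwritten))    # result for the last line only
print(' '.join(collected))  # every match in the file
```

Accumulating into a list keeps the line-by-line loop working, although it still misses a tag whose attributes span several lines.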

Instead, call findall on the entire HTML:

import re
with open('site.html') as html:
    content = html.read()
    matches = re.findall(r'\ssrc="([^"]+)"', content)
    matches = ' '.join(matches)

print(matches)

Using the with statement here is more Pythonic. It also means you don't need to call file.close() afterwards, because the with statement handles that for you.
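
One caveat with the pattern above: `\ssrc="..."` also matches src attributes on other tags such as script or iframe. If you stay with regular expressions, a rough sketch that only looks inside img tags might look like this. The pattern is illustrative, still handles only double-quoted values, and a real HTML parser remains the more robust choice.

```python
import re

# Made-up sample content standing in for the text read from site.html
content = (
    '<script src="app.js"></script>\n'
    '<img alt="x" src="a.png">\n'
    '<img\n'
    '  src="b.jpg">\n'
)

# <img, then any characters that stay inside the tag, then whitespace + src="..."
img_src_pattern = re.compile(r'<img\b[^>]*?\ssrc="([^"]+)"', re.IGNORECASE)

img_sources = img_src_pattern.findall(content)
print(' '.join(img_sources))
```

Because `[^>]` also matches newlines, this finds the second image even though its attributes sit on a separate line.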