How can I extract certain URLs from HTML?

Question

3

I need to extract all image links from a local HTML file. Unfortunately, I can't install bs4 or cssutils to process the HTML.

html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""

I tried to extract the data with a regular expression:

import re

images = []
for line in html.split('\n'):
    images.append(re.findall(r'(https://s2.*\?lastmod=\d+)', line))
print(images)

[['https://s2.example.com/path/image0.jpg?lastmod=1625296911'],
 ['https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912']]

I suspect my regex is greedy because I used .* in it. How can I get the following result?

images = ['https://s2.example.com/path/image0.jpg',
          'https://s2.example.com/path/image1.jpg',
          'https://s2.example.com/path/image2.jpg',
          'https://s2.example.com/path/image3.jpg']

If it helps, all links are enclosed in src="..." or url(...). Thanks for your help.

- user16312732

3 Answers

0

You can use

import re
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
images = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', html)
print(images)

See the Python demo. Output:

['https://s2.example.com/path/image0.jpg',
 'https://s2.example.com/path/image1.jpg',
 'https://s2.example.com/path/image2.jpg', 
 'https://s2.example.com/path/image3.jpg']

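See the regex demo. The pattern means:

https://s2 - literal text
[^\s?]* - zero or more characters other than whitespace and ?
(?=\?lastmod=\d) - immediately to the right there must be ?lastmod= followed by a digit (because this is a positive lookahead, that text is checked but not added to the match).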
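
To make the greedy behaviour the asker suspected concrete, here is a minimal sketch comparing three quantifier choices on a single line that holds two URLs. The sample string and the variable names (line, greedy, lazy, negated) are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical sample: two URLs on one line, like the second line in the question.
line = ('url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)'
        'url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)')

# Greedy: .* runs to the last "?lastmod=" on the line, so both URLs merge into one match.
greedy = re.findall(r'https://s2.*\?lastmod=\d+', line)

# Lazy: .*? stops at the first "?lastmod=", giving one match per URL.
lazy = re.findall(r'https://s2.*?(?=\?lastmod=\d)', line)

# Negated class: [^\s?]* can never step over a "?", so it also stops at the first one.
negated = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', line)

print(len(greedy))  # 1
print(lazy)
print(negated)
```

Both the lazy and the negated-class versions should return the two clean URLs here. The negated class is arguably the stricter choice, since it states directly which characters a URL path may contain instead of relying on backtracking.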

- Wiktor Stribiżew

0

import re
xx = '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911" alt="asdasd"><img a src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
# Collect every <img ...> tag first.
img_tags = re.findall(r"<img(?=\s|>)[^>]*>", xx)
url = []
for tag in img_tags:
    # Match the src attribute up to the first character outside the allowed set (such as '?').
    src = re.findall(r"src\s*=\s*['\"][\w:/.=]*", tag)
    if not src:
        continue
    # Drop the leading src=" so that only the URL remains.
    links = re.findall(r"https?[\w:/.=]*", src[0])
    if not links:
        continue
    url.append(links[0])
print(url)

- carlos alfredo castellanos cru

2

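Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Community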

- hasleron · Accepted Answer

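import re

# html is the string defined in the question.
# Start positions: just past src=" (5 characters) or url( (4 characters).
indices_start = sorted(
    [m.start() + 5 for m in re.finditer(r"src=", html)]
    + [m.start() + 4 for m in re.finditer(r"url", html)])
# End positions: right after each .jpg (dot escaped so it matches a literal dot).
indices_end = [m.end() for m in re.finditer(r"\.jpg", html)]

image_list = []

# Pair each start with the corresponding end and slice out the URL.
for start, end in zip(indices_start, indices_end):
    image_list.append(html[start:end])

print(image_list)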

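This is one solution I came up with. It finds the start and end indices of each image path string. If the images come in different types, it obviously needs to be adjusted.

Edit: changed the start criteria in case the document contains other URLs.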
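
If the document contains image types other than .jpg, one possible adjustment is to anchor on the same src=" and url( markers and allow several extensions in a single pattern. This is only a sketch: the sample HTML, the pattern variable, and the extension list are assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical sample mixing image types and one non-image link.
html = ('<img src="https://s2.example.com/path/a.png?lastmod=1">'
        '<a style="background-image:url(https://s2.example.com/path/b.jpg?lastmod=2)"></a>'
        '<a href="https://s2.example.com/page.html">link</a>')

# Begin right after src=" or url( and capture up to a known image extension.
# The character class stops at whitespace, ?, " or ), so the query string is left out.
# The extension list is an assumption; extend it as needed.
pattern = r'(?:src="|url\()([^\s?")]+\.(?:jpe?g|png|gif|webp))'

image_list = re.findall(pattern, html)
print(image_list)
```

Because the pattern has exactly one capturing group, re.findall returns just the captured URLs. The href link is skipped since it is neither preceded by src=" or url( nor ends in a listed extension.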