Python regex: matching a line whether or not it ends with something?

Question

3

I'm trying to scrape the following:

        <p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />

I've tried several variations of the following:

match = re.compile('^(.+?)<br \/><a href="https://www.somelink.com(.+?)">',re.DOTALL).findall(html)

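I want to match lines both with and without the "p" tag. The "p" tag only appears on the first occurrence. I'm not great at Python and pretty rusty. I've searched here and on Google for a long time, but nothing I found is quite the same. Any help would be greatly appreciated, as I'm stuck.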

The desired output is an index:

<a href="Some.Title.html">http://www.SomeLink.com/yep.html</a>
<a href="Some.Title.txt">http://www.SomeLink.com/yeppers.txt</a>

- Bobby Peters

7

Tip: don't use regex to parse HTML; use a tool built for the job, such as BeautifulSoup. - Vinícius Figueiredo

I have no idea how to use Beautiful Soup. This is a very rare occasion for me. Thanks for the suggestion; I really should learn it for times like this. - Bobby Peters

2

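Just that if you ever really need to go deep into HTML parsing, it's highly advisable to use a tool written specifically for it, since regular expressions can't handle nested patterns. What is the desired output? - Vinícius Figueiredo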

1

Rare occasions like this are a good chance to learn! As @ViníciusAguiar mentioned, Beautiful Soup is the better solution for this use case. - karthikr

1

This might help :) As the other comments above mention, definitely give BeautifulSoup a try. - Julian Chan

1 Answer

- Matthew Barlowe · Accepted Answer

As the commenters above pointed out, Beautiful Soup together with the requests module is better suited to this kind of task than regex.

import requests
import bs4

html_site = 'https://www.google.com'  # or whatever site you need scraped; requests needs the scheme (https://)
site_data = requests.get(html_site)  # downloads the page into a requests Response object
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser')  # parses the page text into a bs4 object
a_tags = site_parsed.select('a')  # selects all 'a' tags and returns a list of them

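This is just simple code that selects all the 'a' tags from the HTML page and stores them in a list, in the format shown above. I'd suggest looking at a good bs4 tutorial and at the actual documentation.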
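To tie this back to the snippet in the question, here is a minimal sketch. It assumes the HTML looks exactly like the sample posted, with each title as a bare text node followed by `<br />` and then its link. The `build_index` helper and the `sample_html` variable are names made up for this illustration; they are not part of bs4. It walks back from each 'a' tag to the nearest non-empty text node and uses that text as the title:

```python
from bs4 import BeautifulSoup, NavigableString

# The sample from the question, pasted as-is (hypothetical variable name).
sample_html = '''<p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />
'''


def build_index(html):
    """Pair each link with the text line just before it (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    entries = []
    for a in soup.find_all('a', href=True):
        title = None
        # previous_siblings yields nodes before the link, nearest first;
        # skip <br> tags and whitespace-only text until real text is found.
        for node in a.previous_siblings:
            if isinstance(node, NavigableString) and node.strip():
                title = node.strip()
                break
        if title:
            entries.append(f'<a href="{title}">{a["href"]}</a>')
    return entries


for line in build_index(sample_html):
    print(line)
```

Because the title is found by position rather than by tag, it makes no difference that only the first title sits directly after the "p" tag. Note that this emits each href exactly as it appears in the page (`https://www.somelink.com/...`), so the scheme and capitalisation differ from the desired output in the question. If the real page is laid out differently from the sample, the sibling walk would need adjusting.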