Extract part of a regex match

Question

241

I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')

Is there a regular expression that extracts only the contents of the <title> element, so that I don't have to strip the tags afterwards?

- hoju

10

Wow, I can't believe how many of the responses call for parsing the whole HTML page just to extract a simple title. What overkill! - hoju

5

The question title says it all: the example given happens to be HTML, but the underlying question is general. - Phil

11 Answers

67

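Note that starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), it is possible to improve on Krzysztof Krasoń's solution by capturing the match result as a variable directly in the if condition and reusing it in the body: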

import re

pattern = '<title>(.*)</title>'
text = '<title>hello</title>'
# The walrus operator binds the match object and tests it in one step.
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
    print(title)  # hello

- Xavier Guihot

6

Oh, that's neat. - EdwardG

12

Try using a capturing group:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

- Aaron Maenpaa

9

May I recommend Beautiful Soup. It is a very good library for parsing your HTML document.

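from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
titleName = soup.title.string  # .name would return the tag name 'title', not its text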

- kharagpur

I'd add that Beautiful Soup can parse incomplete HTML as well, which is really nice. - endre

7

Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

- Randy

If you really want to use a regex to parse HTML, don't call .group() directly on the result of re.search, because that result may be None. - iElectric

2

You should use .*? so that the match is still correct when the document contains more than one </title> (unlikely, but possible). - tonfa

@iElectric: you could put it in a try/except block if you really wanted to, right? - tonfa
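
To make the two comments above concrete, here is a minimal sketch of the difference between the greedy .* and the non-greedy .*?, together with the try/except guard tonfa mentions. The sample string and the helper name title_or_none are made up for illustration.

import re

html = "<title>first</title><p>text</p><title>second</title>"

# Greedy: .* runs on to the LAST </title> in the string.
greedy = re.search(r"<title>(.*)</title>", html, re.IGNORECASE)
# Non-greedy: .*? stops at the FIRST </title>.
lazy = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE)

print(greedy.group(1))  # first</title><p>text</p><title>second
print(lazy.group(1))    # first

def title_or_none(doc):
    """Hypothetical helper: return the title text, or None without a match."""
    try:
        return re.search(r"<title>(.*?)</title>", doc, re.IGNORECASE).group(1)
    except AttributeError:  # re.search returned None, which has no .group
        return None

print(title_or_none("<p>no title here</p>"))  # None

Testing the match object with an if statement does the same job as the try/except; which one reads better is largely a matter of taste.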

6

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

- Vinay Sajip

4

The code provided so far doesn't cope with exceptions. May I suggest:

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns the first captured group, or an empty string by default if the pattern is not found.
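
The one-liner is dense, so here is a sketch that unpacks it into a small function and shows both outcomes. The function name first_group_or_empty and the sample inputs are assumptions made for illustration only.

import re

def first_group_or_empty(pattern, s):
    # If re.search returns None, None has no 'groups' attribute, so getattr
    # falls back to the lambda; calling it yields [""] instead of raising.
    groups = getattr(re.search(pattern, s, re.IGNORECASE), "groups", lambda: [""])
    return groups()[0]

print(first_group_or_empty(r"<title>(.*)</title>", "<title>hello</title>"))  # hello
print(repr(first_group_or_empty(r"<title>(.*)</title>", "<p>none</p>")))     # ''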

- Steve K

4

I'd think this should suffice:

#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)

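This assumes that your text (HTML) is in a variable named "text".

It also assumes that no other HTML tags can legally be embedded inside an HTML TITLE element, and that there is no legal way to embed any other < character within such a container/block.

However...

Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you intend to write a full parser, which would be redundant work when various HTML, SGML and XML parsers are already in the standard library.)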
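
As a rough illustration of the standard-library route, the sketch below uses html.parser.HTMLParser to collect the text of the first <title> element. The class name TitleParser and the function extract_title are hypothetical, and this is only one possible way to structure it.

from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts) if self._done else None

def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title

print(extract_title("<html><head><title>Hello, world</title></head></html>"))  # Hello, world
print(extract_title("<p>no title</p>"))  # None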

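If you're handling "real world" tag-soup HTML (which frequently does not conform to any SGML/XML validator), use the BeautifulSoup package. It isn't in the standard library yet, but it is widely recommended for this purpose.

Another option is lxml, which is written for properly structured (standards-conformant) HTML. It does, however, have an option to fall back to BeautifulSoup as a parser: ElementSoup.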

- Jim Dennis

What is re.MULTILINE for here? It changes how ^ and $ match at line starts and ends, but you use neither. - bers
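
A short sketch of the point in this comment: re.MULTILINE does not let . cross newlines, whereas re.DOTALL does, and a negated character class such as [^<]* needs no flag at all. The sample string is made up for illustration.

import re

html = "<title>line one\nline two</title>"
pattern = r"<title>(.*?)</title>"

# re.MULTILINE only changes where ^ and $ match; '.' still stops at '\n'.
print(re.search(pattern, html, re.MULTILINE))  # None

# re.DOTALL lets '.' match newlines, so the title can span lines.
m = re.search(pattern, html, re.DOTALL)
print(repr(m.group(1)))  # 'line one\nline two'

# A negated class such as [^<]* matches newlines without any flag.
n = re.search(r"<title>([^<]*)</title>", html)
print(repr(n.group(1)))  # 'line one\nline two'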

3

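The currently top-voted answer by Krzysztof Krasoń fails on <title>a</title><title>b</title>. It also ignores title tags that cross line boundaries, for example because of line wrapping. Finally, it fails on <title >a</title> (which is valid HTML: whitespace is allowed inside XML/HTML tags).

I therefore propose the following improvement: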

import re

def search_title(html):
    m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else None

Test cases:

print(search_title("<title   >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))

Output:

with spaces in tags
with newline in tags
first of two titles
with newline
 in title

Ultimately, I agree with the others who recommend an HTML parser, not least because it also copes with non-standard use of HTML tags.

- bers

2

I needed to match identifiers such as package-0.0.1 (a name and a version), but wanted to reject invalid versions such as 0.0.010.

See the regex101 example.

import re

RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')

example = 'hello-0.0.1'

if match := RE_IDENTIFIER.search(example):
    name, version = match.groups()
    print(f'Name:     {name}')
    print(f'Version:  {version}')
else:
    raise ValueError(f'Invalid identifier {example}')

Output:

Name:     hello
Version:  0.0.1
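
To check that the pattern really rejects versions with leading zeros, a quick sketch can run it over a few inputs. The candidate strings below are made up for illustration; the pattern itself is the one from this answer.

import re

RE_IDENTIFIER = re.compile(
    r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$'
)

candidates = ['hello-0.0.1', 'hello-1.20.300', 'hello-0.0.010', 'hello-01.0.0', 'Hello-1.0.0']
for candidate in candidates:
    match = RE_IDENTIFIER.search(candidate)
    # Print the (name, version) groups, or None when the identifier is rejected.
    print(candidate, '->', match.groups() if match else None)

Only the first two candidates match: each version component must be either 0 or a number without a leading zero, and the name must be lower-case letters.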

- Stefan Falk

- Krzysztof Krasoń · Accepted Answer

Use ( ) in the regex and group(1) in Python to retrieve the captured string (re.search returns None if it finds no match, so don't call group() on the result directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)