Match everything until the entire regex matches again


This is the string I'm trying to match (the real string is much longer).

VC1000 Venture Capital 4 cr.
This is a class about venture capital
and more description, that could mention a future course like
VC2000 but might not
VC2000 venture capital II 4 cr.
Another description about blah
VC 3000 venture capital III 4-6 cr.
back again

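I'm trying to get groups like the following:

  • [VC1000]
  • [Venture Capital]
  • [4]
  • [This is a class about venture capital and more description, that could mention a future course like VC2000 but might not]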

I'm almost there, but I'm not sure how to capture the description between the course header lines. Right now I have:

(^\*?[A-Z]{2}\s?[0-9]{4}) (.*?)([0-9]|[0-9]-[0-9]+)\s?cr\.

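But I'm not sure how to continue. Adding .* matches too much, and following .* with the first group above prevents the first group from being captured on every other match.

What trick am I missing?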
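
The "every other match" symptom can be reproduced in isolation. The sketch below uses a simplified header pattern (an assumption for illustration, not the exact regex above) and a shortened sample text. When the next header is consumed as the terminator, the scan resumes after it, so that course can never start a match of its own. When the terminator is only checked with a lookahead, it stays available for the next match.

```python
import re

text = """\
VC1000 Venture Capital 4 cr.
This is a class about venture capital
VC2000 venture capital II 4 cr.
Another description about blah
VC 3000 venture capital III 4-6 cr.
back again"""

# Simplified header pattern (hypothetical, for illustration only).
header = r'^[A-Z]{2}\s?[0-9]{4} [^\n]*?[0-9]\s?cr\.$'

# Terminator is consumed: the next header becomes part of the current match.
consuming = re.compile(rf'({header})(.*?)(?:{header}|\Z)', re.S | re.M)

# Terminator is only peeked at with a lookahead: nothing is consumed.
peeking = re.compile(rf'({header})(.*?)(?={header}|\Z)', re.S | re.M)

consumed_headers = [m.group(1) for m in consuming.finditer(text)]
peeked_headers = [m.group(1) for m in peeking.finditer(text)]

print(consumed_headers)  # VC2000 is missing here
print(peeked_headers)    # all three headers
```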


Do the course lines always end with cr.? - Nick
Try anchoring the regex with "^" at the start and "$" at the end. - lemon
1 Answer


Try (regex101):

import re

pat = r'^([A-Z]{2}\s*\d{4})\s+([^\n]+?)(\d+-?\d*\s+cr\.)$(.*?)(?=^[A-Z]{2}\s*\d{4}\s+[^\n]+?\d+-?\d*\s+cr\.$|\Z)'
pat = re.compile(pat, flags=re.S|re.M)

text = '''\
VC1000 Venture Capital 4 cr.
This is a class about venture capital
and more description, that could mention a future course like
VC2000 but might not
VC2000 venture capital II 4 cr.
Another description about blah
VC 3000 venture capital III 4-6 cr.
back again'''

for a, b, c, d in pat.findall(text):
    print(a)
    print(b)
    print(c)
    print(d)
    print('-' * 80)

Output:

VC1000
Venture Capital 
4 cr.

This is a class about venture capital
and more description, that could mention a future course like
VC2000 but might not

--------------------------------------------------------------------------------
VC2000
venture capital II 
4 cr.

Another description about blah

--------------------------------------------------------------------------------
VC 3000
venture capital III 
4-6 cr.

back again
--------------------------------------------------------------------------------
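
If the groups should look exactly like the list in the question (credits without "cr.", no trailing space after the title, description collapsed onto one line), one possible variation is sketched below. The `HEADER` name, the narrower credits group, and the whitespace-collapsing step are assumptions added here for illustration. They are not part of the pattern above, and the sketch has only been checked against the sample text.

```python
import re

text = '''\
VC1000 Venture Capital 4 cr.
This is a class about venture capital
and more description, that could mention a future course like
VC2000 but might not
VC2000 venture capital II 4 cr.
Another description about blah
VC 3000 venture capital III 4-6 cr.
back again'''

# Non-capturing copy of the header, used only inside the lookahead.
HEADER = r'[A-Z]{2}\s*\d{4}\s+[^\n]+?\s+\d+(?:-\d+)?\s+cr\.'

pat = re.compile(
    # code, title, credits (number or range only, "cr." left outside the group)
    r'^([A-Z]{2}\s*\d{4})\s+([^\n]+?)\s+(\d+(?:-\d+)?)\s+cr\.$'
    # description: as little as possible...
    r'(.*?)'
    # ...up to the next full header line or the end of the text
    rf'(?=^{HEADER}$|\Z)',
    flags=re.S | re.M,
)

courses = [
    # Collapse newlines and repeated spaces in the description into single spaces.
    (code, title, cr, ' '.join(desc.split()))
    for code, title, cr, desc in pat.findall(text)
]

for course in courses:
    print(course)
```

The lookahead still requires a complete header line, so the description line that merely starts with VC2000 is not mistaken for the start of a new course.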

Was just writing almost exactly the same regex... - Nick
