Extracting matched groups from a Python regular expression

3
I am trying to extract matched groups from a Python string, but I have run into a problem.
The string looks like this.
1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc

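I need to treat anything that starts with a number followed by uppercase letters as a title, and extract each title together with its contents.

This is the output I expect.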

1. TITLE ABC Contents of title ABC and some other text
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title cdc

I tried the following regular expression

(\d\.\s[A-Z\s]*\s)

and got the following.
1. TITLE ABC 
2. TITLE BCD 
3. TITLE CDC

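If I add .* to the end of the regular expression, the matched groups are affected. I think I am missing something simple here. I have tried everything I know but cannot solve it. Any help would be appreciated.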
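The two attempts described above can be reproduced as follows. This is a minimal sketch that assumes the sample text is on a single line; the variable names are only illustrative.

```python
import re

text = ("1. TITLE ABC Contents of title ABC and some other text "
        "2. TITLE BCD This would have contents on title BCD and maybe something else "
        "3. TITLE CDC Contents of title cdc")

# [A-Z\s]* accepts only capitals and whitespace, so each match ends at the
# first lowercase letter and the regex backtracks to the preceding space.
headings = re.findall(r'(\d\.\s[A-Z\s]*\s)', text)
print(headings)

# A greedy .* runs to the end of the line, so the first match swallows
# every later item as well and only one result comes back.
greedy = re.findall(r'(\d\.\s[A-Z\s]*\s.*)', text)
print(len(greedy))
```

So the body needs a lazy quantifier plus something that marks where the next numbered item begins.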

Your character class is missing lowercase letters. - Code Maniac
4 Answers

2

Use (\d+\.[\da-z]* [A-Z]+[\S\s]*?(?=\d+\.|$))

Here is the relevant code

import re
text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

result = re.findall(r'('
                    r'\d+\.'   # Match a number and a '.' character
                    r'[\da-z]*' # If present, include any additional numbers/letters
                    r'(?:\.[\da-z]+)*' # Match additional subpoints.
                                       # Each of these subpoints must start with a '.'
                                       # and then have one or more numbers/letters
                    r' '   # Match a space. This is how we know to stop looking for subpoints,
                           # and to start looking for capital letters
                    r'[A-Z]+'  # Match at least one capital letter.
                               # Use [A-Z]{2,} to match 2 or more capital letters
                    r'[\S\s]*?'  # Match everything including newlines.
                                 # Use .*? if you don't care about matching newlines
                    r'(?=\d+\.|$)'  # Stop matching at a number and a '.' character,
                                    # or stop matching at the end of the string,
                                    # and don't include this match in the results.
                    r')',
                    text)
print(result)

Diagram explaining the regular expression

There is also a more detailed explanation here of each character used in the regular expression.


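This solution is very good and works in most cases. However, it breaks if the contents include a number. For example, it fails if the text is "1. TITLE ABC Contents of title ABC and some other text for 14 days". - Ashok KS
I have edited my answer so that it also works when the title contains numbers. - hostingutilities.com
Thanks for the solution. It does not work if the contents include sub-items such as 2.1.
  1. TITLE ABC Contents of title ABC and some other text
  2. TITLE BCD This would have contents on title BCD and maybe something else 2.2 Text part 2.3 Text part
  3. TITLE CDC Contents of title cdc
Any suggestions for this?
- Ashok KS
I have also made it handle multiple subpoints, so 1.1.11.a1.2.a1.2.3.4.5 are all valid. - hostingutilities.com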
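The sub-item case raised in these comments can be checked with a small sketch. It rewrites the same idea with re.VERBOSE so the comments sit inside the pattern, and it assumes each subpoint is a '.' followed by one or more digits or lowercase letters. The names SECTION and sample are hypothetical, and the sample text is made up for illustration.

```python
import re

# Verbose form of the pattern; whitespace is ignored except inside [ ].
SECTION = re.compile(r"""
    (
      \d+\.             # item number followed by a dot
      [\da-z]*          # optional extra digits or letters, as in 2.2
      (?:\.[\da-z]+)*   # further subpoints (assumed: one or more characters each)
      [ ]               # a literal space ends the numbering
      [A-Z]+            # the title starts with at least one capital letter
      [\S\s]*?          # lazily take everything, newlines included
      (?=\d+\.|$)       # stop before the next number-dot or at end of string
    )
""", re.VERBOSE)

sample = ("1. TITLE ABC Contents for 14 days "
          "2. TITLE BCD More text "
          "2.2 TEXT PART sub text "
          "3. TITLE CDC Contents of title cdc")

sections = SECTION.findall(sample)
for section in sections:
    print(repr(section))
```

In this sketch the number 14 inside the contents does not end the first item because it is not followed by a dot, while 2.2 starts an item of its own because it is followed by a space and a capital letter.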

1
Your regular expression leaves lowercase letters out of the character class, so it only matches uppercase words. You can simply use this.
(\d\.[\s\S]+?)(?=\d+\.|$)


Sample code
import re
text = """1. TITLE ABC Contents of 14 title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""
result = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)
print(result)

Output


['1. TITLE ABC Contents of 14 title ABC and some other text ', '2. TITLE BCD This would have contents on \ntitle BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']
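One thing to watch if the numbering goes past 9: the pattern begins with \d\., which matches a single digit. The sketch below uses a made-up string (numbered is a hypothetical name) to compare it with a \d+\. variant. The variant is a suggested adjustment, not part of the answer above.

```python
import re

numbered = "9. NINE some text 10. TEN more text 11. ELEVEN last text"

# Pattern as given: a single digit before the dot.
single_digit = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', numbered)
print(single_digit)

# Variant: one or more digits before the dot.
multi_digit = re.findall(r'(\d+\.[\s\S]+?)(?=\d+\.|$)', numbered)
print(multi_digit)
```

With the single-digit form, the leading digit of 10 and 11 is left out of the match, so the variant may be the safer choice for long lists.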

Regex demo

Note: if you use the single-line flag (re.DOTALL in Python), you can even replace [\s\S]+? with .*?, because . will then match newlines as well.
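A short sketch of that note, assuming the same sample text with its embedded newline. It uses .+? rather than .*? so the one-or-more quantifier of the original is kept. The variable names are illustrative.

```python
import re

text = ("1. TITLE ABC Contents of 14 title ABC and some other text "
        "2. TITLE BCD This would have contents on \n"
        "title BCD and maybe something else 3. TITLE CDC Contents of title cdc")

# Character-class form: [\s\S] matches any character, newline included.
with_class = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)

# Dot form: needs re.DOTALL so that . also matches the newline.
with_dotall = re.findall(r'(\d\..+?)(?=\d+\.|$)', text, flags=re.DOTALL)

# Without the flag, the item that spans the newline cannot be matched.
without_flag = re.findall(r'(\d\..+?)(?=\d+\.|$)', text)

print(with_dotall == with_class)
print(len(without_flag))
```

Both forms return the same three items here; the version without the flag loses the second item because . stops at the line break.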


0
You can use re.findall and re.split

import re
s = "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc"
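pattern = r'\d+\.\s[A-Z]+'
t = re.findall(pattern, s)                    # number plus the first capitalised word
c = list(filter(None, re.split(pattern, s)))  # text between those matches, empty leading piece dropped
result = [f'{a}{b}' for a, b in zip(t, c)]
print(result)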

Output:

['1. TITLE ABC Contents of title ABC and some other text ', '2. TITLE BCD This would have contents on title BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']
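A possible variation on the same idea, offered as a sketch rather than as part of the answer: splitting on a zero-width lookahead keeps each heading attached to its text, so no zip step is needed. It assumes Python 3.7 or later, where re.split accepts patterns that match the empty string.

```python
import re

s = ("1. TITLE ABC Contents of title ABC and some other text "
     "2. TITLE BCD This would have contents on title BCD and maybe something else "
     "3. TITLE CDC Contents of title cdc")

# Split just before every "number. CAPITAL" without consuming any text.
parts = re.split(r'(?=\d+\.\s[A-Z])', s)

# The split at position 0 leaves an empty first piece, so drop empties.
result = [p for p in parts if p]
print(result)
```

Like the original, this relies on the contents never containing a number, a dot, whitespace and a capital letter in sequence.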

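The string has no titles as such. Any text that starts with a number followed by all-uppercase letters is treated as a title. This data is for demonstration only. In my case there could be as many as 1000 titles. - Ashok KS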

0
import re
a=r'1. TITLE ABC Contents of 2title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc'
res = re.findall(r'(\d\.\s[A-Za-z0-9\s]*\s)', a)
for e in res:
    print(e)

Output

1. TITLE ABC Contents of 2title ABC and some other text 
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title 
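The last line of this output is missing the final word cdc. The pattern ends with a mandatory \s, and the string ends without trailing whitespace, so the regex backtracks to the last space. One possible adjustment, shown as a sketch and not as the answerer's code, is to accept either whitespace or the end of the string.

```python
import re

a = ('1. TITLE ABC Contents of 2title ABC and some other text '
     '2. TITLE BCD This would have contents on title BCD and maybe something else '
     '3. TITLE CDC Contents of title cdc')

# Pattern from the answer: the trailing \s drops the last word.
original = re.findall(r'(\d\.\s[A-Za-z0-9\s]*\s)', a)
print(repr(original[-1]))

# Adjusted ending: whitespace or end of string.
adjusted = re.findall(r'(\d\.\s[A-Za-z0-9\s]*(?:\s|$))', a)
print(repr(adjusted[-1]))
```

The character class still excludes punctuation, so contents with commas or other symbols would end a match early with either version.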


1
I think you meant "not needed". Got it. - moys
