Extracting matched groups from a Python regular expression

Question

3

I am trying to extract matched groups from a Python string, but I am running into problems.

The string looks like this.

1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc

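I need to treat anything that starts with a number followed by uppercase letters as a title, and extract the content that belongs to that title.

This is the output I expect.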

1. TITLE ABC Contents of title ABC and some other text
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title cdc

I tried the following regular expression

(\d\.\s[A-Z\s]*\s)

and got the following.

1. TITLE ABC 
2. TITLE BCD 
3. TITLE CDC

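If I add .* to the end of the regex, the matched groups are affected. I think I am missing something simple here. I have tried everything I know but cannot solve it. Any help would be appreciated.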

- Ashok KS

Your character class is missing lowercase letters. - Code Maniac
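To make the behaviour described in the question concrete, here is a small sketch that runs the original pattern and then the same pattern with .* appended. The text is a shortened, single-line variant of the sample string, and the variable names are illustrative only.

```python
import re

# Shortened, single-line variant of the sample text (an assumption for brevity).
text = ("1. TITLE ABC Contents of title ABC 2. TITLE BCD This has contents "
        "3. TITLE CDC Contents of title cdc")

# Original pattern: the class [A-Z\s] cannot consume lowercase letters,
# so each match stops right after the uppercase title words.
titles_only = re.findall(r"(\d\.\s[A-Z\s]*\s)", text)
print(titles_only)  # ['1. TITLE ABC ', '2. TITLE BCD ', '3. TITLE CDC ']

# Appending .* lets the first match swallow the rest of the line,
# so the later titles are never found as separate matches.
with_dot_star = re.findall(r"(\d\.\s[A-Z\s]*\s).*", text)
print(with_dot_star)  # ['1. TITLE ABC ']
```

Because . is greedy and does not stop at the next number, a lazy quantifier combined with a lookahead is the usual way to bound each section.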

4 Answers

1

In your regex, you left lowercase letters out of the character class, so it only matches uppercase words. You can simply use this.

(\d\.[\s\S]+?)(?=\d+\.|$)

Sample code

import re
text = """1. TITLE ABC Contents of 14 title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""
result = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)
print(result)

Output

['1. TITLE ABC Contents of 14 title ABC and some other text ', '2. TITLE BCD This would have contents on \ntitle BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']

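Regex demo

Note: If you use the single-line flag (re.DOTALL in Python), you can even replace [\s\S]+? with .*?, because . will then match newlines as well.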
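As a rough illustration of that note, the sketch below compares the [\s\S] form with a re.DOTALL form on a two-line sample. The sample text and variable names are made up for the example, and .+? is used instead of .*? so that the two patterns mirror each other exactly.

```python
import re

# Hypothetical two-section sample with a newline inside the first section.
text = "1. TITLE ABC first\nline 2. TITLE BCD second"

# [\s\S] matches any character, newlines included, without any flag.
class_sections = re.findall(r"(\d\.[\s\S]+?)(?=\d+\.|$)", text)

# With re.DOTALL, a plain . matches newlines too, so the results agree.
dotall_sections = re.findall(r"(\d\..+?)(?=\d+\.|$)", text, flags=re.DOTALL)

# Without the flag, . stops at the newline and the first section is lost.
no_flag_sections = re.findall(r"(\d\..+?)(?=\d+\.|$)", text)

print(class_sections)    # ['1. TITLE ABC first\nline ', '2. TITLE BCD second']
print(dotall_sections)   # ['1. TITLE ABC first\nline ', '2. TITLE BCD second']
print(no_flag_sections)  # ['2. TITLE BCD second']
```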

- Code Maniac

0

You can use re.findall and re.split:

import re
s = "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc"
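titles = re.findall(r'\d+\.\s[A-Z]+', s)  # the '<number>. WORD' markers
contents = list(filter(None, re.split(r'\d+\.\s[A-Z]+', s)))  # text between markers, empty strings dropped
result = [f'{a}{b}' for a, b in zip(titles, contents)]
print(result)

Output: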

['1. TITLE ABC Contents of title ABC and some other text ', '2. TITLE BCD This would have contents on title BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']

- Ajax1234

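The string has no title as such. Any text that starts with a number followed by all-uppercase letters is treated as a title. This data is for demonstration purposes only. In my case there could be as many as 1000 titles.  - Ashok KS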
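With up to 1000 titles, the section numbers will have more than one digit, and a pattern that begins with a single \d would start the tenth match at the 0 of "10.". One possible sketch, assuming a title is a number, a dot, a space and at least two uppercase letters, is to split on the whitespace that precedes each such marker. The sample text and the two-letter threshold are assumptions, not part of the original thread.

```python
import re

# Hypothetical sample with multi-digit section numbers and a stray number (14).
text = "9. TITLE ABC text with 14 items 10. TITLE BCD more text 11. TITLE CDC end"

# Split on whitespace only when it is followed by '<digits>. ' and
# at least two uppercase letters; the lookahead keeps the marker in the result.
sections = re.split(r"\s+(?=\d+\.\s[A-Z]{2,})", text)
print(sections)
# ['9. TITLE ABC text with 14 items', '10. TITLE BCD more text', '11. TITLE CDC end']
```

The number 14 is left alone because it is not followed by a dot, and a sentence such as "in version 2. Then" would not split either, since "Then" has only one leading uppercase letter.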

0

import re
a=r'1. TITLE ABC Contents of 2title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc'
# Note: the trailing \s means a final word with no whitespace after it is not captured.
res = re.findall(r'(\d\.\s[A-Za-z0-9\s]*\s)', a)
for e in res:
    print(e)

Output

1. TITLE ABC Contents of 2title ABC and some other text 
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title

- moys

1

I think you meant "not required". Got it. - moys

- hostingutilities.com · Accepted Answer

Use (\d+\.[\da-z]* [A-Z]+[\S\s]*?(?=\d+\.|$))

Here is the relevant code

import re
text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

result = re.findall(r'('
                    r'\d+\.'   # Match a number and a '.' character
                    r'[\da-z]*' # If present, include any additional numbers/letters
                    r'(?:\.[\da-z])*' # Match additional subpoints.
                                      # Each of these subpoints must start with a '.'
                                      # followed by a single number or lowercase letter
                    r' '   # Match a space. This is how we know to stop looking for subpoints,
                           # and to start looking for capital letters
                    r'[A-Z]+'  # Match at least one capital letter.
                               # Use [A-Z]{2,} to match 2 or more capital letters
                    r'[\S\s]*?'  # Match everything including newlines.
                                 # Use .*? if you don't care about matching newlines
                    r'(?=\d+\.|$)'  # Stop matching at a number and a '.' character,
                                    # or stop matching at the end of the string,
                                    # and don't include this match in the results.
                    r')'
                    , text)
print(result)

A more detailed explanation of each character used in the regular expression is also available here.
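If the title and its contents are needed as separate groups rather than as one string, a variation with named groups is one option. This is only a sketch: it assumes that a title is the run of all-uppercase words right after the number, and the sample text, pattern and names (section_re, pairs) are illustrative rather than taken from any of the answers.

```python
import re

# Shortened sample text; the stray number 14 must not end a section.
text = ("1. TITLE ABC Contents of 14 title ABC 2. TITLE BCD This has contents "
        "3. TITLE CDC Contents of title cdc")

section_re = re.compile(
    r"(?P<title>\d+\.\s[A-Z]+\b(?:\s[A-Z]+\b)*)"  # number, dot, all-uppercase words
    r"\s*(?P<body>[\s\S]*?)\s*"                    # everything up to the next title
    r"(?=\d+\.\s[A-Z]|$)"                          # stop before the next title or at the end
)

pairs = [(m.group("title"), m.group("body")) for m in section_re.finditer(text)]
for title, body in pairs:
    print(f"{title!r} -> {body!r}")
# '1. TITLE ABC' -> 'Contents of 14 title ABC'
# '2. TITLE BCD' -> 'This has contents'
# '3. TITLE CDC' -> 'Contents of title cdc'
```

The \b after each [A-Z]+ keeps a capitalised word such as "Contents" from contributing its first letter to the title.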