Python regex: text between two strings

Question

4

I have some text like this:

CustomerID:1111,

text1

CustomerID:2222,

text2

CustomerID:3333,

text3

CustomerID:4444,

text4

CustomerID:5555,

text5

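Each text spans multiple lines.

I want to store each customer ID together with its corresponding text in a tuple, e.g. (1111, text1), (2222, text2), and so on.

First, I tried this expression: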

re.findall('CustomerID:(\d+)(.*?)CustomerID:', rawtxt, re.DOTALL)

However, I only get (1111, text1), (3333, text3), (5555, text5), and so on.

- mfg_2018

5 Answers

2

re.findall(r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)', rawtxt, re.DOTALL)

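findall returns only the groups. Use a lookahead to stop the non-greedy quantifier. It is also advisable to write the regex as a raw string (the r prefix). Without the lookahead, the next CustomerID: is consumed by the current match, so the match that should start there is never found. A lookahead does not consume any of the string, which removes the overlap between matches.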
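To make the difference concrete, here is a minimal sketch that runs both patterns on the sample data from the question. The variable names consuming and lookahead are only illustrative, and rawtxt is rebuilt here on the assumption that it looks like the sample shown in the question.

```python
import re

# Sample data rebuilt from the question (assumed layout: blank lines between entries).
rawtxt = "\n\n".join(
    f"CustomerID:{cid},\n\ntext{n}"
    for n, cid in enumerate(["1111", "2222", "3333", "4444", "5555"], start=1)
)

# The original pattern consumes the next "CustomerID:" marker,
# so the search resumes after it and every second entry is skipped.
consuming = re.findall(r'CustomerID:(\d+)(.*?)CustomerID:', rawtxt, re.DOTALL)

# The lookahead only checks for the next marker without consuming it,
# so the following match can still start at that marker.
lookahead = re.findall(
    r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)', rawtxt, re.DOTALL
)

print([cid for cid, _ in consuming])
print(lookahead)
```

On this five-entry sample the consuming pattern finds only the entries that are followed by another marker it can swallow, while the lookahead version returns one tuple per customer.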

- vks

What does re.DOTALL do? - SIslam

1

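@SIslam. By default, . does not match \n (newline). With this flag it does, so .* can now match across multiple lines. - vks

Ah! The output here is the same with or without re.DOTALL! - SIslam

@SIslam That's because we are using \s to cover the newlines (\n). - vks

So do we need re.DOTALL in this case? Thanks. - SIslam

@SIslam Yes, if text3 were something like test 3 asas \n asddsa. - vks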
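The point of this exchange can be checked with a small sketch. The two-entry input below is made up for illustration (it is not the asker's data); its first entry has a text that itself contains a newline, which is the case where re.DOTALL starts to matter.

```python
import re

# Hypothetical input: the first text spans two lines.
rawtxt = "CustomerID:1111,\n\nline one\nline two\n\nCustomerID:2222,\n\ntext2"

pattern = r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)'

# Without the flag, "." cannot cross the newline inside the first text,
# so no match can be completed for that entry.
without_flag = re.findall(pattern, rawtxt)

# With re.DOTALL, "." also matches "\n", so the whole two-line text is captured.
with_flag = re.findall(pattern, rawtxt, re.DOTALL)

print(without_flag)
print(with_flag)
```

With single-line texts the \s* parts absorb all the newlines, which is why the flag made no visible difference on the original sample.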

1

Given:

>>> txt='''\
... CustomerID:1111,
... 
... text1
... 
... CustomerID:2222,
... 
... text2
... 
... CustomerID:3333,
... 
... text3
... 
... CustomerID:4444,
... 
... text4
... 
... CustomerID:5555,
... 
... text5'''

You can do:

>>> [re.findall(r'^(\d+),\s+(.+)', block) for block in txt.split('CustomerID:') if block]
[[('1111', 'text1')], [('2222', 'text2')], [('3333', 'text3')], [('4444', 'text4')], [('5555', 'text5')]]

If the text spans multiple lines, you can do:

>>> [re.findall(r'^(\d+),\s+([\s\S]+)', block) for block in txt.split('CustomerID:') if block]
[[('1111', 'text1\n\n')], [('2222', 'text2\n\n')], [('3333', 'text3\n\n')], [('4444', 'text4\n\n')], [('5555', 'text5')]]
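The results above are nested one list per block and keep trailing newlines. If a flat list of tuples is wanted, one possible variation (a sketch, not part of the original answer) uses re.match on each block and strips the captured text. The name pairs is only illustrative.

```python
import re

txt = "\n\n".join(
    f"CustomerID:{cid},\n\ntext{n}"
    for n, cid in enumerate(["1111", "2222", "3333", "4444", "5555"], start=1)
)

pairs = []
for block in txt.split('CustomerID:'):
    # Each non-empty block looks like "1111,\n\ntext1\n\n".
    m = re.match(r'(\d+),\s+([\s\S]+)', block)
    if m:  # skips the empty string that precedes the first marker
        pairs.append((m.group(1), m.group(2).strip()))

print(pairs)
```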

- dawg

1

Another simple example might be:

>>> re.findall(r'(\b\d+\b),\s*(\btext\d+\b)', rawtxt)
[('1111', 'text1'), ('2222', 'text2'), ('3333', 'text3'), ('4444', 'text4'), ('5555', 'text5')]

Edit: if needed (for messier data), use filter

list(filter(lambda x: len(x)>1, re.findall(r'(\b\d+\b),\s*(\btext\d+\b)', rawtxt)))

- SIslam

0

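re.findall is not the best tool here, because regex quantifiers are greedy by default and will try to swallow all the subsequent customer IDs along with the text.

A tool that was actually made for this is re.split. The parentheses capture the ID number and filter out "CustomerID". The second line stitches the tokens into the tuples you want: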

toks = re.split(r'CustomerID:(\d{4}),\n', t)
list(zip(toks[1::2], toks[2::2]))

Edit: corrected the indices in zip(). Corrected sample output:

[('1111', 'text1\n'),
 ('2222', 'text2\n'),
 ('3333', 'text3\n'),
 ('4444', 'text4\n'),
 ('5555', 'text5')]
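As a self-contained variation on the same re.split idea, the sketch below relaxes the pattern to any number of digits and strips the surrounding whitespace from each text, so it also tolerates the blank lines in the question's sample. These changes are assumptions about the input layout, not part of the original answer.

```python
import re

t = "\n\n".join(
    f"CustomerID:{cid},\n\ntext{n}"
    for n, cid in enumerate(["1111", "2222", "3333", "4444", "5555"], start=1)
)

# Because the pattern has a capturing group, re.split keeps the IDs in the result:
# ['', '1111', '\n\ntext1\n\n', '2222', ...]
toks = re.split(r'CustomerID:(\d+),', t)

# Odd indices hold the IDs, the following even indices hold the texts.
pairs = [(cid, body.strip()) for cid, body in zip(toks[1::2], toks[2::2])]

print(pairs)
```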

- Muposat

This is not what the OP wants; your expression returns [('1111', '2222'), ('2222', '3333'), ('3333', '4444'), ('4444', '5555')]. - SIslam

@SIslam... I have corrected it to toks[2::2] instead of toks[3::2]. - Muposat

- Remi Guan · Accepted Answer

Actually, you don't need regex here:

>>> with open('file') as f:
...     rawtxt = [i.strip() for i in f if i != '\n']
...     
>>> l = []
>>> for i in [rawtxt[i:i+2] for i in range(0, len(rawtxt), 2)]:
...     l.append((i[0][11:-1], i[1]))
...     
... 
>>> l
[('1111', 'text1'), ('2222', 'text2'), ('3333', 'text3'), ('4444', 'text4'), ('5555', 'text5')]
>>>

If you need 1111, 2222, etc. as integers, use l.append((int(i[0][11:-1]), i[1])) instead of l.append((i[0][11:-1], i[1])).
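The same regex-free idea can be sketched without a file on disk, pairing header and text lines with zip and converting the IDs to int. This assumes, as the answer above does, that each text occupies exactly one non-blank line. The names prefix, lines and pairs are only illustrative.

```python
rawtxt = "\n\n".join(
    f"CustomerID:{cid},\n\ntext{n}"
    for n, cid in enumerate(["1111", "2222", "3333", "4444", "5555"], start=1)
)

# Drop blank lines; what remains alternates header line, text line.
lines = [line.strip() for line in rawtxt.splitlines() if line.strip()]

prefix = 'CustomerID:'
pairs = [
    # Cut off the prefix and the trailing comma, then convert the ID to int.
    (int(head[len(prefix):].rstrip(',')), body)
    for head, body in zip(lines[0::2], lines[1::2])
]

print(pairs)
```

Slicing by len(prefix) avoids hard-coding the offset 11, which would otherwise have to be updated if the label ever changed.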