Using re.sub to remove a given word and everything before it

Question

8

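I have a string like "Blah blah blah, Updated: Aug. 23, 2012" and I want to use a regex to extract only the date, Aug. 23, 2012. I found a similar post online, "regex to remove all text before a character", but it did not work when I tried it: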

date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted_date = re.sub('^[^Updated]*',"", date_div)

How can I remove "Updated" and everything before it, keeping only "Aug. 23, 2012"?

Thanks!

- maudulus

3 Answers

6

With a regex, you can use one of two patterns, depending on which occurrence of the word you want to cut at:

# Remove all up to the first occurrence of the word including it (non-greedy):
^.*?word
# Remove all up to the last occurrence of the word including it (greedy):
^.*word

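See the non-greedy regex demo and the greedy regex demo. ^ matches the start of the string. .*? matches any 0 or more characters, as few as possible, while .* matches as many as possible (note the re.DOTALL flag, which lets . match newlines as well). Then word matches and consumes the word, that is, adds it to the match and advances the regex index.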

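Note the use of re.escape(up_to_word): if your up_to_word contains anything other than alphanumeric and underscore characters, it is safer to pass it through re.escape so that special characters such as (, [, or ? cannot prevent the regex from finding a valid match.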

See the Python demo:

import re

date_div = "Blah blah\nblah, Updated: Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019"

up_to_word = "Updated:"
rx_to_first = r'^.*?{}'.format(re.escape(up_to_word))
rx_to_last = r'^.*{}'.format(re.escape(up_to_word))

print("Remove all up to the first occurrence of the word including it:")
print(re.sub(rx_to_first, '', date_div, flags=re.DOTALL).strip())
print("Remove all up to the last occurrence of the word including it:")
print(re.sub(rx_to_last, '', date_div, flags=re.DOTALL).strip())

Output:

Remove all up to the first occurrence of the word including it:
Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019
Remove all up to the last occurrence of the word including it:
Feb. 13, 2019
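
To illustrate the re.escape note above, here is a minimal sketch comparing an escaped and an unescaped marker. The helper remove_up_to_first, its escape parameter, and the sample string with a bracketed "[Updated]" marker are all made up for this illustration and are not part of the original answer.

```python
import re

def remove_up_to_first(text, word, escape=True):
    """Remove everything up to and including the first occurrence of word.

    Hypothetical helper for illustration; escape=False shows what happens
    when the word is dropped into the pattern unescaped.
    """
    literal = re.escape(word) if escape else word
    return re.sub(r'^.*?' + literal, '', text, count=1, flags=re.DOTALL).strip()

sample = "Blah [Updated] Aug. 23, 2012"

# Escaped: the brackets are matched literally.
escaped_result = remove_up_to_first(sample, "[Updated]")
print(escaped_result)    # Aug. 23, 2012

# Unescaped: [Updated] is read as a character class, so the match
# ends at the first "a" in "Blah".
unescaped_result = remove_up_to_first(sample, "[Updated]", escape=False)
print(unescaped_result)  # h [Updated] Aug. 23, 2012
```

Without escaping, the pattern is still valid, so no error is raised; it simply matches something other than the intended text.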

- Wiktor Stribiżew

5

You can use a lookahead:

import re
date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted_date = re.sub('^(.*)(?=Updated)',"", date_div)
print(extracted_date)

Output:

Updated: Aug. 23, 2012

Edit
If MattDMo is right in the comment below and you also want to remove "Updated: " itself, you can do this:

extracted_date = re.sub('^(.*Updated: )',"", date_div)
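
For comparison, here is a short sketch of why the pattern from the question does not do what was intended: [^Updated] is a negated character class rather than the word "Updated". The variable names attempt and fixed, and the trailing \s* in the second pattern, are choices made for this illustration.

```python
import re

date_div = "Blah blah blah, Updated: Aug. 23, 2012"

# [^Updated] means "any single character other than U, p, d, a, t, e",
# so the match covers only "Bl" and stops at the first "a".
attempt = re.sub(r'^[^Updated]*', '', date_div)
print(attempt)  # ah blah blah, Updated: Aug. 23, 2012

# Matching the literal text removes the marker and everything before it;
# \s* also drops the whitespace that follows the colon.
fixed = re.sub(r'^.*?Updated:\s*', '', date_div)
print(fixed)    # Aug. 23, 2012
```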

- Nir Alfasi

1

I think the OP wants to remove the word "Updated:" as well. - MattDMo

Works great. Thanks :) - maudulus

- mkriheli · Accepted Answer

In this case you can do it without a regex, for example:

>>> date_div = "Blah blah blah, Updated: Aug. 23, 2012"
>>> date_div.split('Updated: ')
['Blah blah blah, ', 'Aug. 23, 2012']
>>> date_div.split('Updated: ')[-1]
'Aug. 23, 2012'
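
One thing to keep in mind with the split approach is that, when the marker is missing, split(...)[-1] returns the whole input string. A possible variant, sketched here with str.partition, makes that case explicit. The helper name text_after and the choice to return None are assumptions made for this example, not part of the original answer.

```python
def text_after(text, marker):
    """Return the text after the first occurrence of marker, or None if absent.

    Hypothetical helper: str.partition returns an empty separator when the
    marker is not found, which is how the missing case is detected.
    """
    _, found, rest = text.partition(marker)
    return rest if found else None

date_div = "Blah blah blah, Updated: Aug. 23, 2012"
extracted = text_after(date_div, 'Updated: ')
print(extracted)  # Aug. 23, 2012

missing = text_after("No date here", 'Updated: ')
print(missing)    # None
```

Note that partition cuts at the first occurrence of the marker, whereas split(...)[-1] keeps what follows the last one; str.rpartition would cut at the last occurrence.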