With regular expressions, you can use one of two patterns, depending on which occurrence of the word you want to cut at:
# Remove all up to the first occurrence of the word including it (non-greedy):
^.*?word
# Remove all up to the last occurrence of the word including it (greedy):
^.*word
See the non-greedy regex demo and the greedy regex demo.
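`^` matches the start of the string. `.*?` matches any zero or more characters, as few as possible (`.*` matches as many as possible). Note the `re.DOTALL` flag, which lets `.` match newlines as well. Then `word` matches and consumes the word, that is, adds it to the match and advances the regex index.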
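To make the effect of `re.DOTALL` and of the lazy versus greedy quantifier concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not part of the original demo.

```python
import re

# Illustrative input: the word appears twice, after a line break.
text = "line one\nline two word A word B"

# Without re.DOTALL, `.` cannot cross the "\n", and `^` only matches at the
# very start of the string, so the pattern finds nothing to remove.
no_dotall = re.sub(r'^.*?word', '', text)

# With re.DOTALL, `.` matches the newline too.
lazy = re.sub(r'^.*?word', '', text, flags=re.DOTALL)   # stops at the first "word"
greedy = re.sub(r'^.*word', '', text, flags=re.DOTALL)  # runs on to the last "word"

print(repr(no_dotall))
print(repr(lazy))
print(repr(greedy))
```

Here the first call leaves the text unchanged, while the lazy and greedy versions keep everything after the first and the last `word` respectively.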
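Note the use of `re.escape(up_to_word)`: if your `up_to_word` contains anything other than alphanumeric and underscore characters, it is safer to pass it through `re.escape`, so that special characters such as `(`, `[`, `?`, etc. cannot prevent the regex from finding a valid match.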
See the Python demo:
import re
date_div = "Blah blah\nblah, Updated: Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019"
up_to_word = "Updated:"
rx_to_first = r'^.*?{}'.format(re.escape(up_to_word))
rx_to_last = r'^.*{}'.format(re.escape(up_to_word))
print("Remove all up to the first occurrence of the word including it:")
print(re.sub(rx_to_first, '', date_div, flags=re.DOTALL).strip())
print("Remove all up to the last occurrence of the word including it:")
print(re.sub(rx_to_last, '', date_div, flags=re.DOTALL).strip())
Output:
Remove all up to the first occurrence of the word including it:
Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019
Remove all up to the last occurrence of the word including it:
Feb. 13, 2019
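If you need this in more than one place, the two patterns can be folded into a small helper. This is only a sketch: the function name `remove_up_to` and its `last` parameter are made up here and do not come from any library.

```python
import re

def remove_up_to(text, word, last=False):
    """Drop everything up to and including `word`.

    By default the cut is made at the first occurrence (lazy `.*?`);
    with last=True it is made at the last occurrence (greedy `.*`).
    If `word` is absent, the text is returned unchanged apart from .strip().
    """
    quantifier = '.*' if last else '.*?'
    pattern = '^' + quantifier + re.escape(word)
    return re.sub(pattern, '', text, flags=re.DOTALL).strip()

date_div = "Blah blah\nblah, Updated: Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019"
first_cut = remove_up_to(date_div, "Updated:")
last_cut = remove_up_to(date_div, "Updated:", last=True)
print(first_cut)
print(last_cut)
```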