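I'm writing a function that finds the strings adjacent to one or more occurrences of the same string within a large block of text. It's going fine so far, it's just not pretty.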
My problem is trimming the resulting string to the nearest sentence or whole word, without leaving any characters dangling. The trim distance is based on the number of words on either side of the keyword.
keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
With a distance of 1 word (on either side of the keyword), it should result in:
2 occurrences found
"This marble is..."
"...this marble. Kwoo-oooo-waaa!"
With a distance of 2 words:
2 occurrences found
"Right. This marble is as..."
"...as this marble. Kwoo-oooo-waaa! Ahhhk!"
What I currently get is based on letter distance rather than word distance:
2 occurrences found
"ght. This marble is as sli"
"y as this marble. Kwoo-ooo"
However, a regex could split it to the nearest whole word or sentence. Is that the most Pythonic way of achieving this? Here is what I have so far:
import re

def trim_string(s, num):
    # Keep the first `num` characters plus the rest of the word they end in.
    trimmed = re.sub(r"^(.{%d}[^\s]*).*" % num, r"\1", s)  # will only trim from left and not very well
    #^(.*)(marble)(.+) # only finds second occurrence???
    return trimmed

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
t = "Marble"

if t.lower() in s.lower():
    count = s.lower().count(t.lower())
    print("%s occurrences of %s" % (count, t))
    original_s = s
    for i in range(count):
        idx = s.lower().index(t.lower())
        # print(idx)
        dist = 10
        start = max(0, idx - dist)  # avoid a negative start wrapping around
        end = len(t) + idx + dist
        a = s[start:end]
        print(a)
        print(trim_string(a, 5))
        s = s[idx + len(t):]
Thanks.
.split(), then use list indexing to work on a subset and join the words back into a string. If that works for you, it avoids the need for regular expressions. - Matt R. Wilson
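One possible reading of the `.split()` suggestion, as a rough sketch rather than a definitive answer: split the text on whitespace, find the word positions that match the keyword, slice `distance` words on either side, and join them back together. The function name `keyword_contexts` and the constant `SENTENCE_END` are hypothetical names chosen for illustration. The ellipsis rule is an assumption inferred from the expected output in the question: a leading "..." is added only when the snippet does not start at a sentence boundary, and a trailing "..." only when it does not end on one.

```python
import string

# Assumption: these characters mark the end of a sentence.
SENTENCE_END = (".", "!", "?")

def keyword_contexts(text, keyword, distance):
    """Return one snippet per keyword occurrence, with `distance` whole words on each side."""
    words = text.split()
    key = keyword.lower()
    snippets = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation so "marble." still matches "marble".
        if word.strip(string.punctuation).lower() != key:
            continue
        start = max(0, i - distance)
        end = min(len(words), i + distance + 1)
        snippet = " ".join(words[start:end])
        # Leading ellipsis only if the snippet begins mid-sentence.
        if start > 0 and not words[start - 1].endswith(SENTENCE_END):
            snippet = "..." + snippet
        # Trailing ellipsis only if the snippet stops mid-sentence.
        if not words[end - 1].endswith(SENTENCE_END):
            snippet += "..."
        snippets.append(snippet)
    return snippets

if __name__ == "__main__":
    s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
    for dist in (1, 2):
        found = keyword_contexts(s, "Marble", dist)
        print("%d occurrences found" % len(found))
        for snippet in found:
            print('"%s"' % snippet)
```

On the sample string from the question this gives the two snippets listed there for a distance of 1 and for a distance of 2. Note that `split()` collapses runs of whitespace, so the snippets are rebuilt with single spaces, and hyphenated tokens such as "Kwoo-oooo-waaa!" count as one word.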