Trim a string (on both the left and right sides) to the nearest word or sentence

3
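I'm writing a function that finds the text surrounding one or more occurrences of the same string in a large body of text. So far so good, it's just not pretty.
My problem is how to trim the resulting string to the nearest sentence or whole word without leaving any characters dangling. The trim distance is based on the number of words on either side of the keyword.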
keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

with 1 word distance (either side of key word) it should result in:
2 occurrences found
"This marble is..."
"...this marble. Kwoo-oooo-waaa!"

with 2 word distance:
2 occurrences found
"Right. This marble is as..."
"...as this marble. Kwoo-oooo-waaa! Ahhhk!"

What I currently get is based on character distance rather than word distance:

2 occurrences found
"ght. This marble is as sli"
"y as this marble. Kwoo-ooo"

A regex, however, could split it at the nearest whole word or sentence. Is that the most Pythonic way to achieve this? Here is what I have so far:

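import re

def trim_string(s, num):
  # Keep the first `num` characters plus the rest of the word they end in.
  # Python backreferences are written \1 (not $1), and `num` must be
  # interpolated into the pattern explicitly.
  trimmed = re.sub(r"^(.{%d}[^\s]*).*" % num, r"\1", s) # will only trim from left and not very well
  #^(.*)(marble)(.+) # only finds second occurrence???

  return trimmed

s = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
t = "Marble"


if t.lower() in s.lower():

  count = s.lower().count(t.lower())
  print("%s occurrences of %s" % (count, t))

  original_s = s

  for i in range(0, count):
    # Search case-insensitively, consistent with the count above
    idx = s.lower().index(t.lower())
    # print(idx)

    dist = 10
    start = max(idx - dist, 0)  # a negative start would wrap around to the end
    end = len(t) + idx + dist
    a = s[start:end]

    print(a)
    print(trim_string(a, 5))

    # Continue searching after the current occurrence
    s = s[idx + len(t):]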

Thanks.


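How do you want to handle whitespace? If you only expect single spaces between words, you can call .split() on the input text, use list indexing to pick out the subset, and join the words back into a string. If that works for you, it avoids regular expressions altogether. - Matt R. Wilson
If you mean leading or trailing whitespace in the results, I don't need it. The ellipses (...) are included to show that the string has been cut off at that point. - Ghoul Fool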
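
The split-and-index idea from the comment above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name `context_windows`, the set of punctuation stripped before comparing, and the rule that an ellipsis is omitted when the cut falls on a sentence boundary are all hypothetical choices made to reproduce the output asked for in the question.

```python
def context_windows(text, keyword, distance):
    """Return snippets with up to `distance` words on each side of each keyword hit."""
    words = text.split()
    sentence_end = (".", "!", "?")
    snippets = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring punctuation attached to the word.
        if word.strip(".,;:!?\"'").lower() != keyword.lower():
            continue
        start = max(i - distance, 0)
        end = min(i + distance + 1, len(words))
        snippet = " ".join(words[start:end])
        # Add an ellipsis only where the cut falls in the middle of a sentence.
        if start > 0 and not words[start - 1].endswith(sentence_end):
            snippet = "..." + snippet
        if end < len(words) and not words[end - 1].endswith(sentence_end):
            snippet += "..."
        snippets.append(snippet)
    return snippets


text = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
for snippet in context_windows(text, "marble", 1):
    print(snippet)
```

Because the window is clamped with `max` and `min`, a keyword near the start or end of the text still produces a (shorter) snippet instead of being skipped.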
4 Answers

3
You can use this regex to match up to N non-whitespace substrings before and after marble:

For 2 words:

(?:(?:\S+\s+){0,2})?\bmarble\b\S*(?:\s+\S+){0,2}

Regex breakdown:
(?:(?:\S+\s+){0,2})? # match up to 2 non-whitespace strings before the keyword (optional)
\bmarble\b\S*        # match the word "marble" followed by zero or more non-space characters
(?:\s+\S+){0,2}      # match up to 2 non-whitespace strings after the keyword

RegEx Demo

Regex for 1 word:

(?:(?:\S+\s+){0,1})?\bmarble\b\S*(?:\s+\S+){0,1}
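
As a rough sketch of how this pattern might be used from Python, the word count and keyword can be interpolated into the regex and applied with `re.finditer`. The helper name `find_context` and the case-insensitive flag are assumptions made for this illustration, not part of the answer above.

```python
import re


def find_context(text, keyword, n):
    """Return each keyword match together with up to n words on either side."""
    pattern = re.compile(
        # Doubled braces produce literal { } in the f-string, e.g. {0,2}.
        rf"(?:\S+\s+){{0,{n}}}\b{re.escape(keyword)}\b\S*(?:\s+\S+){{0,{n}}}",
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]


text = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
for snippet in find_context(text, "marble", 2):
    print(snippet)
```

Note that regex matches do not overlap, so if two occurrences of the keyword are closer together than n words, the second match gets less leading context than requested (or is absorbed into the first match).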

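The regex also captures words like marbleilz; you should use \W* instead of \S* after the word. - Dror Av.
There is still a bug if there is no space after the ., for example: https://regex101.com/r/8HAdYg/3 - Dror Av.
1
It could be: (?:(?:\S+\s+){0,2})?\bmarble\b\S?(?:\s*\S+){0,2}, but we don't know whether missing spaces between words is a real use case. Only the OP can tell us. - anubhava
Fair enough :) - Dror Av.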

2

If you ignore punctuation, you can do this without re:

import itertools as it
import string

def nwise(iterable, n):
    ts = it.tee(iterable, n)
    for c, t in enumerate(ts):
        next(it.islice(t, c, c), None)
    return zip(*ts)

def grep(s, k, n):
    m = str.maketrans('', '', string.punctuation)
    return [' '.join(x) for x in nwise(s.split(), n*2+1) if x[n].translate(m).lower() == k]

In []:
keyword = "marble"
sentence = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
print('...\n...'.join(grep(sentence, keyword, n=2)))

Out[]:
Right. This marble is as...
...as this marble. Kwoo-oooo-waaa! Ahhhk!

In []:
print('...\n...'.join(grep(sentence, keyword, n=1)))

Out[]:
This marble is...
...this marble. Kwoo-oooo-waaa!
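
To make the sliding-window step easier to follow, here is a small sketch of what `nwise` yields on its own, using the same definition as above. The sample inputs are made up for illustration. It also shows one limitation worth keeping in mind: a keyword closer than n words to either end of the text never sits in the middle of a full window, so `grep` would not report it.

```python
import itertools as it


def nwise(iterable, n):
    # Make n independent iterators and advance the c-th one by c items,
    # so that zipping them yields overlapping windows of length n.
    ts = it.tee(iterable, n)
    for c, t in enumerate(ts):
        next(it.islice(t, c, c), None)
    return zip(*ts)


windows = list(nwise("a b c d e".split(), 3))
print(windows)

# A keyword at the very start is never the middle element of a 3-word window.
edge_words = "marble is here".split()
centered = [w for w in nwise(edge_words, 3) if w[1] == "marble"]
print(centered)
```

If matches near the edges matter, one option is to pad the word list with n empty strings on each side before windowing and drop the empty strings when joining.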

1
Using the ngrams() function from this answer, here is an approach that simply collects all n-grams and then selects the ones that have the keyword in the middle.
def get_ngrams(document, n):
    words = document.split(' ')
    ngrams = []
    for i in range(len(words)-n+1):
        ngrams.append(words[i:i+n])
    return ngrams

keyword = "marble"
string = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

n = 3
pos = int(n/2 - .5)
# ignore punctuation by matching the middle word up to the number of chars in keyword
result = [ng for ng in get_ngrams(string, n) if ng[pos][:len(keyword)] == keyword]
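
The snippet above leaves `result` as a list of word lists. A possible way to wrap it into a reusable function that takes a word distance and returns joined strings is sketched below; the name `keyword_ngrams` and the `distance` parameter are hypothetical additions, and the prefix comparison is kept exactly as in the answer (so it would also accept words such as "marbles").

```python
def get_ngrams(document, n):
    words = document.split(" ")
    return [words[i:i + n] for i in range(len(words) - n + 1)]


def keyword_ngrams(text, keyword, distance):
    # A window of 2 * distance + 1 words has its middle word at index `distance`.
    n = 2 * distance + 1
    return [
        " ".join(ng)
        for ng in get_ngrams(text, n)
        # Ignore trailing punctuation by comparing only the first len(keyword) chars.
        if ng[distance][:len(keyword)] == keyword
    ]


text = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"
print(keyword_ngrams(text, "marble", 1))
print(keyword_ngrams(text, "marble", 2))
```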

0

more_itertools.adjacent¹ is a tool for probing neighboring elements.

import operator as op
import itertools as it

import more_itertools as mit


# Given
keyword = "marble"
iterable = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!"

Code

words = iterable.split(" ")
pred = lambda x: x in (keyword, "".join([keyword, "."]))

neighbors = mit.adjacent(pred, words, distance=1)    
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]
# Out: ['This marble is', 'this marble. Kwoo-oooo-waaa!']

neighbors = mit.adjacent(pred, words, distance=2)
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]
# Out: ['Right. This marble is as', 'as this marble. Kwoo-oooo-waaa! Ahhhk!']

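The OP can adjust the final output of these results as desired.


Details

The given string is split into an iterable of words. A simple predicate² is defined that returns True if a word is the keyword (or the keyword with a trailing period).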

words = iterable.split(" ")
pred = lambda x: x in (keyword, "".join([keyword, "."]))

neighbors = mit.adjacent(pred, words, distance=1)
list(neighbors)

The more_itertools.adjacent tool yields (bool, word) tuples, shown here as a list:

Output

[(False, 'Right.'),
 (True, 'This'),
 (True, 'marble'),
 (True, 'is'),
 (False, 'as'),
 (False, 'slippery'),
 (False, 'as'),
 (True, 'this'),
 (True, 'marble.'),
 (True, 'Kwoo-oooo-waaa!'),
 (False, 'Ahhhk!')]
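
For readers without more_itertools installed, the flagging behavior shown in this output can be approximated with the standard library alone. The function below is a simplified stand-in written for this illustration, not the library's actual implementation; the name `flag_neighbors` is hypothetical, and unlike the real tool it builds the full list eagerly.

```python
def flag_neighbors(predicate, iterable, distance=1):
    """Pair each item with True if it lies within `distance` of an item matching predicate."""
    items = list(iterable)
    flagged = set()
    for i, item in enumerate(items):
        if predicate(item):
            # Mark the hit itself and its neighbors, clamped to the list bounds.
            flagged.update(range(max(0, i - distance), min(len(items), i + distance + 1)))
    return [(i in flagged, item) for i, item in enumerate(items)]


keyword = "marble"
words = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!".split(" ")
pred = lambda x: x in (keyword, keyword + ".")

for flag, word in flag_neighbors(pred, words, distance=1):
    print(flag, word)
```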

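The first element of each tuple is True for every occurrence of the keyword and for its neighbors within a distance of 1. We use this boolean together with itertools.groupby to find and group runs of consecutive neighboring items. For example: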

neighbors = mit.adjacent(pred, words, distance=1)
[(k, list(g)) for k, g in it.groupby(neighbors, op.itemgetter(0))]

Output

[(False, [(False, 'Right.')]),
 (True, [(True, 'This'), (True, 'marble'), (True, 'is')]),
 (False, [(False, 'as'), (False, 'slippery'), (False, 'as')]),
 (True, [(True, 'this'), (True, 'marble.'), (True, 'Kwoo-oooo-waaa!')]),
 (False, [(False, 'Ahhhk!')])]

Finally, we apply a condition to filter out the False groups and join the strings together.
neighbors = mit.adjacent(pred, words, distance=1)    
[" ".join([items[1] for items in g]) for k, g in it.groupby(neighbors, op.itemgetter(0)) if k]

Output

['This marble is', 'this marble. Kwoo-oooo-waaa!']

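¹ more_itertools is a third-party library that implements many useful tools, including the itertools recipes.

² Note that a stronger predicate can certainly be written to handle keywords with any punctuation, but this one was used for simplicity.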
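
One way such a stronger predicate might look is sketched below. It strips any leading or trailing ASCII punctuation and compares case-insensitively; the factory name `make_predicate` is hypothetical, and whether case should be ignored is an assumption that depends on the use case.

```python
import string


def make_predicate(keyword):
    """Build a predicate that matches the keyword regardless of surrounding punctuation or case."""
    target = keyword.lower()

    def pred(word):
        # str.strip removes punctuation only from the ends, so hyphenated
        # words such as "Kwoo-oooo-waaa!" keep their inner hyphens.
        return word.strip(string.punctuation).lower() == target

    return pred


pred = make_predicate("marble")
words = "Right. This marble is as slippery as this marble. Kwoo-oooo-waaa! Ahhhk!".split(" ")
print([w for w in words if pred(w)])
```

The resulting `pred` can be passed to the adjacent-based code in place of the simple lambda.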

