在字符串中查找重复的字符组合

9

我有一个字符串,它包含一个非常长的句子,没有空格。

mystring = "abcdthisisatextwithsampletextforasampleabcd"

我希望找到所有包含至少4个字符的重复子字符串。

因此,我想要实现以下功能:

'text' 2 times
'sample' 2 times
'abcd' 2 times
作为在“mystring”中可以找到两次的“abcd”,“text”和“sample”的子字符串,它们被认为是具有超过4个字符长度的匹配子字符串。重要的是我正在寻找重复的子字符串,而不是仅查找现有的英文单词。 我发现的答案有助于在带有空格的文本中查找重复项,但是我找不到一种适当的资源来覆盖没有空格和空格的情况。最有效的方法是如何完成呢?

7
请勿发布面试问题! - Hari_pb
@Harry_pb 这个问题有什么问题吗? - rihekopo
2
仅仅想看到你尝试解决问题的过程,而不是问题本身。顺便说一句,我没有给你点踩,但我相信每个人都应该展示他们为了得到正确答案所做的尝试,而不仅仅是提出问题。 - Hari_pb
3
很多人花了很大的力气来解决你的面试问题,所以你有很大的机会因为他们而被雇用。请不要忘记给予他们应得的奖励,因为他们做了所有的工作。 - Dominique
1
@Dominique 我怀疑面试官们没有给予OP三个月的时间来回答这个问题;-) - snakecharmerb
11个回答

阿里云服务器只需要99元/年,新老用户同享,点击查看详情
13

让我们一步一步地来。您需要处理以下几个子任务:

  1. 识别长度为4或更长的所有子字符串。
  2. 计算这些子字符串的出现次数。
  3. 过滤掉出现2次或更多次的所有子字符串。

实际上,您可以将所有任务放在几个语句中。为了易于理解,最好逐个任务来处理。

以下示例都使用

mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4

1. 给定长度的子字符串

您可以通过切片轻松获得子字符串 - 例如,mystring[4:4+6]将给您长度为6的从位置4开始的子字符串:'thisis'。更一般地,您需要形如mystring[start:start+length]的子字符串。

那么,startlength需要什么值?

  • start必须...
    • 包括第一个字符,以覆盖所有子字符串:start in range(0, ...)
    • 不映射到短的子字符串,因此可以在结尾前停止min_length个字符:start in range(..., len(mystring) - min_length + 1)
  • length必须...
    • 覆盖长度为4的最短子字符串:length in range(min_length, ...)
    • 不超过索引i之后的剩余字符串:length in range(..., len(mystring) - i + 1))

+1的项来自将长度(≥1)转换为索引(≥0)。您可以将所有内容放入单个表达式中:

substrings = [
    mystring[i:i+j]
    for i in range(0, len(mystring) - min_length + 1)
    for j in range(min_length, len(mystring) - i + 1)
]

2. 计算子字符串数量

显然,你想为每个子字符串保持一个计数。为每个特定对象保留 任何东西 都是 dict 设计的初衷。所以你应该在 dict 中使用子字符串作为键,计数作为值。本质上,这对应于以下内容:

counts = {}
for substring in substrings:
    try:  # increase count for existing keys, set for new keys
         counts[substring] += 1
    except KeyError:
         counts[substring] = 1

您可以将您的子字符串简单地输入到collections.Counter中,它会生成类似于上面的内容。

>>> counts = collections.Counter(substrings)
>>> print(counts)
Counter({'abcd': 2, 'abcdt': 1, 'abcdth': 1, 'abcdthi': 1, 'abcdthis': 1, ...})

注意重复出现的'abcd'如何映射到数量为2。

3. 过滤重复的子字符串

现在您有了每个子字符串及其计数。您需要删除非重复的子字符串-即其中计数为1的子字符串。

Python提供了几种构造用于过滤,具体取决于您想要的输出。如果counts是一个常规的dict,这些也可以工作:

>>> list(filter(lambda key: counts[key] > 1, counts))
['abcd', 'text', 'samp', 'sampl', 'sample', 'ampl', 'ample', 'mple']
>>> {key: value for key, value in counts.items() if value > 1}
{'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}

使用Python基本数据类型

Python提供了一些基本数据类型,让你能够更高效地完成这个任务。

  1. 使用生成器构建子字符串。生成器会即时生成其成员,所以你永远不需要将它们全部存储在内存中。在你的应用场景中,可以使用生成器表达式:

 substrings = (
     mystring[i:i+j]
     for i in range(0, len(mystring) - min_length + 1)
     for j in range(min_length, len(mystring) - i + 1)
 )
使用现有的计数器实现。Python提供了一个类似于dict的容器,用于计算其成员数量:collections.Counter可以直接处理您的子字符串生成器。特别是在较新的版本中,这种方法更加高效。
 counts = collections.Counter(substrings)
您可以利用 Python 的惰性过滤器,仅检查一个子字符串。内置的 filter 函数或其他生成器表达式可以一次生成一个结果,而无需在内存中存储它们所有的内容。
 for substring in filter(lambda key: counts[key] > 1, counts):
     print(substring, 'occurs', counts[substring], 'times')

1
应该是:for i in range(0, len(mystring) - min_length + 1),否则它将无法获取字符串中的最后一个子字符串--您可以在示例中检查它是否不起作用,因为它无法找到子字符串abcd。 - MariusSiuram
或者,您可以使用滚动哈希来查找给定长度的重复子字符串。 - Anderson Green

6
没人在使用re!现在是利用正则表达式内置模块的时候了;)
import re
寻找所有重复的极长子串 该技术与IT相关。
repeated_ones = set(re.findall(r"(.{4,})(?=.*\1)", mystring))

该匹配项是查找最长的子字符串,该子字符串至少在(无需消耗)一次重复。因此,它会找到所有重复的不相交的子字符串,并仅产生最长的字符串。

查找所有重复的子字符串,包括重叠部分

mystring_overlap = "abcdeabcdzzzzbcde"
# In case we want to match both abcd and bcde
repeated_ones = set()
pos = 0

while True:
    match = re.search(r"(.{4,}).*(\1)+", mystring_overlap[pos:])
    if match:
        repeated_ones.add(match.group(1))
        pos += match.pos + 1
    else:
        break

这可以确保返回所有重复的子字符串,而不仅仅是不相交的子字符串。虽然速度会慢很多,但可以完成工作。

如果你想要除了最长的重复字符串之外,所有的子字符串,那么:

base_repetitions = list(repeated_ones)

for s in base_repetitions:
    for i in range(4, len(s)):
        repeated_ones.add(s[:i])
这将确保对于具有重复的长子字符串,您也拥有较小的子字符串——例如,由 re.search 代码找到的 "sample" 和 "ample"; 但也包括上述代码片段添加的 "samp"、"sampl"、"ampl"。

计算匹配次数

因为(按设计)我们计算的子字符串是不重叠的,所以使用 count 方法是最好的选择:

from __future__ import print_function
for substr in repeated_ones:
    print("'%s': %d times" % (substr, mystring.count(substr)))

结果

查找最大子字符串:

使用问题中原始的mystring

{'abcd', 'text', 'sample'}
使用 mystring_overlap 示例:
{'abcd'}

查找所有子字符串:

使用问题中原始的 mystring

{'abcd', 'ample', 'mple', 'sample', 'text'}

如果我们添加获取所有子字符串的代码,那么我们当然会得到绝对所有的子字符串:

{'abcd', 'ampl', 'ample', 'mple', 'samp', 'sampl', 'sample', 'text'}
使用mystring_overlap示例:
{'abcd', 'bcde'}

未来工作

可以通过以下步骤过滤“查找所有子字符串”的结果:

  • 选择匹配项“A”
  • 检查此匹配项是否为另一个匹配项“B”的子字符串
  • 如果有“B”匹配项,则检查该匹配项“B_n”的计数器
  • 如果“A_n = B_n”,则删除A
  • 返回第一步

不可能出现“A_n < B_n”的情况,因为A比B小(是子字符串),所以必须至少重复相同的次数。

如果“A_n > B_n”,则意味着较小子字符串存在一些额外的匹配项,因此它是一个独特的子字符串,因为它在B未重复的地方重复出现。


4

脚本(必要时在注释中进行解释):

from collections import Counter

mystring = "abcdthisisatextwithsampletextforasampleabcd"
mystring_len = len(mystring)

possible_matches = []
matches = []

# Range `start_index` from 0 to 3 from the left, due to minimum char count of 4
for start_index in range(0, mystring_len-3):
    # Start `end_index` at `start_index+1` and range it throughout the rest of
    # the string
    for end_index in range(start_index+1, mystring_len+1):
        current_string = mystring[start_index:end_index]
        if len(current_string) < 4: continue # Skip this interation, if len < 4
        possible_matches.append(mystring[start_index:end_index])

for possible_match, count in Counter(possible_matches).most_common():
    # Iterate until count is less than or equal to 1 because `Counter`'s
    # `most_common` method lists them in order. Once 1 (or less) is hit, all
    # others are the same or lower.
    if count <= 1: break
    matches.append((possible_match, count))

for match, count in matches:
    print(f'\'{match}\' {count} times')

输出:

'abcd' 2 times
'text' 2 times
'samp' 2 times
'sampl' 2 times
'sample' 2 times
'ampl' 2 times
'ample' 2 times
'mple' 2 times

4

以下是适用于Python3的解决方案:

from collections import Counter

min_str_length = 4
mystring = "abcdthisisatextwithsampletextforasampleabcd"

all_substrings =[mystring[start_index:][:end_index + 1] for start_index in range(len(mystring)) for end_index in range(len(mystring[start_index:]))]
counted_substrings = Counter(all_substrings)
not_counted_final_candidates = [item[0] for item in counted_substrings.most_common() if item[1] > 1 and len(item[0]) >= min_str_length]
counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
print(counted_final_candidates)

奖励:最长字符串

sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in not_counted_final_candidates if substring1!=substring2 and substring1 in substring2    ]
largest_common_string = list(set(not_counted_final_candidates) - set(sub_sub_strings))

一切皆为函数:

from collections import Counter
def get_repeated_strings(input_string, min_str_length = 2, calculate_largest_repeated_string = True ):

    all_substrings = [input_string[start_index:][:end_index + 1]
                      for start_index in range(len(input_string))
                      for end_index in range(len(input_string[start_index:]))]
    counted_substrings = Counter(all_substrings)
    not_counted_final_candidates = [item[0]
                                    for item in counted_substrings.most_common()
                                    if item[1] > 1 and len(item[0]) >= min_str_length]
    counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}

    ### This is just a bit of bonus code for calculating the largest repeating sting 

    if calculate_largest_repeated_string == True:
        sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in
                       not_counted_final_candidates if substring1 != substring2 and substring1 in substring2]
        largest_common_strings = list(set(not_counted_final_candidates) - set(sub_sub_strings))

        return counted_final_candidates, largest_common_strings
    else:
        return counted_final_candidates

例子:

mystring = "abcdthisisatextwithsampletextforasampleabcd"
print(get_repeated_strings(mystring, min_str_length= 4))

输出:

({'abcd': 2, 'text': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'ampl': 2, 'ample': 2, 'mple': 2}, ['abcd', 'text', 'sample'])

4

代码:

pattern = "abcdthisisatextwithsampletextforasampleabcd"

string_more_4 = []
k = 4
while(k <= len(pattern)):
    for i in range(len(pattern)):
        if pattern[i:k+i] not in string_more_4 and len(pattern[i:k+i]) >= 4:
            string_more_4.append( pattern[i:k+i])
    k+=1

for i in string_more_4:
    if pattern.count(i) >= 2:
        print(i + " -> " +  str(pattern.count(i)) + " times")

输出:

abcd -> 2 times
text -> 2 times
samp -> 2 times
ampl -> 2 times
mple -> 2 times
sampl -> 2 times
ample -> 2 times
sample -> 2 times
希望这有所帮助,因为我的代码长度很短,容易理解。干杯!

3

由于我目前不使用Python 3,因此以下内容适用于Python 2。因此,您需要自行将其适应Python 3。

#!python2

# import module
from collections import Counter

# get the indices
def getIndices(length):
    # holds the indices
    specific_range = []; all_sets = []

    # start building the indices
    for i in range(0, length - 2):

        # build a set of indices of a specific range
        for j in range(1, length + 2):
            specific_range.append([j - 1, j + i + 3])

            # append 'specific_range' to 'all_sets', reset 'specific_range'
            if specific_range[j - 1][1] == length:
                all_sets.append(specific_range)
                specific_range = []
                break

    # return all of the calculated indices ranges
    return all_sets

# store search strings
tmplst = []; combos = []; found = []

# string to be searched
mystring = "abcdthisisatextwithsampletextforasampleabcd"
# mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"

# get length of string
length = len(mystring)

# get all of the indices ranges, 4 and greater
all_sets = getIndices(length)

# get the search string combinations
for sublst in all_sets:
    for subsublst in sublst:
        tmplst.append(mystring[subsublst[0]: subsublst[1]])
    combos.append(tmplst)
    tmplst = []

# search for matching string patterns
for sublst in all_sets:
    for subsublst in sublst:
        for sublstitems in combos:
            if mystring[subsublst[0]: subsublst[1]] in sublstitems:
                found.append(mystring[subsublst[0]: subsublst[1]])

# make a dictionary containing the strings and their counts
d1 = Counter(found)

# filter out counts of 2 or more and print them
for k, v in d1.items():
    if v > 1:
        print k, v

3
$ cat test.py

import collections
import sys 


S = "abcdthisisatextwithsampletextforasampleabcd"


def find(s, min_length=4):
    """ 
    Find repeated character sequences in a provided string.

    Arguments:
    s -- the string to be searched
    min_length -- the minimum length of the sequences to be found
    """
    counter = collections.defaultdict(int)
    # A repeated sequence can't be longer than half the length of s
    sequence_length = len(s) // 2
    # populate counter with all possible sequences
    while sequence_length >= min_length:
        # Iterate over the string until the number of remaining characters is 
        # fewer than the length of the current sequence.
        for i, x in enumerate(s[:-(sequence_length - 1)]):
            # Window across the string, getting slices
            # of length == sequence_length. 
            candidate = s[i:i + sequence_length]
            counter[candidate] += 1
        sequence_length -= 1

    # Report.
    for k, v in counter.items():
        if v > 1:
            print('{} {} times'.format(k, v)) 
    return



if __name__ == '__main__':
    try:
        s = sys.argv[1]
    except IndexError:
        s = S 
    find(s)

$ python test.py

sample 2 times
sampl 2 times
ample 2 times
abcd 2 times
text 2 times
samp 2 times
ampl 2 times
mple 2 times

2
这是我解决这个问题的方法:
def get_repeated_words(string, minimum_len):

    # Storing count of repeated words in this dictionary
    repeated_words = {}

    # Traversing till last but 4th element
    # Actually leaving `minimum_len` elements at end (in this case its 4)
    for i in range(len(string)-minimum_len):

        # Starting with a length of 4(`minimum_len`) and going till end of string
        for j in range(i+minimum_len, len(string)):

            # getting the current word
            word = string[i:j]

            # counting the occurrences of the word
            word_count = string.count(word)

            if word_count > 1:

                # storing in dictionary along with its count if found more than once
                repeated_words[word] = word_count

    return repeated_words

if __name__ == '__main__':              
    mystring = "abcdthisisatextwithsampletextforasampleabcd"
    result = get_repeated_words(mystring, 4)

2
这里是一个使用more_itertools的简单解决方案。 给定
import collections as ct

import more_itertools as mit


s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)

Code

windows = mit.flatten(mit.windowed(s, n=i) for i in range(lbound, ubound))
filtered = {"".join(k): v for k, v in ct.Counter(windows).items() if v > 1}
filtered

输出

{'abcd': 2,
 'text': 2,
 'samp': 2,
 'ampl': 2,
 'mple': 2,
 'sampl': 2,
 'ample': 2,
 'sample': 2}

细节: 步骤如下: 1. 构建不同大小的 滑动窗口,其中 lbound <= n < ubound 2. 统计所有出现次数并过滤重复项 more_itertools 是第三方包,可以通过 > pip install more_itertools 安装。

2

以下是我会做的方法,但我不知道其他方法:

string = "abcdthisisatextwithsampletextforasampleabcd"
l = len(string)
occurences = {}
for i in range(4, l):
  for start in range(l - i):
    substring = string[start:start + i]
    occurences[substring] = occurences.get(substring, 0) + 1
for key in occurences.keys():
  if occurences[key] > 1:
    print("'" + key + "'", str(occurences[key]), "times")

输出:

'sample' 2 times
'ampl' 2 times
'sampl' 2 times
'ample' 2 times
'samp' 2 times
'mple' 2 times
'text' 2 times

虽然不是很高效,但易于理解。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,