How to find all occurrences of a substring?

Question

581

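Python has string.find() and string.rfind() to get the index of a substring in a string.

I'm wondering whether there is something like string.find_all() that returns all found indexes (not only the first from the beginning or the first from the end).

For example: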

string = "test test test test"

print(string.find('test'))  # 0
print(string.rfind('test'))  # 15

# this is the goal (find_all is hypothetical; str has no such method)
print(string.find_all('test'))  # [0,5,10,15]

To count the occurrences of a substring in a string, see "Count number of occurrences of a substring in a string".

- nukl

20

'ttt'.find_all('tt') should raise an error, because string objects in Python have no method named find_all(). - Santiago Alessandri

4

It should return '0'. Of course, in a perfect world there would also be 'ttt'.rfind_all('tt'), which should return '1'. - nukl

32 Answers

174

>>> help(str.find)
Help on method_descriptor:

find(...)
    S.find(sub [,start [,end]]) -> int

So we can build it ourselves:

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

list(find_all('spam spam spam spam', 'spam')) # [0, 5, 10, 15]

No temporary strings or regular expressions required.

- Karl Knechtel

27

To get overlapping matches, just replace start += len(sub) with start += 1. - Karl Knechtel

5

I think your previous comment should be added as a postscript to your answer. - tzot

1

Your code does not work for finding the substring "ATAT" in "GATATATGCATATACTT". - Ashish Negi

2

See the comment I added in the code. This is an example of an overlapping match. - Karl Knechtel

7

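To match the behavior of re.findall, I'd recommend using len(sub) or 1 instead of len(sub); otherwise this generator never terminates on an empty substring. - WGH

Personally, I think a_str.find should be replaced with a_str.index, so that the return is not needed. - user7050005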
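The two suggestions from the comments above (stepping by 1 for overlapping matches, and using len(sub) or 1 so an empty substring cannot stall the loop) can be combined in a small sketch. The overlapping parameter is a hypothetical addition for illustration and is not part of the original answer.

```python
def find_all(a_str, sub, overlapping=False):
    """Yield the start index of every occurrence of sub in a_str."""
    # Step by 1 for overlapping matches; otherwise skip past the whole match.
    # "len(sub) or 1" keeps the step positive when sub is the empty string.
    step = 1 if overlapping else (len(sub) or 1)
    start = a_str.find(sub)
    while start != -1:
        yield start
        start = a_str.find(sub, start + step)


print(list(find_all('spam spam spam spam', 'spam')))                 # [0, 5, 10, 15]
print(list(find_all('GATATATGCATATACTT', 'ATAT', overlapping=True)))  # [1, 3, 9]
```

With the default overlapping=False, the second call would give [1, 9] instead.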

85

Here is a (very inefficient) way to get all (even overlapping) matches:

>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]

This solution also works for multi-word substrings.

s = "Find THIS SUB-WORD in this sentence with THIS SUB-WORD"
sub = "THIS SUB-WORD"
[i for i in range(len(s)) if s.startswith(sub, i)]
# [5, 41]

- thkala

What if we want to check many characters using a for loop? With this code I would end up with a lot of for loops, and the time complexity gets too high. - Prof.Plague

3

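@thkala Very clever way to do it without the re module. Thanks for your answer! - Cute Panda

I think I prefer this answer because it doesn't need the re module. - Shen

Thanks, this works for me. The accepted answer cannot handle multi-word substrings, for example "Find THIS SUB-WORD in this sentence with THIS SUB-WORD". - Abu Shoeb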

76

Use re.finditer:

import re
sentence = input("Give me a sentence ")
word = input("What word would you like to find ")
for match in re.finditer(word, sentence):
    print (match.start(), match.end())

For word = "this" and sentence = "this is a sentence this this", this yields the output:

(0, 4)
(19, 23)
(24, 28)

- Idos

7

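It's worth pointing out that this only works for non-overlapping matches, so it cannot handle a case like sentence="ababa" and word="aba". - AnukuL

This will fail if the word contains any character that has a special meaning in regular expressions. - mousetail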
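One way to deal with the point about special characters is to pass the word through re.escape before searching, so that it is matched literally. A minimal sketch; the helper name literal_spans is made up for illustration.

```python
import re


def literal_spans(word, sentence):
    """Return (start, end) for each non-overlapping literal match of word."""
    # re.escape turns characters such as "." or "(" into literal matches.
    pattern = re.escape(word)
    return [(m.start(), m.end()) for m in re.finditer(pattern, sentence)]


print(literal_spans("this", "this is a sentence this this"))  # [(0, 4), (19, 23), (24, 28)]
print(literal_spans("a.b", "a.b axb a.b"))                    # [(0, 3), (8, 11)]
```

Without the escape, the pattern "a.b" would also match "axb" in the second call.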

66

This thread is a little old, but here is my solution using a generator and the plain str.find method.

def findall(p, s):
    '''Yields all the positions of
    the pattern p in the string s.'''
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i+1)

Example

x = 'banananassantana'
[(i, x[i:i+2]) for i in findall('na', x)]

returns

[(2, 'na'), (4, 'na'), (6, 'na'), (14, 'na')]

- AkiRoss

4

This looks beautiful! - fabio.sang

5

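In my tests, the solution using str.find is more than twice as fast as the one using re.finditer: on my machine the former takes "310 ns ± 5.35 ns per loop" and the latter "799 ns ± 5.72 ns per loop". This confirms something I have noticed before: built-in string methods are usually faster than regular expressions (the same holds for nested str.replace versus re.sub). - Jean Monet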

2

The most beautiful solution. Note that it can easily be generalized by introducing an optional parameter overlapping=True and replacing i+1 with i + (1 if overlapping else len(p)). - Hugues

25

You can use re.finditer() for non-overlapping matches.

>>> import re
>>> aString = 'this is a string where the substring "is" is repeated several times'
>>> print([(a.start(), a.end()) for a in list(re.finditer('is', aString))])
[(2, 4), (5, 7), (38, 40), (42, 44)]

But it won't work for:

In [1]: aString="ababa"

In [2]: print([(a.start(), a.end()) for a in list(re.finditer('aba', aString))])
Output: [(0, 3)]

- Chinmay Kanchi

14

Why turn the iterator into a list? It only slows things down. - pradyunsg

2

aString VS astring ;) - NexD.
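For the "ababa" case that plain re.finditer misses, one possible workaround is a zero-width lookahead, which lets the scan report overlapping matches. A minimal sketch; the helper name overlapping_spans is made up for illustration, and re.escape is used on the assumption that the substring should be matched literally.

```python
import re


def overlapping_spans(sub, text):
    """Return (start, end) for every occurrence of sub, overlaps included."""
    # (?=...) matches without consuming characters, so after each hit
    # the scan moves forward by a single position.
    pattern = "(?=" + re.escape(sub) + ")"
    return [(m.start(), m.start() + len(sub)) for m in re.finditer(pattern, text)]


print(overlapping_spans("aba", "ababa"))  # [(0, 3), (2, 5)]
```

The end index is computed from len(sub) because a lookahead match itself has zero width.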

22

Come, let us recurse together.

def locations_of_substring(string, substring):
    """Return a list of locations of a substring."""

    substring_length = len(substring)    
    def recurse(locations_found, start):
        location = string.find(substring, start)
        if location != -1:
            return recurse(locations_found + [location], location+substring_length)
        else:
            return locations_found

    return recurse([], 0)

print(locations_of_substring('this is a test for finding this and this', 'this'))
# prints [0, 27, 36]

No regular expressions are needed this way.

- Cody Piersall

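I had just started wondering "is there a fancy way to locate a substring inside a string in Python"... and after 5 minutes of googling I found your code. Thanks for sharing! - Geparada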

4

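This code has several problems. Because it works on open-ended data, it will hit a RecursionError sooner or later if there are enough occurrences. Another problem is that two throw-away lists are created on every iteration just to append a single element, which is very inefficient for a string-searching function that may be called frequently. Recursive functions can look elegant and clear, but they should be used with caution. - Ivan Nikolaev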
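An iterative variant avoids both concerns raised in the comment above: there is no recursion depth to exceed, and a single list is appended to in place. This is only a sketch; the name locations_of_substring_iter and the guard for an empty substring are assumptions added for illustration.

```python
def locations_of_substring_iter(string, substring):
    """Return a list of non-overlapping locations of substring in string."""
    locations = []
    # Guard (an assumption): advance by at least 1 so an empty substring terminates.
    step = len(substring) or 1
    start = 0
    while True:
        location = string.find(substring, start)
        if location == -1:
            return locations
        locations.append(location)
        start = location + step


print(locations_of_substring_iter('this is a test for finding this and this', 'this'))
# prints [0, 27, 36]
```

Because it loops instead of recursing, the number of matches is not limited by the interpreter's recursion limit.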

13

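If you're just looking for a single character, this would work:

from functools import reduce  # reduce is not a builtin in Python 3

string = "dooobiedoobiedoobie"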
match = 'o'
reduce(lambda count, char: count + 1 if char == match else count, string, 0)
# produces 7

Also,

string = "test test test test"
match = "test"
len(string.split(match)) - 1
# produces 4

My guess is that neither of these (especially #2) performs very well.

- jstaab

1

Great solution.. I am impressed by the use of the split() method - shantanu pathak
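When only the number of occurrences is needed, as in the two snippets above, the built-in str.count returns the number of non-overlapping occurrences directly. A small sketch:

```python
string = "dooobiedoobiedoobie"
print(string.count("o"))                     # 7

print("test test test test".count("test"))  # 4

# Overlapping occurrences are not counted: "tt" fits only once in "ttt".
print("ttt".count("tt"))                     # 1
```

Like the split() approach, this gives a count rather than the positions.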

12

This is an old thread, but I got interested and wanted to share my solution.

def find_all(a_string, sub):
    result = []
    k = 0
    while k < len(a_string):
        k = a_string.find(sub, k)
        if k == -1:
            return result
        else:
            result.append(k)
            k += 1 #change to k += len(sub) to not search overlapping results
    return result

It should return a list of the positions where the substring was found. Please comment if you see an error or room for improvement.

- Thurines

9

This does the trick for me using re.finditer.

import re

text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'

#  find all occurrences of the word 'as' in the above text

find_the_word = re.finditer('as', text)

for match in find_the_word:
    print('start {}, end {}, search string \'{}\''.
          format(match.start(), match.end(), match.group()))

- Bruno Vermeulen

- moinudin · Accepted Answer

There is no simple built-in string function that does what you're looking for, but you can use the more powerful regular expressions:

import re
[m.start() for m in re.finditer('test', 'test test test test')]
#[0, 5, 10, 15]

If you want to find overlapping matches, a lookahead will do that:

[m.start() for m in re.finditer('(?=tt)', 'ttt')]
#[0, 1]

If you want a reverse find-all without overlaps, you can combine positive and negative lookahead into an expression like this:

search = 'tt'
[m.start() for m in re.finditer('(?=%s)(?!.{1,%d}%s)' % (search, len(search)-1, search), 'ttt')]
#[1]

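re.finditer returns an iterator, so you can change the [] above to () to get a generator instead of a list. This is more efficient if you only iterate over the results once.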
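For the reverse, non-overlapping case, a plain-string alternative is to scan from the right with str.rfind. This is a sketch under stated assumptions: the name rfind_all is made up (echoing the question's comments), results come out right to left, and it is not claimed to match the regular expression above for every input.

```python
def rfind_all(s, sub):
    """Yield start indexes of non-overlapping matches, scanning right to left."""
    if not sub:
        # Assumption: reject the empty substring, which would never advance.
        raise ValueError("sub must not be empty")
    end = len(s)
    while True:
        i = s.rfind(sub, 0, end)
        if i == -1:
            return
        yield i
        end = i  # the next match must end at or before this one starts


print(list(rfind_all('ttt', 'tt')))                   # [1]
print(list(rfind_all('test test test test', 'test')))  # [15, 10, 5, 0]
```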