如何在Python中高效地搜索大文本中的相似子字符串?

8

让我尝试通过一个例子来解释我的问题,我有一个大语料库和一个如下的子字符串:

corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""

substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""

子串和语料库非常相似,但并不完全相同。

如果我做这样的事情,

import re
re.search(substring, corpus, flags=re.I) # this will fail substring is not exact but rather very similar

在语料库中,子字符串如下所示,与我拥有的子字符串略有不同,因此正则表达式搜索失败了,是否有人能够建议一个非常好的替代方案来查找类似的子字符串?

until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now

我尝试了difflib库,但它并不能满足我的使用情况。

一些背景信息:

我现在拥有的子字符串是一段时间前使用这个正则表达式re.sub("[^a-zA-Z]", " ", corpus)从预处理语料库中获取的。

但现在我需要使用我拥有的子字符串来进行反向查找语料库文本,并找到语料库中的起始和结束索引。


在这段程序相关的内容中,只需要翻译其文本内容而不进行解释:如果它们仅因特殊字符而不同,则可以删除它们并进行匹配- reduced_string=re.sub("[^A-Z]","",corpus,0,re.IGNORECASE) - Chris
@Chris 我的使用情景是需要在语料库中查找子字符串,而不需要删除语料文本中的特殊字符。我所拥有的子字符串是从经过预处理的语料库中得到的,使用了这个正则表达式 re.sub("[^a-zA-Z]", " ", corpus)。我需要进行反向查找。 - user_12
你可以创建一个特殊字符和它们索引的映射表,然后在获取子字符串时以同样的方式替换它们。搜索子字符串,获取起始和结束索引,然后从映射表中恢复特殊字符。 - Igor Moraru
@IgorMoraru,你能否提供一个使用我的数据的例子来说明如何实现它? - user_12
@user_12,我已经更新了我的回答以适应你编辑后的问题。 - Chris
4个回答

3

实际上,对于给定的示例,您并不需要进行模糊匹配;文本只能在substring内的空格中更改,并且它只能通过添加至少一个非字母字符来更改(可以替换空格,但不能删除空格而不进行替换)。这意味着您可以直接从带有单词之间通配符的substring构造正则表达式,在corpus中进行search(或finditer),并且生成的匹配对象将告诉您匹配开始和结束的位置:

import re

# Allow any character between whitespace-separated "words" except ASCII
# alphabetic characters
ssre = re.compile(r'[^a-z]+'.join(substring.split()), re.IGNORECASE)

if m := ssre.search(corpus):
    print(m.start(), m.end())

    print(repr(m.group(0)))

在线试用!

正确地识别了匹配在corpus中开始的位置(索引217)和结束的位置(索引771);如果您愿意,.group(0)可以直接提取匹配的文本(通常不需要使用索引,因此有很大的可能性是您仅仅为了提取真实文本而请求它们,而.group(0)可以直接完成这个任务)。输出结果如下:

217 771
"until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now"

如果空格可能被删除而不被替换,只需将+量词更改为*(正则表达式运行速度会稍慢,因为它不能像短路那样轻松地处理,但仍然可以工作,并且应该足够快)。
如果您需要处理非ASCII字母字符,则正则表达式连接器可以从r'[^a-z]+'更改为等效的r'[\W\d_]+'(这意味着“匹配所有非单词字符[非字母数字和非下划线],加上数字字符和下划线”);它有点难以阅读,但可以正确处理像é这样的内容(将其视为单词的一部分,而不是连接符)。
虽然它不会像difflib那样灵活,但当您知道没有单词被删除或添加时,这只是一个间距和标点符号的问题,它可以完美地工作,并且应该比真正的模糊匹配解决方案(必须做更多的工作来处理接近匹配的概念)运行得更快。

非常感谢,我认为对于我的用例来说,这似乎是一个有效的方法。我没有删除任何单词,只有标点符号、空格和数字,这个解决方案对我很有效。 - user_12
1
@user_12:太棒了。我使用修改后的正则表达式连接器更新了它,以处理字母字符的 Unicode 概念,而不仅仅是 ASCII;“非字母”字符类相当丑陋([\ W\d_]与更清晰的[^a-z]相比),但它将处理任意字母的 Unicode,即使 Unicode 添加更多脚本。如果您的用例是纯 ASCII,则两者都可以工作,因此由您(和数据集)决定哪个更有意义。 - ShadowRanger

1

虽然如果两个字符串只有一个字符不同,你就无法找到其完全匹配的部分,但你可以找到相似的字符串。

因此,在这里我利用了内置的difflib SequenceMatcher来检查两个不同的字符串的相似性。

如果您需要子字符串在语料库中开始的索引,那么这可以很容易地添加。如有任何问题,请留言。

希望能对你有所帮助。- 适应你编辑后的问题 - 实现了@ShadowRanger的备注

import re
from difflib import SequenceMatcher


def similarity(a, b) -> float:
    """Return similarity between 2 strings"""
    return SequenceMatcher(None, a, b).ratio()


def find_similar_match(a, b, threshold=0.7) -> list:
    """Find string b in a - while the strings being different"""
    corpus_lst = a.split()
    substring_lst = b.split()

    nonalpha = re.compile(r"[^a-zA-Z]")
    start_indices = [i for i, x in enumerate(
        corpus_lst) if nonalpha.sub("", x) == substring_lst[0]]
    end_indices = [i for i, x in enumerate(
        corpus_lst) if nonalpha.sub("", x) == substring_lst[-1]]

    rebuild_substring = " ".join(substring_lst)
    max_sim = 0
    for start_idx in start_indices:
        for end_idx in end_indices:
            corpus_search_string = " ".join(corpus_lst[start_idx: end_idx])
            sim = similarity(corpus_search_string, rebuild_substring)
            if sim > max_sim:
                result = [start_idx, end_idx]

    return result

调用 find_similar_match(corpus, substring) 的结果如下:
Found a match with similarity : 0.8429752066115702
[38, 156]

一则性能提示:对于大量的输入,在Python层多次查找缓存编译的正则表达式(而不是预编译和使用C加速的编译正则表达式)的成本可能会有所差异。您可能希望在函数顶部执行nonalpha = re.compile(r"[^a-zA-Z]"),然后将re.sub("[^a-zA-Z]", "", x)替换为nonalpha.sub("", x)。您还需要将" ".join(substring_lst)移出循环(它永远不会改变,但您可能要重建许多次)。 - ShadowRanger
@Chris 非常感谢。这在我的示例上似乎有效,但我不确定它在更大的数据上效率如何,并且还需要测试它是否在任何情况下失败。我现在会保持这个问题开放,以寻找更高效的方法。 - user_12
完全有道理 @ShadowRanger - 在复制粘贴和测试时没有考虑到这一点 - 谢谢提醒! - Chris
@user_12 很高兴能帮到您 - 如果它按预期工作,请接受答案,这样其他人就可以看到它了 =) - Chris
@user_12 很高兴能够帮助 - 如果它能够按照预期工作,请接受答案,这样其他人就可以看到它了 =) - Chris
1
@Chris 当子字符串和语料库相同时,似乎会出现故障。在一些情况下,我的语料库是一个短句子,而我的子字符串在这些情况下将与语料库相同,因此它会失败,因为它正在寻找最大相似度分数。 - user_12

0

这不是最好的解决方案,但可能会有所帮助。

match = SequenceMatcher(None, corpus, substring).find_longest_match(0, len(corpus), 0, len(substring))

print(match)
print(corpus[match.a:match.a + match.size])
print(substring[match.b:match.b + match.size])

不完全是我想要的,我已经尝试过了。我想找到语料库中子字符串的起始和结束索引。但是你不能使用re.search,因为它不是精确匹配,而是类似子字符串搜索。 - user_12
是的,根据更新后的问题,@Chris 提供了更好的解决方案。 - Hitesh Marwari

0

这可能有助于您根据子字符串中语料库中单词的百分比来可视化这两个字符串的相似性。

字面上翻译:在语料库中是子字符串的单词的百分比方面,这可能有助于您将两个字符串的相似性可视化。

下面的代码旨在:

  • 使用子字符串作为单词
  • 查找语料库中的这些单词(如果找到-使它们大写)
  • 显示语料库中的修改
  • 计算语料库中修改单词的百分比
  • 显示没有在语料库中找到的子字符串中的单词数

通过这种方式,您可以看到哪些子字符串中的单词与语料库中的单词匹配,然后按单词(但不一定按正确顺序)标识相似度的百分比。

代码:

import re
corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""

substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""

sub_list = set(substring.split(" "))
unused_words = []
for word in sub_list:
    if word in corpus:
        r = r"\b" + word + r"\b"
        ru = f"{word.upper()}"
        corpus = re.sub(r, ru, corpus)
    else:
        unused_words.append(word)

print(corpus)

lower_strings = len(re.findall("[a-z']+", corpus))
upper_strings = len(re.findall("[A-Z']+", corpus))
print(f"\nWords Matched = {(upper_strings)/(upper_strings + lower_strings)*100:.1f}%")
print(f"Unused Substring words: {len(unused_words)}")

输出:

very quick service, polite workers(cory, I think THAT'S his name), I
basically just drove there AND got A quote(which SEEMS TO be very fair
priced), THEN DROPPED OFF MY CAR 4 days later(because THEY WERE fully
booked UNTIL THEN), THEN I DROPPED OFF MY CAR ON MY APPOINTMENT DAY, THEN
THE SAME DAY THE SHOP CALLED ME AND NOTIFIED ME THAT THE THE JOB IS DONE I
CAN GO PICKUP MY CAR. WHEN I GO CHECKED OUT MY CAR I WAS AMAZED BY THE JOB
THEY'VE DONE TO IT, AND THEY EVEN GAVE THAT DIRTY CAR A WASH( PROB EVEN
WAXED IT OR COATED IT, CUZ IT WAS SHINY AS HELL), TIRES SHINE, MATS WERE 
VACUUMED TOO. I GAVE THEM A DIRTY, BROKEN CAR, THEY GAVE ME BACK A WHAT 
SEEMS LIKE A BRAND NEW CAR. I'M HAPPY WITH THE RESULT, AND I WILL DEF HAVE 
ALL MY CAR'S WORK DONE BY THIS PLACE FROM NOW.

Words Matched = 82.1%
Unused Substring words: 0

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接