How do I get all fuzzy matching substrings between two strings in Python?

5

Suppose I have three sample strings:

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"

If I get all the matching substrings of text2 and text3 against text1 (and against each other), I get:
text1_text2_common = [
    '3 days ago.',
]

text2_text3_common = [
    'of',
]

text1_text3_common = [
    'was',
    'idx',
    'every',
    'hours',
]

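What I need is fuzzy matching, for example using Levenshtein distance, so that substrings are selected even if they don't match exactly, as long as they are similar enough to satisfy the criterion.
Ideally, I'm looking for something like this: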
text1_text3_common_fuzzy = [
    'prescription of idx, 20mg to be given every four hours'
]

1
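One idea: split the strings into words on delimiter characters (whitespace, punctuation, etc.) and normalize the case. Depending on the computational cost, compute the Levenshtein distance on single words, pairs, triples, and so on. - Aaron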
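The idea in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the commenter's code: the helper names (`levenshtein`, `tokenize`, `similar_ngrams`) and the `max_dist` cutoff are hypothetical, and the edit distance is a plain pure-Python implementation rather than a library call.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def tokenize(text: str) -> list:
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similar_ngrams(t1: str, t2: str, n: int, max_dist: int) -> list:
    """Return (n-gram of t1, n-gram of t2, distance) for pairs within max_dist."""
    w1, w2 = tokenize(t1), tokenize(t2)
    grams1 = [" ".join(w1[i:i + n]) for i in range(len(w1) - n + 1)]
    grams2 = [" ".join(w2[i:i + n]) for i in range(len(w2) - n + 1)]
    found = []
    for g1 in grams1:
        for g2 in grams2:
            dist = levenshtein(g1, g2)
            if dist <= max_dist:
                found.append((g1, g2, dist))
    return found
```

Calling `similar_ngrams(text1, text3, 3, 4)` with the strings from the question would compare every word triple of one text against every triple of the other; a larger `n` or `max_dist` trades precision for recall, and the cost grows with the product of the two n-gram counts.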
4 Answers

7
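Here is code that computes the fuzzy ratio similarity between substrings of string 1 and the full string 2. The code can also handle substrings of string 2 against the full string 1, and substrings of string 1 against substrings of string 2.
This code uses nltk to generate the ngrams.
Typical algorithm:
1. Generate ngrams from the given first string.
   Example: text2 = "The time of discomfort was 3 days ago.", total_length = 8 (words)
   In the code, param takes the values 5, 6, 7, 8.
   param = 5: ngrams = ['The time of discomfort was', 'time of discomfort was 3', 'of discomfort was 3 days', 'discomfort was 3 days ago.']
2. Compare with the second string.
   Example: text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
   @param=5
   - Compare 'The time of discomfort was' with text1 and get the fuzzy score.
   - Compare 'time of discomfort was 3' with text1 and get the fuzzy score.
   - And so on, until all elements in ngrams_5 are done.
   - If the fuzzy score is greater than or equal to the given threshold, save the substring.
   @param=6
   - Compare 'The time of discomfort was 3' with text1 and get the fuzzy score.
   - And so on.
   Continue until @param=8.
You can change n_start to 5 or some other value so that the ngrams of string 1 are compared with the ngrams of string 2; in that case it is a comparison between substrings of string 1 and substrings of string 2.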
# Generate ngrams for string2
n_start = 5  # st2_length
for n in range(n_start, st2_length + 1):
    ...
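Step 1 of the algorithm above (generating word n-grams for every length from 5 up to the full string) can be sketched without nltk. The helper name `word_ngrams` is hypothetical and stands in for `nltk.util.ngrams` plus the `' '.join(...)` used in the full code below.

```python
def word_ngrams(text: str, n: int) -> list:
    """Return every run of n consecutive whitespace-separated words in text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


text2 = "The time of discomfort was 3 days ago."

# Same range as the answer's loop: n = 5, 6, 7, 8 for this 8-word string.
for n in range(5, len(text2.split()) + 1):
    print(n, word_ngrams(text2, n))
```

For n = 5 this yields the four n-grams listed in step 1; for n = 8 it yields only the full string.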

As the comparison metric, I use:

fratio = fuzz.token_set_ratio(fs1, fs2)

See also here; you can try different ratios as well.

The sample you gave, 'prescription of idx, 20mg to be given every four hours', has a fuzzy score of 52.

See the sample console output.

7                    prescription of idx, 20mg to be given every four hours           52

Code

"""
fuzzy_match.py

https://dev59.com/OMTra4cB1Zd3GeqP10Jl

Dependent modules:
    pip install pandas
    pip install nltk
    pip install fuzzywuzzy
    pip install python-Levenshtein

"""


from nltk.util import ngrams
import pandas as pd
from fuzzywuzzy import fuzz


# Sample strings.
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"


def myprocess(st1: str, st2: str, threshold):
    """
    Generate sub-strings from st1 and compare with st2.
    The sub-strings, full string and fuzzy ratio will be saved in csv file.
    """
    data = []
    st1_length = len(st1.split())
    st2_length = len(st2.split())

    # Generate ngrams for string1
    m_start = 5
    for m in range(m_start, st1_length + 1):  # st1_length >= m_start

        # If m=3, fs1 = 'Patient has checked', 'has checked in', 'checked in for' ...
        # If m=5, fs1 = 'Patient has checked in for', 'has checked in for abdominal', ...
        for s1 in ngrams(st1.split(), m):
            fs1 = ' '.join(s1)
            
            # Generate ngrams for string2
            n_start = st2_length
            for n in range(n_start, st2_length + 1):
                for s2 in ngrams(st2.split(), n):
                    fs2 = ' '.join(s2)

                    fratio = fuzz.token_set_ratio(fs1, fs2)  # there are other ratios

                    # Save sub string if ratio is within threshold.
                    if fratio >= threshold:
                        data.append([fs1, fs2, fratio])

    return data


def get_match(sub, full, colname1, colname2, threshold=50):
    """
    sub: is a string where we extract the sub-string.
    full: is a string as the base/reference.
    threshold: is the minimum fuzzy ratio where we will save the sub string. Max fuzz ratio is 100.
    """   
    save = myprocess(sub, full, threshold)

    df = pd.DataFrame(save)
    if len(df):
        df.columns = [colname1, colname2, 'fuzzy_ratio']

        is_sort_by_fuzzy_ratio_first = True

        if is_sort_by_fuzzy_ratio_first:
            df = df.sort_values(by=['fuzzy_ratio', colname1], ascending=[False, False])
        else:
            df = df.sort_values(by=[colname1, 'fuzzy_ratio'], ascending=[False, False])

        df = df.reset_index(drop=True)

        df.to_csv(f'{colname1}_{colname2}.csv', index=False)

        # Print to console. Show only the sub-string and the fuzzy ratio. High ratio implies high similarity.
        df1 = df[[colname1, 'fuzzy_ratio']]
        print(df1.to_string())
        print()

        print(f'sub: {sub}')
        print(f'base: {full}')
        print()


def main():
    get_match(text2, text1, 'string2', 'string1', threshold=50)  # output string2_string1.csv
    get_match(text3, text1, 'string3', 'string1', threshold=50)

    get_match(text2, text3, 'string2', 'string3', threshold=10)

    # Other param combo.


if __name__ == '__main__':
    main()

Console output

                                  string2  fuzzy_ratio
0              discomfort was 3 days ago.           72
1           of discomfort was 3 days ago.           67
2      time of discomfort was 3 days ago.           60
3                of discomfort was 3 days           59
4  The time of discomfort was 3 days ago.           55
5           time of discomfort was 3 days           51

sub: The time of discomfort was 3 days ago.
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                                                    string3  fuzzy_ratio
0                                                 be given every four hours           61
1                                    idx, 20mg to be given every four hours           58
2        was given a prescription of idx, 20mg to be given every four hours           56
3                                              to be given every four hours           56
4   John was given a prescription of idx, 20mg to be given every four hours           56
5                                 of idx, 20mg to be given every four hours           55
6              was given a prescription of idx, 20mg to be given every four           52
7                    prescription of idx, 20mg to be given every four hours           52
8            given a prescription of idx, 20mg to be given every four hours           52
9                  a prescription of idx, 20mg to be given every four hours           52
10        John was given a prescription of idx, 20mg to be given every four           52
11                                              idx, 20mg to be given every           51
12                                        20mg to be given every four hours           50

sub: John was given a prescription of idx, 20mg to be given every four hours
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                  string2  fuzzy_ratio
0      time of discomfort was 3 days ago.           41
1           time of discomfort was 3 days           41
2                time of discomfort was 3           40
3                of discomfort was 3 days           40
4  The time of discomfort was 3 days ago.           40
5           of discomfort was 3 days ago.           39
6       The time of discomfort was 3 days           39
7              The time of discomfort was           38
8            The time of discomfort was 3           35
9              discomfort was 3 days ago.           34

sub: The time of discomfort was 3 days ago.
base: John was given a prescription of idx, 20mg to be given every four hours

Sample CSV output

string2_string1.csv


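Using spaCy similarity

Below are the results of comparing substrings of text3 against the full text1 using spaCy.

The results below can be compared with the second table above to see which method gives the better similarity ranking.

I used the large model to get the results below.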

Code

import spacy
import pandas as pd


nlp = spacy.load("en_core_web_lg")

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"

text3_sub = [
    'be given every four hours', 'idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four hours',
    'to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four hours',
    'of idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four',
    'prescription of idx, 20mg to be given every four hours',
    'given a prescription of idx, 20mg to be given every four hours',
    'a prescription of idx, 20mg to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four',
    'idx, 20mg to be given every',
    '20mg to be given every four hours'
]


data = []
for s in text3_sub:
    doc1 = nlp(s)
    doc2 = nlp(text1)
    sim = round(doc1.similarity(doc2), 3)
    data.append([s, text1, sim])

df = pd.DataFrame(data)
df.columns = ['from text3', 'text1', 'similarity']
df = df.sort_values(by=['similarity'], ascending=[False])
df = df.reset_index(drop=True)

df1 = df[['from text3', 'similarity']]
print(df1.to_string())

print()
print(f'text3: {text3}')
print(f'text1: {text1}')

Output

                                                                 from text3  similarity
0        was given a prescription of idx, 20mg to be given every four hours       0.904
1   John was given a prescription of idx, 20mg to be given every four hours       0.902
2                  a prescription of idx, 20mg to be given every four hours       0.895
3                    prescription of idx, 20mg to be given every four hours       0.893
4            given a prescription of idx, 20mg to be given every four hours       0.892
5                                 of idx, 20mg to be given every four hours       0.889
6                                    idx, 20mg to be given every four hours       0.883
7              was given a prescription of idx, 20mg to be given every four       0.879
8         John was given a prescription of idx, 20mg to be given every four       0.877
9                                         20mg to be given every four hours       0.877
10                                              idx, 20mg to be given every       0.835
11                                             to be given every four hours       0.834
12                                                be given every four hours       0.832

text3: John was given a prescription of idx, 20mg to be given every four hours
text1: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

It looks like the spaCy method produces a good similarity ranking.


1
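Maybe you can use this as a starting point for a solution. You can split the texts into words and group them, then compare each group with other substrings (of the same size) and return those above a certain ratio. I removed commas and periods because they don't matter to me. You can use any other comparison tool instead of SequenceMatcher (with some tools you don't need to split both sides into substrings of equal size; have a look at fuzzywuzzy). You will have to play with the ratio to get the results you want. You also have to go through the results and remove those that are substrings of other results. It depends on your needs: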
from difflib import SequenceMatcher

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"


def fuzzy_match(t1: str, t2: str, min_len: int = 3, max_len: int = 10, ratio_to_get: float = 0.6):
    t1 = t1.replace(".", "").replace(",", "")
    t2 = t2.replace(".", "").replace(",", "")

    result = set()
    t2_splitted = t2.split(" ")
    t1_splitted = t1.split(" ")
    for count in range(min_len, max_len, 1):
        for pos_2 in range(len(t2_splitted) - count + 1):
            substr_2 = " ".join(t2_splitted[pos_2: pos_2 + count])
            for pos_1 in range(len(t1_splitted) - count + 1):
                substr_1 = " ".join(t1_splitted[pos_1: pos_1 + count])
                ratio = SequenceMatcher(None, substr_1, substr_2).ratio()
                if ratio >= ratio_to_get:
                    result.add(substr_1)

    return result


if __name__ == '__main__':
    print(fuzzy_match(text1, text2))
    print(fuzzy_match(text2, text1))
    print(fuzzy_match(text1, text3))
    print(fuzzy_match(text3, text1))
    print(fuzzy_match(text2, text3))
    print(fuzzy_match(text3, text2))

Output:

{'days ago Patient', '3 days ago', 'started 3 days ago', 'which started 3 days ago', '3 days ago Patient', 'started 3 days'}
{'was 3 days', '3 days ago', 'discomfort was 3 days ago', 'was 3 days ago'}
{'prescribed idx 20 mg', 'mg every 4 hours', 'idx 20 mg every 4', '20 mg every 4 hours', 'Patient was prescribed idx 20 mg every 4', 'ago Patient was prescribed idx 20 mg every 4', 'ago Patient was prescribed idx 20 mg', 'was prescribed idx 20', 'ago Patient was prescribed idx 20', 'ago Patient was prescribed idx 20 mg every', 'prescribed idx 20', 'was prescribed idx 20 mg every', 'Patient was prescribed idx 20 mg', 'ago Patient was prescribed idx', 'was prescribed idx 20 mg', 'Patient was prescribed', 'prescribed idx 20 mg every', 'idx 20 mg', 'Patient was prescribed idx 20 mg every', 'was prescribed idx 20 mg every 4', 'idx 20 mg every', 'mg every 4', '20 mg every', 'Patient was prescribed idx 20', 'prescribed idx 20 mg every 4', 'every 4 hours'}
{'of idx 20mg to', 'given a prescription of idx 20mg to be', 'idx 20mg to be given', 'given a prescription of idx', 'be given every', 'idx 20mg to', 'every four hours', 'given every four', 'given a prescription of idx 20mg to', 'prescription of idx 20mg to be', 'a prescription of idx 20mg', 'prescription of idx 20mg to', 'prescription of idx 20mg', 'given a prescription of idx 20mg to be given', 'a prescription of idx 20mg to be', 'idx 20mg to be', 'given a prescription', 'to be given every four hours', 'be given every four', 'given every four hours', 'a prescription of idx 20mg to', 'be given every four hours', 'given a prescription of idx 20mg', 'a prescription of idx', 'prescription of idx', 'of idx 20mg'}
set()
set()
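The answer mentions that results which are substrings of other results still have to be removed. One possible way to do that is sketched below; the function name `drop_contained` is hypothetical, and the containment check is a plain character-level `in`, so it assumes that a match appearing inside a longer match (even across a word boundary) should be dropped.

```python
def drop_contained(matches):
    """Keep only the matches that are not contained in a longer kept match."""
    kept = []
    # Visit the longest strings first so shorter ones can be checked against them.
    for candidate in sorted(matches, key=len, reverse=True):
        if not any(candidate in longer for longer in kept):
            kept.append(candidate)
    return kept


second_output = {'was 3 days', '3 days ago', 'discomfort was 3 days ago', 'was 3 days ago'}
print(drop_contained(second_output))
```

Applied to the second output set above, only 'discomfort was 3 days ago' remains, since the other three matches are all contained in it.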

1

Short answer:

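import string


def match(a, b):
    # Letter-frequency similarity in [0, 1]; character order is ignored.
    a, b = a.lower(), b.lower()
    error = 0
    for i in string.ascii_lowercase:
        error += abs(a.count(i) - b.count(i))
    total = len(a) + len(b)
    return (total - error) / total


def what_you_need(s1, s2, fuziness=0.8):
    # When the similarity is below `fuziness`, return the characters of s1
    # that also occur in s2; otherwise the function returns None.
    if match(s1, s2) < fuziness:
        syms = []
        for s in s1:
            if s in s2:
                syms.append(s)
        return "".join(syms)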

1
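The most efficient algorithm for this problem is to build a finite state transducer and use it for fuzzy matching.
However, there are other, simpler approaches, such as prefix trees (tries).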
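To make the prefix-tree idea a little more concrete, here is a minimal sketch of fuzzy lookup in a trie: it walks the tree while extending one row of the Levenshtein table per character and prunes branches that can no longer get under the distance limit. This is a well-known general technique, not code from this answer; the names `TrieNode`, `build_trie`, `fuzzy_search` and the `max_dist` parameter are illustrative, and it matches single words rather than multi-word substrings.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.word = None    # set on the node that ends a stored word


def build_trie(words):
    """Insert every word into a prefix tree and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def fuzzy_search(root, query, max_dist):
    """Return {stored word: edit distance} for words within max_dist of query."""
    results = {}
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # Extend the Levenshtein table by one row for the character `ch`.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        if node.word is not None and row[-1] <= max_dist:
            results[node.word] = row[-1]
        # Prune: if every cell exceeds the limit, no longer word below can match.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results


trie = build_trie(["prescribed", "prescription", "hours", "every"])
print(fuzzy_search(trie, "prescripton", 2))
```

Because words sharing a prefix share the same table rows, the trie avoids recomputing the distance from scratch for every stored word.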
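Please see the following links for extensive write-ups on this topic by more knowledgeable people. Peter Norvig's "How to Write a Spelling Corrector" includes an example of implementing this in something like the Google search bar, which automatically corrects spelling mistakes, and links to examples in other languages. Kay Schlühr's article on fuzzy string matching goes into the topic in more depth, and at a much higher quality than I could provide.
If you are more interested in how Lucene, the library behind ElasticSearch, implements this, see Michael McCandless's blog post on FSTs and the academic paper on the subject.
If you are looking for the (current) state of the art, check out pisa, which contains state-of-the-art information retrieval algorithms.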
