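The code below computes a fuzzy-ratio similarity between substrings of string 1 and the full string 2. It can also compare substrings of string 2 with the full string 1, and substrings of string 1 with substrings of string 2.
The code uses nltk to generate the ngrams.
Typical algorithm:
1. Generate ngrams from the first given string.
Example: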
text2 = "The time of discomfort was 3 days ago."
total_length = 8
In the code, param takes the values 5, 6, 7 and 8.
param = 5
ngrams = ['The time of discomfort was', 'time of discomfort was 3',
'of discomfort was 3 days', 'discomfort was 3 days ago.']
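2. Compare with the second string.
Example:
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."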
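@param=5
- Compare "The time of discomfort was" with text1 and get the fuzzy score.
- Compare "time of discomfort was 3" with text1 and get the fuzzy score.
- And so on, until every element of ngrams_5 has been processed.
- Save the substring if its fuzzy score is greater than or equal to the given threshold.
@param=6
- Compare "The time of discomfort was 3" with text1 and get the fuzzy score.
- And so on.
Continue until @param=8.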
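The two steps above can be sketched without any third-party packages. This is only an illustration: word_ngrams is a hypothetical stand-in for nltk.util.ngrams, and stand_in_score is a hypothetical difflib-based scorer, not fuzz.token_set_ratio, so its scores will differ from the ones shown in this answer.

```python
from difflib import SequenceMatcher


def word_ngrams(text, n):
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def stand_in_score(a, b):
    """Hypothetical 0-100 scorer built on difflib (NOT fuzz.token_set_ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def scan(sub_source, full, threshold, n_start=5):
    """Score each n-gram of sub_source (n = n_start..word count) against full."""
    total_length = len(sub_source.split())
    kept = []
    for n in range(n_start, total_length + 1):
        for gram in word_ngrams(sub_source, n):
            score = stand_in_score(gram, full)
            if score >= threshold:  # keep only candidates at or above the threshold
                kept.append((gram, score))
    return kept


text1 = ("Patient has checked in for abdominal pain which started 3 days ago. "
         "Patient was prescribed idx 20 mg every 4 hours.")
text2 = "The time of discomfort was 3 days ago."

print(word_ngrams(text2, 5))                 # the four 5-word ngrams from step 1
print(len(scan(text2, text1, threshold=0)))  # 4 + 3 + 2 + 1 candidates in total
```

With a threshold of 0 every candidate is kept, which makes it easy to see how many substrings are scored before any filtering happens.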
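In the code, n_start is set to the word count of the second string, so only its full text is used. You can change n_start to 5 or some other value so that the ngrams of string 1 are compared with the ngrams of string 2; in that case the comparison is between substrings of string 1 and substrings of string 2.
n_start = 5
for n in range(n_start, st2_length + 1):
    ...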
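Lowering n_start increases the number of comparisons quickly. The rough count below uses a hypothetical helper, window_count, and the word counts of text2 (8 words) and text3 (14 words) from this answer; it only counts pairs and does not score them.

```python
def window_count(num_words, n_start):
    """Number of word n-grams with n running from n_start up to num_words."""
    return sum(num_words - n + 1 for n in range(n_start, num_words + 1))


text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"
len2 = len(text2.split())
len3 = len(text3.split())

subs_of_text2 = window_count(len2, 5)                      # substrings taken from text2
pairs_full_only = subs_of_text2 * window_count(len3, len3)  # n_start = word count of text3
pairs_sub_vs_sub = subs_of_text2 * window_count(len3, 5)    # n_start = 5

print(subs_of_text2, pairs_full_only, pairs_sub_vs_sub)
```

For these two strings the substring-versus-substring setting scores 550 pairs instead of 10, so on long texts it may be worth raising m_start or n_start.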
For the comparison I use:
fratio = fuzz.token_set_ratio(fs1, fs2)
You can also try the other ratio functions.
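To give a feel for what a token-set style ratio does, here is a simplified sketch built on difflib. It is an approximation written for illustration, not fuzzywuzzy's implementation: the tokenizer, the helper names and the underlying SequenceMatcher scoring are all assumptions, so the numbers will not match fuzz.token_set_ratio exactly.

```python
import re
from difflib import SequenceMatcher


def _tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _ratio(a, b):
    """Plain 0-100 similarity between two strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(s1, s2):
    """Compare the shared tokens against each string's sorted token set."""
    t1, t2 = _tokens(s1), _tokens(s2)
    if not t1 or not t2:
        return 0
    common = " ".join(sorted(t1 & t2))
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    # The best of the three pairings is the score.
    return max(_ratio(common, combined1),
               _ratio(common, combined2),
               _ratio(combined1, combined2))


print(token_set_ratio_sketch("3 days ago", "ago 3 days"))           # word order is ignored
print(token_set_ratio_sketch("days ago", "started 3 days ago"))     # subset of tokens
print(token_set_ratio_sketch("The time of discomfort", "every four hours"))
```

Because a string whose tokens are all contained in the other one already scores 100 under this scheme, very short substrings can rank high; that is one reason to start the ngram size at 5 rather than 1.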
The sample you provided, 'prescription of idx, 20mg to be given every four hours', has a fuzzy score of 52.
See this line from the sample console output:
7 prescription of idx, 20mg to be given every four hours 52
Code
"""
fuzzy_match.py
https://dev59.com/OMTra4cB1Zd3GeqP10Jl
Dependent modules:
pip install pandas
pip install nltk
pip install fuzzywuzzy
pip install python-Levenshtein
"""
from nltk.util import ngrams
import pandas as pd
from fuzzywuzzy import fuzz
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"
def myprocess(st1: str, st2: str, threshold):
    """
    Generate sub-strings from st1 and compare with st2.
    The sub-strings, full string and fuzzy ratio will be saved in csv file.
    """
    data = []
    st1_length = len(st1.split())
    st2_length = len(st2.split())

    # Sub-strings of st1, from m_start words up to the full string.
    m_start = 5
    for m in range(m_start, st1_length + 1):
        for s1 in ngrams(st1.split(), m):
            fs1 = ' '.join(s1)

            # n_start = st2_length means only the full st2 is used.
            # Lower it (e.g. to 5) to also compare against sub-strings of st2.
            n_start = st2_length
            for n in range(n_start, st2_length + 1):
                for s2 in ngrams(st2.split(), n):
                    fs2 = ' '.join(s2)

                    fratio = fuzz.token_set_ratio(fs1, fs2)
                    if fratio >= threshold:
                        data.append([fs1, fs2, fratio])

    return data


def get_match(sub, full, colname1, colname2, threshold=50):
    """
    sub: is a string where we extract the sub-string.
    full: is a string as the base/reference.
    threshold: is the minimum fuzzy ratio where we will save the sub string. Max fuzz ratio is 100.
    """
    save = myprocess(sub, full, threshold)

    df = pd.DataFrame(save)
    if len(df):
        df.columns = [colname1, colname2, 'fuzzy_ratio']

        is_sort_by_fuzzy_ratio_first = True

        if is_sort_by_fuzzy_ratio_first:
            df = df.sort_values(by=['fuzzy_ratio', colname1], ascending=[False, False])
        else:
            df = df.sort_values(by=[colname1, 'fuzzy_ratio'], ascending=[False, False])

        df = df.reset_index(drop=True)

        df.to_csv(f'{colname1}_{colname2}.csv', index=False)

        # Print only the sub-string and its fuzzy ratio to the console.
        df1 = df[[colname1, 'fuzzy_ratio']]
        print(df1.to_string())
        print()

        print(f'sub: {sub}')
        print(f'base: {full}')
        print()


def main():
    get_match(text2, text1, 'string2', 'string1', threshold=50)
    get_match(text3, text1, 'string3', 'string1', threshold=50)
    get_match(text2, text3, 'string2', 'string3', threshold=10)


if __name__ == '__main__':
    main()
Console output
string2 fuzzy_ratio
0 discomfort was 3 days ago. 72
1 of discomfort was 3 days ago. 67
2 time of discomfort was 3 days ago. 60
3 of discomfort was 3 days 59
4 The time of discomfort was 3 days ago. 55
5 time of discomfort was 3 days 51
sub: The time of discomfort was 3 days ago.
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.
string3 fuzzy_ratio
0 be given every four hours 61
1 idx, 20mg to be given every four hours 58
2 was given a prescription of idx, 20mg to be given every four hours 56
3 to be given every four hours 56
4 John was given a prescription of idx, 20mg to be given every four hours 56
5 of idx, 20mg to be given every four hours 55
6 was given a prescription of idx, 20mg to be given every four 52
7 prescription of idx, 20mg to be given every four hours 52
8 given a prescription of idx, 20mg to be given every four hours 52
9 a prescription of idx, 20mg to be given every four hours 52
10 John was given a prescription of idx, 20mg to be given every four 52
11 idx, 20mg to be given every 51
12 20mg to be given every four hours 50
sub: John was given a prescription of idx, 20mg to be given every four hours
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.
string2 fuzzy_ratio
0 time of discomfort was 3 days ago. 41
1 time of discomfort was 3 days 41
2 time of discomfort was 3 40
3 of discomfort was 3 days 40
4 The time of discomfort was 3 days ago. 40
5 of discomfort was 3 days ago. 39
6 The time of discomfort was 3 days 39
7 The time of discomfort was 38
8 The time of discomfort was 3 35
9 discomfort was 3 days ago. 34
sub: The time of discomfort was 3 days ago.
base: John was given a prescription of idx, 20mg to be given every four hours
Sample CSV output
string2_string1.csv
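Using spaCy similarity
Below are the results of comparing the substrings of text3 against the full text1 using spaCy.
They can be compared with the second table above to see which method gives the better similarity ranking.
I used the large model to get the results below.
Code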
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_lg")
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"
text3_sub = [
    'be given every four hours', 'idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four hours',
    'to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four hours',
    'of idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four',
    'prescription of idx, 20mg to be given every four hours',
    'given a prescription of idx, 20mg to be given every four hours',
    'a prescription of idx, 20mg to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four',
    'idx, 20mg to be given every',
    '20mg to be given every four hours'
]

# Parse the reference text once; it is the same for every substring.
doc2 = nlp(text1)

data = []
for s in text3_sub:
    doc1 = nlp(s)
    sim = round(doc1.similarity(doc2), 3)
    data.append([s, text1, sim])

df = pd.DataFrame(data)
df.columns = ['from text3', 'text1', 'similarity']
df = df.sort_values(by=['similarity'], ascending=[False])
df = df.reset_index(drop=True)
df1 = df[['from text3', 'similarity']]
print(df1.to_string())
print()
print(f'text3: {text3}')
print(f'text1: {text1}')
Output
from text3 similarity
0 was given a prescription of idx, 20mg to be given every four hours 0.904
1 John was given a prescription of idx, 20mg to be given every four hours 0.902
2 a prescription of idx, 20mg to be given every four hours 0.895
3 prescription of idx, 20mg to be given every four hours 0.893
4 given a prescription of idx, 20mg to be given every four hours 0.892
5 of idx, 20mg to be given every four hours 0.889
6 idx, 20mg to be given every four hours 0.883
7 was given a prescription of idx, 20mg to be given every four 0.879
8 John was given a prescription of idx, 20mg to be given every four 0.877
9 20mg to be given every four hours 0.877
10 idx, 20mg to be given every 0.835
11 to be given every four hours 0.834
12 be given every four hours 0.832
text3: John was given a prescription of idx, 20mg to be given every four hours
text1: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.
It looks like the spaCy method produces a good similarity ranking.
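As background for reading these numbers: as far as I know, spaCy's default Doc.similarity is the cosine similarity between the averaged word vectors of the two texts. The sketch below shows that calculation on made-up 3-dimensional vectors; the vectors and helper names are hypothetical and are not taken from en_core_web_lg.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either one has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up word vectors, for illustration only.
short_phrase = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
long_text = [[0.7, 0.3, 0.1], [0.1, 0.9, 0.2], [0.6, 0.2, 0.3]]

similarity = cosine(mean_vector(short_phrase), mean_vector(long_text))
print(round(similarity, 3))
```

Since the vectors are averaged, word order plays no role, and the result lies between -1 and 1 rather than 0 and 100. When comparing it with the fuzzy ratios above, it is therefore the ranking that matters, not the raw values.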