我有一个大型数据框需要与另一个数据框进行比较并更正其ID。我将通过这个简单的示例说明我的问题。
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
df = pd.DataFrame({'id':['nan','nan','nan'],
'description':['JOHN HAS 25 YEAR OLD LIVES IN At/12','STEVE has 50 OLD LIVES IN At.14','ALICIE HAS 10 YEAR OLD LIVES IN AT13']})
print(df)
df1 = pd.DataFrame({'id':[1203,1205,1045],
'description':['JOHN HAS 25year OLD LIVES IN At 2','STEVE has 50year OLD LIVES IN At 14','ALICIE HAS 10year OLD LIVES IN At 13']})
print(df1)
age = ["50year", "25year", "10year"]
for a in age:
ruler.add_patterns([{"label": "age", "pattern": a}])
names = ["JOHN", "STEVE", "ALICIA"]
for n in names:
ruler.add_patterns([{"label": "name", "pattern": n}])
ref = ["AT 2", "At 13", "At 14"]
for r in ref:
ruler.add_patterns([{"label": "ref", "pattern": r}])
#exp to check text difference
doc = nlp("JOHN has 25 YEAR OLD LIVES IN At.12 ")
for ent in doc.ents:
print(ent, ent.label_)
实际上,这两个数据帧df和参考数据帧df1的文本存在差异,如下图所示:
我不知道如何在这种情况下获得100%的相似度。
我尝试使用spacy,但我不知道如何修复差异并更正df中的id。
这是我的数据帧1:
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
这是我的参考数据框:
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
My expected output:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10year OLD LIVES IN AT15
NB: 这些句子的顺序不同 / 数据框长度不相等
spacy
中使用它们(如果你知道所有令牌的变化)来匹配特定的令牌,比如age
或者at
。但是我不确定如何将其添加到管道中。 - SpaceBurger