You can use np.setdiff1d():
df['A-B'] = df.apply(lambda x: ' '.join(np.setdiff1d(x['A'].lower().split(),
                                                     x['B'].lower().split())), axis=1)
print(df)
                          A            B           A-B
0  Stack Overlflow is great  stack great  is overlflow
Your solution was almost there; you only need to add series.str.lower() when zipping the columns:
df['A-B'] = [' '.join(set(a.split()) - set(b.split()))
             for a, b in zip(df['A'].str.lower(), df['B'].str.lower())]
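One caveat with the set-based version: a Python set has no defined order, so the words in the joined string may come out in a different order from run to run. Here is a minimal sketch that sorts the remaining words to make the result reproducible. The sample data is an assumption that mirrors the question's columns, and sorting is my addition, not part of the approach above.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's columns.
df = pd.DataFrame({'A': ['Stack Overlflow is great'], 'B': ['stack great']})

diffs = []
for a, b in zip(df['A'].str.lower(), df['B'].str.lower()):
    # Set difference drops duplicates and does not guarantee any word order.
    remaining = set(a.split()) - set(b.split())
    # Sorting makes the joined string the same on every run.
    diffs.append(' '.join(sorted(remaining)))

df['A-B'] = diffs
print(df['A-B'].tolist())  # ['is overlflow']
```

If you need the words in their original order instead of alphabetical order, a set is the wrong tool and an order-preserving structure is a better fit.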
If the series contains repeated words, use OrderedDict to drop the duplicates, much like set() but preserving the original order:
df = pd.DataFrame({'A': ['Stack Overlflow is great is great'], 'B': ['stack great']})
                                   A            B
0  Stack Overlflow is great is great  stack great
from collections import OrderedDict
df['A-B'] = [' '.join([ele for ele in OrderedDict.fromkeys(a) if ele not in b])
             for a, b in zip(df.A.str.lower().str.split(), df.B.str.lower().str.split())]
print(df)
                                   A            B           A-B
0  Stack Overlflow is great is great  stack great  overlflow is
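On Python 3.7 and later, a plain dict also preserves insertion order, so dict.fromkeys can stand in for OrderedDict.fromkeys. Below is a rough sketch of the same idea as a standalone function. The name ordered_difference is hypothetical, and turning the words of B into a set for faster lookups is my own assumption rather than part of the answer above.

```python
def ordered_difference(a, b):
    """Return the lowercased words of `a` that are not in `b`,
    keeping first-occurrence order and dropping repeats."""
    # Set lookup keeps the membership test cheap for long strings.
    exclude = set(b.lower().split())
    # dict.fromkeys keeps only the first occurrence of each word, in order.
    unique_in_order = dict.fromkeys(a.lower().split())
    return ' '.join(w for w in unique_in_order if w not in exclude)


print(ordered_difference('Stack Overlflow is great is great', 'stack great'))
# overlflow is
```

To fill the column you could then write df['A-B'] = [ordered_difference(a, b) for a, b in zip(df['A'], df['B'])].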
setdiff1d, I'm reading up on it :). It doesn't mention anything about ordering. Does it keep the original order? - Erfan
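On the ordering question: as far as I can tell from the NumPy documentation, np.setdiff1d with the default assume_unique=False returns the unique values in sorted order, so the original word order is not kept. A quick check, using made-up input words as an assumption:

```python
import numpy as np

# In the input, 'overlflow' comes before 'is'.
words_a = 'stack overlflow is great'.split()
words_b = 'stack great'.split()

result = np.setdiff1d(words_a, words_b)
# The result is sorted alphabetically, not in input order.
print(result.tolist())  # ['is', 'overlflow']
```

That also explains why the first answer prints "is overlflow" rather than "overlflow is".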