I have a table that I need to map to two values: NY and CA are Domestic and WT is OUTSIDE; anything else must be OVERSEAS.
di = {"NY": "Domestic","CA": "Domestic","WT":"OUTSIDE"}
df.replace({'Territory': di})
How can I get "OVERSEAS" with the code above? By default it is not produced, because the dictionary has no OVERSEAS entry.
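The advantage of the dict.get and defaultdict approaches is that they avoid a second pass over the whole Series to replace NAs after mapping; the default is filled in during the mapping step itself.
import pandas as pd
df = pd.DataFrame({'Territory':['NY','CA','WT','SK','DE']})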
di = {"NY": "Domestic","CA": "Domestic","WT":"OUTSIDE"}
df['Territory'] = df['Territory'].map(lambda x: di.get(x, 'OVERSEAS'))
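The defaultdict variant mentioned above can be sketched in the same way. This is a minimal illustration that reuses the sample data from this answer; the names dd and mapped are only illustrative. It relies on the documented behaviour that Series.map uses a mapping's __missing__ default instead of NaN when the mapping is a dict subclass that defines one.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'Territory': ['NY', 'CA', 'WT', 'SK', 'DE']})
di = {"NY": "Domestic", "CA": "Domestic", "WT": "OUTSIDE"}

# defaultdict returns 'OVERSEAS' for any key that is not in di
dd = defaultdict(lambda: 'OVERSEAS')
dd.update(di)

# Series.map picks up the defaultdict's default for unmatched values,
# so no separate fillna step is needed
mapped = df['Territory'].map(dd)
print(mapped.tolist())
# ['Domestic', 'Domestic', 'OUTSIDE', 'OVERSEAS', 'OVERSEAS']
```

One thing to keep in mind: looking up a missing key in a defaultdict inserts that key, so dd grows by one entry for every unseen value it is asked about.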
Some timings for this approach are shown below:
df = pd.DataFrame({'Territory':['NY','CA','WT','SK','DE']})
di = {"NY": "Domestic","CA": "Domestic","WT":"OUTSIDE"}
%timeit df['Territory'].map(lambda x: di.get(x, 'OVERSEAS'))
>>> 138 µs ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
from collections import defaultdict
dd = defaultdict(lambda:'OVERSEAS')
dd.update(di)
%timeit df['Territory'].map(dd)
>>> 143 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df['Territory'] = df['Territory'].map(di).fillna('OVERSEAS')
>>> 657 µs ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
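For larger dictionaries, the performance difference becomes more pronounced.
Interestingly, mapping a dictionary with missing entries and no default value appears to be slow in pandas.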
%timeit df['Territory'].map(di)
>>> 372 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
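To check how the three approaches compare on larger inputs, something like the following sketch could be used outside IPython. The data here is entirely made up (100,000 rows, 1,000 hypothetical codes, half of them mapped), and the names s, big_di, big_dd and candidates are illustrative; timings will vary by machine and pandas version, so none are quoted.

```python
import timeit
from collections import defaultdict

import numpy as np
import pandas as pd

# Hypothetical larger data: 100,000 rows drawn from 1,000 made-up codes
rng = np.random.default_rng(0)
codes = [f"T{i}" for i in range(1000)]
s = pd.Series(rng.choice(codes, size=100_000))

# Only the first half of the codes have a mapping; the rest need the default
big_di = {c: "Domestic" for c in codes[:500]}
big_dd = defaultdict(lambda: "OVERSEAS", big_di)

candidates = {
    "map + dict.get": lambda: s.map(lambda x: big_di.get(x, "OVERSEAS")),
    "map + defaultdict": lambda: s.map(big_dd),
    "map + fillna": lambda: s.map(big_di).fillna("OVERSEAS"),
}

# Average wall-clock time over 5 runs of each approach
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=5) / 5
    print(f"{name}: {seconds * 1e3:.2f} ms per run")

# All three approaches should produce the same values
results = [fn() for fn in candidates.values()]
```

The timed lambdas return a new Series instead of assigning back to s, so every run maps the same original data.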
Series.map returns missing values (NaN) for entries that have no match in the dictionary, so Series.fillna can be used to replace them with the default value:
df = pd.DataFrame({'Territory':['NY','CA','WT','SK','DE']})
di = {"NY": "Domestic","CA": "Domestic","WT":"OUTSIDE"}
print (df)
  Territory
0        NY
1        CA
2        WT
3        SK
4        DE
df['Territory'] = df['Territory'].map(di).fillna('OVERSEAS')
print (df)
  Territory
0  Domestic
1  Domestic
2   OUTSIDE
3  OVERSEAS
4  OVERSEAS
df example, and describe the expected input/output in detail? - Beny Gj
whether df is a slice of another DataFrame. - Quang Hoang