Dictionary in a loop for pd.DataFrame

3

I have a dataset with many columns in which I need to change the values of some variables. I do it as follows:

import pandas as pd
import numpy as np
df = pd.DataFrame({'one':['a' , 'b']*5, 'two':['c' , 'd']*5, 'three':['a' , 'd']*5})

Selection
df1 = df[['one', 'two']]

Dictionary

map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'}

Loop

df2=[]
for i in df1.values:
    row = [ map[x] for x in i]  # named "row" so the numpy alias "np" is not overwritten
    df2.append(row)

Then I modify the columns

df['one'] = [row[0] for row in df2]
df['two'] = [row[1] for row in df2]

The code works, but it is verbose. How can I make it more concise?


1
df.replace is a pandas function for replacing values in a DataFrame. For more information see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html - DeepSpace
3 Answers

2
You can iterate over the columns and use Series.map():
cols = ['one', 'two']
mapd = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'}

for col in cols:
    df[col] = df[col].map(mapd).fillna(df[col])


df
Out: 
  one three two
0   d     a   b
1   c     d   a
2   d     a   b
3   c     d   a
4   d     a   b
5   c     d   a
6   d     a   b
7   c     d   a
8   d     a   b
9   c     d   a
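
The `.fillna(df[col])` step matters only when a column holds values that are not keys of the mapping. In the example data every value is a key, so it changes nothing there. A minimal sketch of the difference, using a made-up Series with an extra value `'x'` that is not part of the original question:

```python
import pandas as pd

# 'x' is a hypothetical value that is deliberately absent from the mapping.
s = pd.Series(['a', 'b', 'x'])
mapd = {'a': 'd', 'b': 'c'}

mapped = s.map(mapd)               # values without a key become NaN
restored = s.map(mapd).fillna(s)   # fall back to the original value instead

print(mapped.tolist())
print(restored.tolist())           # ['d', 'c', 'x']
```

Without the `fillna` step, unmapped values would be lost as NaN rather than kept as they were.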

Timings:

df = pd.DataFrame({'one':['a' , 'b']*5000000, 
                   'two':['c' , 'd']*5000000, 
                   'three':['a' , 'd']*5000000})

%%timeit
for col in cols:
    df[col].map(mapd).fillna(df[col])
1 loop, best of 3: 1.71 s per loop

%%timeit
for col in cols:
    colSet = set(df[col].values)
    colMap = {k:v for k,v in mapd.items() if k in colSet}
    df.replace(to_replace={col:colMap})
1 loop, best of 3: 3.35 s per loop


%timeit df[cols].stack().map(mapd).unstack()
1 loop, best of 3: 9.18 s per loop
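
The `%timeit` magics above only work inside IPython. As a rough sketch, the same comparison can be run as a plain script with the standard-library `timeit` module. The function names and the much smaller frame size are assumptions made here so the script finishes quickly. Absolute numbers will differ from those quoted above and depend on the machine and pandas version.

```python
import timeit
import pandas as pd

# Smaller frame than in the answer so the sketch runs in a moment.
df = pd.DataFrame({'one': ['a', 'b'] * 5000,
                   'two': ['c', 'd'] * 5000,
                   'three': ['a', 'd'] * 5000})
cols = ['one', 'two']
mapd = {'a': 'd', 'b': 'c', 'c': 'b', 'd': 'a'}

def with_map():
    # Series.map per column, keeping unmapped values
    for col in cols:
        df[col].map(mapd).fillna(df[col])

def with_replace():
    # DataFrame.replace with the mapping restricted to values present in the column
    for col in cols:
        present = set(df[col].values)
        col_map = {k: v for k, v in mapd.items() if k in present}
        df.replace(to_replace={col: col_map})

def with_stack():
    # reshape to one long Series, map once, reshape back
    df[cols].stack().map(mapd).unstack()

# Best of three single runs for each approach, in seconds.
timings = {f.__name__: min(timeit.repeat(f, number=1, repeat=3))
           for f in (with_map, with_replace, with_stack)}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```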

2

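Passing the whole map to a column that contains only the values 'a' and 'b' is inefficient. First check which values are present in the df column, then map only those, like this: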

>>> cols = ['one', 'two'];
>>> map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'};

>>> for col in cols:
...  colSet = set(df[col].values);
...  colMap = {k:v for k,v in map.items() if k in colSet};
...  df.replace(to_replace={col:colMap},inplace=True);#not efficient like rly
...  
>>> df
  one three two
0   d     a   b
1   c     d   a
2   d     a   b
3   c     d   a
4   d     a   b
5   c     d   a
6   d     a   b
7   c     d   a
8   d     a   b
9   c     d   a
>>>
#OR
In [12]: %%timeit
...: for col in cols:
...:  colSet = set(df[col].values);
...:  colMap = {k:v for k,v in map.items() if k in colSet};
...:  df[col].map(colMap)
...:
...:
1 loop, best of 3: 1.93 s per loop 
#OR WHEN INPLACE
In [8]: %%timeit
   ...: for col in cols:
   ...:  colSet = set(df[col].values);
   ...:  colMap = {k:v for k,v in map.items() if k in colSet};
   ...:  df[col]=df[col].map(colMap)
   ...:
   ...:
1 loop, best of 3: 2.18 s per loop

This is also possible:
df = pd.DataFrame({'one':['a' , 'b']*5, 'two':['c' , 'd']*5, 'three':['a' , 'd']*5})
map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'}
cols = ['one','two']

def func(s):
    if s.name in cols:
        s=s.map(map)
    return s

print(df.apply(func))

Also watch out for overlapping keys (i.e., if you want to change a to b and b to c in parallel, rather than chained as a->b->c)...

>>> cols = ['one', 'two'];
>>> map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'};
>>> mapCols = {k:map for k in cols};
>>> df.replace(to_replace=mapCols,inplace=True);
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Q:\Miniconda3\envs\py27a\lib\site-packages\pandas\core\generic.py", line 3352, in replace
    raise ValueError("Replacement not allowed with "
ValueError: Replacement not allowed with overlapping keys and values
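
By contrast, Series.map() looks each element up exactly once, so keys that also appear as values are handled "in parallel" rather than chained. A minimal sketch with a made-up mapping (`chain` and the three-element Series are illustrative, not from the question):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'])
chain = {'a': 'b', 'b': 'c'}   # 'b' is both a key and a value

# Each element is looked up once: 'a' becomes 'b' and is not mapped again to 'c'.
result = s.map(chain).fillna(s)
print(result.tolist())  # ['b', 'c', 'c']
```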

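This is two times slower than the inefficient one. - ayhan
I did not check it carefully (it was only a guess; the logic should be sound, but my implementation may simply not be fast enough ;/). Is df.replace the part that is not efficient? - yourstruly
Replace is generally slower (even when you apply it to the whole DataFrame instead of looping over the columns with map) because map is more specific and limited. I don't think the difference comes from your implementation. What makes you think the actual implementation of Series.map() wastes time going through keys that are not present? - ayhan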

1
df = pd.DataFrame({'one':['a' , 'b']*5, 'two':['c' , 'd']*5, 'three':['a' , 'd']*5})
m = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'}

cols = ['one', 'two']
df[cols] = df[cols].stack().map(m).unstack()
df
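
To see what the one-liner does, here is a sketch that splits it into its intermediate steps on a two-row version of the frame. The names `long`, `mapped` and `wide` are introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': ['a', 'b'], 'two': ['c', 'd'], 'three': ['a', 'd']})
m = {'a': 'd', 'b': 'c', 'c': 'b', 'd': 'a'}
cols = ['one', 'two']

long = df[cols].stack()   # Series indexed by (row label, column name)
mapped = long.map(m)      # a single lookup pass over all selected cells
wide = mapped.unstack()   # back to one column per original column
df[cols] = wide           # 'three' is left untouched

print(long)
print(df)
```

The reshaping is what makes this variant the slowest in the timings from the first answer, so it trades speed for brevity.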


