为了提高性能,你最好使用NumPy数组并使用np.where
函数 -
a = df.values
df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
运行时间测试
def numpy_based(df):
a = df.values
df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
时间 -
In [271]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=[['A','B']])
In [272]: %timeit numpy_based(df)
1000 loops, best of 3: 380 µs per loop
In [273]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=[['A','B']])
In [274]: %timeit df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.39 ms per loop
In [275]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=[['A','B']])
In [276]: %timeit df['C'] = np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 1.12 ms per loop
In [277]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=[['A','B']])
In [278]: %timeit df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 1.19 ms per loop
仔细观察
让我们仔细观察NumPy的数值计算能力,并将其与pandas进行比较 -
# Extract out as array (its a view, so not really expensive
# .. as compared to the later computations themselves)
In [291]: a = df.values
In [296]: %timeit df.values
10000 loops, best of 3: 107 µs per loop
案例#1:使用NumPy数组并使用numpy.where:
In [292]: %timeit np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
10000 loops, best of 3: 86.5 µs per loop
再次,将值赋给新的列df['C']
也不会很昂贵 -
In [300]: %timeit df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
1000 loops, best of 3: 323 µs per loop
案例#2:与 Pandas 数据框架一起使用其 .where
方法(不使用 NumPy)
In [293]: %timeit df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.4 ms per loop
案例 #3:使用pandas数据帧(不使用NumPy数组),但使用numpy.where
-
In [294]: %timeit np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 764 µs per loop
案例#4:再次使用pandas dataframe(而不是NumPy数组),但使用numpy.where
-
In [295]: %timeit np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 830 µs per loop
numpy.where
。 - IanS