Pandas performance: column selection

Question

7

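Today I found that selecting two or more columns of a DataFrame can be much slower than selecting only one.

If I select multiple columns with loc or iloc and pass the column names or indices as a list, performance drops by a factor of 100 or more (roughly 1,000x in the timings below) compared with selecting a single column, or selecting several columns with iloc without passing a list.

Example: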

df = pd.DataFrame(np.random.randn(10**7,10), columns=list('abcdefghij'))

Single-column selection:

%%timeit -n 100
df['b']
3.17 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
df.iloc[:,1]
66.7 µs ± 5.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
df.loc[:,'b']
44.2 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Two-column selection:

%%timeit -n 10
df[['b', 'c']]
96.4 ms ± 788 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df.loc[:,['b', 'c']]
99.4 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df.iloc[:,[1,2]]
97.6 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Only this selection works as expected: [EDIT]

%%timeit -n 100
df.iloc[:,1:3]
103 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

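What is the difference between these mechanisms, and why is the gap so large?

[EDIT]: As @run-out pointed out, a pd.Series seems to be handled faster than a pd.DataFrame. Does anyone know why that is?

On the other hand, that does not explain the difference between df.iloc[:,[1,2]] and df.iloc[:,1:3].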
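
For readers who want to reproduce the comparison outside IPython, here is a minimal sketch using the standard timeit module. The smaller frame size (10**5 rows) and the repeat counts are assumptions chosen so that the script finishes quickly; absolute numbers will differ from the ones quoted above and between pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than the 10**7-row one in the question, so the script runs quickly.
df = pd.DataFrame(np.random.randn(10**5, 10), columns=list("abcdefghij"))

# Each entry selects column 'b' alone, or columns 'b' and 'c', in a different way.
statements = {
    "df['b']": lambda: df["b"],
    "df.iloc[:, 1:3]": lambda: df.iloc[:, 1:3],
    "df.iloc[:, [1, 2]]": lambda: df.iloc[:, [1, 2]],
    "df.loc[:, ['b', 'c']]": lambda: df.loc[:, ["b", "c"]],
    "df[['b', 'c']]": lambda: df[["b", "c"]],
}

timings = {}
for label, fn in statements.items():
    # Best of 5 repeats of 20 calls each, reported per call in microseconds.
    best = min(timeit.repeat(fn, number=20, repeat=5)) / 20
    timings[label] = best * 1e6
    print(f"{label:<24} {timings[label]:10.1f} µs")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the fast cases are only a few microseconds long.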

- Daniel R

1

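You are selecting only one column with df.iloc[:, 1:2]. - Mohit Motwani

I think this has to do with loc and iloc slicing both rows and columns. That is why, if you slice the rows with the [] operator before selecting the column with the [] operator, you get timings similar to loc and iloc: df[:]['b']. I also imagine the time difference is related to whether a Series or a DataFrame is returned. - It_is_Chris

@MohitMotwani Thanks, corrected, but the results did not change. - Daniel R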

3 Answers

4

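I found that this may be rooted in numpy.

numpy has two kinds of indexing:

basic indexing (e.g. a[1:3])
advanced indexing (e.g. a[[1,2]])

According to the documentation,

Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).

So if you check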

a=df.values
%timeit -n2 a[:,0:3]
%timeit -n2 a[:,[0,1,2]]

you get

The slowest run took 5.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1.57 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 2 loops each)
188 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

So the numpy behavior shown here is quite similar to what the pandas DataFrame does.
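
The view-versus-copy distinction can also be checked directly, without timing anything. The sketch below uses np.shares_memory on a small array; the array shape is an arbitrary choice for illustration.

```python
import numpy as np

a = np.random.randn(1000, 10)

sliced = a[:, 1:3]     # basic indexing: a view onto a's buffer
fancy = a[:, [1, 2]]   # advanced indexing: a freshly allocated copy

print(np.shares_memory(a, sliced))    # True
print(np.shares_memory(a, fancy))     # False
print(np.array_equal(sliced, fancy))  # True, the values are the same either way

# Writing through the view changes the original array; the copy is unaffected.
sliced[0, 0] = 42.0
print(a[0, 1])      # 42.0
print(fancy[0, 0])  # still the old value
```

Creating a view only sets up new shape and stride metadata, so its cost does not depend on the number of rows, whereas the copy has to move every selected element.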

- user15964

0

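I can only recommend the cudf library, which is basically a port of pandas to Nvidia GPUs. In contrast to pandas, most operations are highly parallelized and therefore very fast.

Note, however, that you need an Nvidia GPU; support starts with the GTX 10xx series.

Column slicing is very fast. I will provide benchmark results when I have time.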

- Oleg

- run-out · Accepted Answer

Pandas handles a single row or column as a pandas.Series, which is faster than working within the DataFrame structure.

Pandas works with a pandas.Series when you ask for:

%%timeit -n 10
df['b']
2.31 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, if I put the same column name in a list, I get a DataFrame for that column instead. Then you get:

%%timeit -n 10
df[['b']]
90.7 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

You can see from the above that the Series outperforms the DataFrame.

Here is how Pandas treats column 'b' in each case.

type(df['b'])
pandas.core.series.Series

type(df[['b']])
pandas.core.frame.DataFrame

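EDIT: Since the OP wants to dig deeper into why pd.Series is faster than pd.DataFrame, I am expanding my answer. This is also a great question for extending our understanding of how the underlying technology works. Those with more expertise, please chime in.

First, let's start with numpy, since it is a building block of pandas. According to Wes McKinney, author of pandas and of Python for Data Analysis, numpy offers a performance gain over pure Python: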

This is based partly on performance differences having to do with the
cache hierarchy of the CPU; operations accessing contiguous blocks of memory (e.g.,
summing the rows of a C order array) will generally be the fastest because the
memory subsystem will buffer the appropriate blocks of memory into the ultrafast L1 or
L2 CPU cache.

Let's look at the speed difference for this example. We create a numpy array from column 'b' of the DataFrame.

a = np.array(df['b'])

Now run the performance test:

%%timeit -n 10
a

The result:

32.5 ns ± 28.2 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)

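Compared with the 2.31 µs measured for the pd.Series, that is a serious performance gain. (Note that this timing only evaluates the name a and does not index into the array, so the two numbers do not measure the same operation.)

Another major reason for the performance gain is that numpy indexing goes straight into the NumPy C extension, whereas indexing into a Series involves a lot of Python-level work, which makes it slower. (See the linked article.)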
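
To compare like with like, the following sketch times one element lookup on a numpy array against the same lookup on a Series. The array length, the position 500, and the call count are arbitrary assumptions; only the relative difference is of interest, and no particular ratio is claimed.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(10**5))
a = s.to_numpy()

n = 10_000
# Time one positional lookup on the raw array and on the Series wrapper.
t_array = timeit.timeit(lambda: a[500], number=n) / n
t_series = timeit.timeit(lambda: s.iloc[500], number=n) / n

print(f"ndarray a[500]       {t_array * 1e9:8.0f} ns")
print(f"Series  s.iloc[500]  {t_series * 1e9:8.0f} ns")
```

Both lookups return the same value; any gap between the two timings is the cost of the extra Python-level layers around the array.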

Now let's look at why:

df.iloc[:,1:3]

so greatly outperforms:

df.iloc[:,[1,2]]

Interestingly, .loc shows the same performance impact as .iloc in this scenario.

Our first clue that something is off is in the following code:

df.iloc[:,1:3] is df.iloc[:,[1,2]]
False

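These two expressions give the same result, but they are different objects. I did a deep dive to find out what the difference is, but I could not find a reference for it on the internet or in my library of books.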
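
One caveat about that clue: the is operator compares object identity, and every indexing call builds a new object, so the comparison is False even when the same indexer is used twice. A more telling check is whether each result shares memory with the original frame. The sketch below is a hedged illustration on a small frame; whether the slice result shares memory depends on pandas internals and version, so no expected output is claimed for the last two lines.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10), columns=list("abcdefghij"))

# Two separate calls always build two separate objects, so `is` is False
# even when the indexer is identical.
same_call_twice = df.iloc[:, 1:3] is df.iloc[:, 1:3]
print(same_call_twice)  # False

sliced = df.iloc[:, 1:3]
listed = df.iloc[:, [1, 2]]
print(sliced.equals(listed))  # True: same labels and values

# Whether each result shares its buffer with df depends on pandas internals.
whole = df.to_numpy()
print(np.shares_memory(whole, sliced.to_numpy()))
print(np.shares_memory(whole, listed.to_numpy()))
```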

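By looking at the source code, we can begin to see some differences. I am referring to indexing.py.

In the _iLocIndexer class, we can see that pandas does some extra work when iloc is given a list.

Right away, we run into these two differences when the input is checked: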

if isinstance(key, slice):
            return

versus

elif is_list_like_indexer(key):
            # check that the key does not exceed the maximum size of the index
            arr = np.array(key)
            l = len(self.obj._get_axis(axis))

            if len(arr) and (arr.max() >= l or arr.min() < -l):
                raise IndexError("positional indexers are out-of-bounds")

Could this alone be enough to cause the slowdown? I am not sure.
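
One way to approach that question is to time the check in isolation. The function below, check_positional_bounds, is a hypothetical stand-in that mirrors the list branch quoted above; it is not the pandas implementation, and its name and signature are assumptions made for this sketch.

```python
import timeit

import numpy as np


def check_positional_bounds(key, axis_length):
    """Hypothetical stand-in for the list branch quoted above."""
    arr = np.array(key)
    # Reject positions outside [-axis_length, axis_length - 1].
    if len(arr) and (arr.max() >= axis_length or arr.min() < -axis_length):
        raise IndexError("positional indexers are out-of-bounds")


n = 10_000
per_call = timeit.timeit(lambda: check_positional_bounds([1, 2], 10), number=n) / n
print(f"{per_call * 1e6:.2f} µs per check")
```

The work done here depends on the length of the key (two entries), not on the number of rows in the frame, so on its own it cannot grow with a 10**7-row DataFrame.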

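Although .loc is slightly different, it also suffers in performance when given a list of values. In indexing.py, look at def _getitem_axis(self, key, axis=None) in class _LocIndexer(_LocationIndexer).

The section of code under is_list_like_indexer(key), which handles list inputs, is quite lengthy and carries a lot of overhead. It contains this note: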

# convert various list-like indexers
# to a list of keys
# we will use the *values* of the object
# and NOT the index if its a PandasObject

Handling a list of values or integers adds enough overhead, compared with direct slicing, that it may well cause the processing delay.
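
When the wanted columns are adjacent, .loc also accepts a label slice, which goes through the slice branch rather than the list branch. The sketch below only shows that the two forms select the same data; whether the slice form is faster in a given pandas version is an assumption worth timing on your own data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10), columns=list("abcdefghij"))

by_label_slice = df.loc[:, "b":"c"]      # label slices include both endpoints
by_label_list = df.loc[:, ["b", "c"]]    # list of labels

print(list(by_label_slice.columns))          # ['b', 'c']
print(by_label_slice.equals(by_label_list))  # True
```

Unlike positional slices, a label slice includes its end label, so "b":"c" corresponds to the positional slice 1:3.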

The rest of the code is above my pay grade. If anyone can take a look and chime in, it would be most welcome.