Pandas DataFrame中反转列顺序的Big O复杂度是什么？

Question

Pandas DataFrame中反转列顺序的Big O复杂度是什么？

15

假设我在pandas中有一个包含m行n列的DataFrame。我想要反转这些列的顺序，可以使用以下代码实现：

df_reversed = df[df.columns[::-1]]

这个操作的时间复杂度是什么？我猜这将取决于列数，但也会取决于行数吗？

- Tim Holdsworth

5

自己的备忘录：设置这样的测试时，要选择最低级别的数量级 :) 导致笔记本电脑崩溃。 - roganjosh

1

如果你追求性能，使用切片df.iloc[:,::-1]，它返回一个视图，因此几乎是免费的，而不是使用df[df.columns[::-1]]，后者会创建一个副本，因为你在索引。 - Divakar

2

@Divakar，一般来说，这只适用于iloc，还是loc也返回视图？可能超出了单个评论的范围，但我也对为什么通过df[col_list]进行直接索引应该返回副本感兴趣（这是设计选择/副作用/是否有任何好处？）。 - jpp

@divakar 如果我返回一个视图，我还能对其进行操作，然后再次反转列的顺序，并最终得到应用了操作的原始数据框吗？ - Tim Holdsworth

1

@TimHoldsworth 一旦您执行操作，就会创建一个副本。 - Divakar

3个回答

4

截至Pandas 0.24，Big O复杂度为m*n，其中m是列数，n是行数。请注意，这是使用DataFrame.__getitem__方法（也称[]）与Index一起使用时的情况（请参见相关代码，其他类型将触发复制）。

以下是有用的堆栈跟踪：

 <ipython-input-4-3162cae03863>(2)<module>()
      1 columns = df.columns[::-1]
----> 2 df_reversed = df[columns]

  pandas/core/frame.py(2682)__getitem__()
   2681             # either boolean or fancy integer index
-> 2682             return self._getitem_array(key)
   2683         elif isinstance(key, DataFrame):

  pandas/core/frame.py(2727)_getitem_array()
   2726             indexer = self.loc._convert_to_indexer(key, axis=1)
-> 2727             return self._take(indexer, axis=1)
   2728 

  pandas/core/generic.py(2789)_take()
   2788                                    axis=self._get_block_manager_axis(axis),
-> 2789                                    verify=True)
   2790         result = self._constructor(new_data).__finalize__(self)

  pandas/core/internals.py(4539)take()
   4538         return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
-> 4539                                     axis=axis, allow_dups=True)
   4540 

  pandas/core/internals.py(4421)reindex_indexer()
   4420             new_blocks = self._slice_take_blocks_ax0(indexer,
-> 4421                                                      fill_tuple=(fill_value,))
   4422         else:

  pandas/core/internals.py(1254)take_nd()
   1253             new_values = algos.take_nd(values, indexer, axis=axis,
-> 1254                                        allow_fill=False)
   1255         else:

> pandas/core/algorithms.py(1658)take_nd()
   1657     import ipdb; ipdb.set_trace()
-> 1658     func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
   1659                                  mask_info=mask_info)
   1660     func(arr, indexer, out, fill_value)

在pandas/core/algorithms的L1660上调用func最终会调用一个复杂度为O(m * n)的cython函数。这是将原始数据复制到out中的地方。 out包含原始数据的逆序副本。

    inner_take_2d_axis0_template = """\
    cdef:
        Py_ssize_t i, j, k, n, idx
        %(c_type_out)s fv

    n = len(indexer)
    k = values.shape[1]

    fv = fill_value

    IF %(can_copy)s:
        cdef:
            %(c_type_out)s *v
            %(c_type_out)s *o

        #GH3130
        if (values.strides[1] == out.strides[1] and
            values.strides[1] == sizeof(%(c_type_out)s) and
            sizeof(%(c_type_out)s) * n >= 256):

            for i from 0 <= i < n:
                idx = indexer[i]
                if idx == -1:
                    for j from 0 <= j < k:
                        out[i, j] = fv
                else:
                    v = &values[idx, 0]
                    o = &out[i, 0]
                    memmove(o, v, <size_t>(sizeof(%(c_type_out)s) * k))
            return

    for i from 0 <= i < n:
        idx = indexer[i]
        if idx == -1:
            for j from 0 <= j < k:
                out[i, j] = fv
        else:
            for j from 0 <= j < k:
                out[i, j] = %(preval)svalues[idx, j]%(postval)s
"""

请注意，在上述模板函数中，有一个使用了memmove的路径（在这种情况下采取的路径是因为我们将int64映射到int64，并且输出的维度与输入的维度相同，只是交换了索引）。请注意，memmove仍然是O(n)，与它需要复制的字节数成比例，尽管比直接写入索引要快得多。

- akosel

1

我已经添加了更多的示例，展示了在调用__getitem__和cython函数后所采取的路径的更多上下文，这最终是大量时间花费的地方。__getitem__中的逻辑并不总是直观的，但我发现这个GH问题对于解释不同输入在幕后执行的操作非常有帮助 https://github.com/pandas-dev/pandas/issues/9595。 - akosel

谢谢，这很大程度上解释了我们所看到的行为。 - jpp

-1

我使用big_O拟合库在这里进行经验测试。

注意：所有测试都是在独立变量上进行的，其范围涵盖了6个数量级（即：

rows从10到10^6，而column大小恒定为3;

columns从10到10^6，而row大小恒定为10。

结果显示，在DataFrame中，columns反转操作.columns[::-1]的复杂度为

立方级：O(n^3)，其中n是rows的数量
立方级：O(n^3)，其中n是columns的数量

前提条件：您需要使用终端命令pip install big_o安装big_o()。代码

import big_o
import pandas as pd
import numpy as np

SWEAP_LOG10 = 6
COLUMNS = 3
ROWS = 10

def build_df(rows, columns):
    # To isolated the creation of the DataFrame from the inversion operation.
    narray = np.zeros(rows*columns).reshape(rows, columns)
    df = pd.DataFrame(narray)
    return df

def flip_columns(df):
    return df[df.columns[::-1]]

def get_row_df(n, m=COLUMNS):
    return build_df(1*10**n, m)

def get_column_df(n, m=ROWS):
    return build_df(m, 1*10**n)


# infer the big_o on columns[::-1] operation vs. rows
best, others = big_o.big_o(flip_columns, get_row_df, min_n=1, max_n=SWEAP_LOG10,n_measures=SWEAP_LOG10, n_repeats=10)

# print results
print('Measuring .columns[::-1] complexity against rapid increase in # rows')
print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)

for class_, residual in others.items():
    print('{:<60s}  (res: {:.2G})'.format(str(class_), residual))

print('-'*80)

# infer the big_o on columns[::-1] operation vs. columns
best, others = big_o.big_o(flip_columns, get_column_df, min_n=1, max_n=SWEAP_LOG10,n_measures=SWEAP_LOG10, n_repeats=10)

# print results
print()
print('Measuring .columns[::-1] complexity against rapid increase in # columns')
print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)

for class_, residual in others.items():
    print('{:<60s}  (res: {:.2G})'.format(str(class_), residual))
    
print('-'*80)

结果

Measuring .columns[::-1] complexity against rapid increase in # rows
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
--------------------------------------------------------------------------------
Constant: time = 0.032                                        (res: 0.021)
Linear: time = -0.051 + 0.024*n                               (res: 0.011)
Quadratic: time = -0.026 + 0.0038*n^2                         (res: 0.0077)
Cubic: time = -0.017 + 0.00067*n^3                            (res: 0.0052)
Polynomial: time = -6.3 * x^1.5                               (res: 6)
Logarithmic: time = -0.026 + 0.053*log(n)                     (res: 0.015)
Linearithmic: time = -0.024 + 0.012*n*log(n)                  (res: 0.0094)
Exponential: time = -7 * 0.66^n                               (res: 3.6)
--------------------------------------------------------------------------------


Measuring .columns[::-1] complexity against rapid increase in # columns
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.28 + 0.009*n^3
--------------------------------------------------------------------------------
Constant: time = 0.38                                         (res: 3.9)
Linear: time = -0.73 + 0.32*n                                 (res: 2.1)
Quadratic: time = -0.4 + 0.052*n^2                            (res: 1.5)
Cubic: time = -0.28 + 0.009*n^3                               (res: 1.1)
Polynomial: time = -6 * x^2.2                                 (res: 16)
Logarithmic: time = -0.39 + 0.71*log(n)                       (res: 2.8)
Linearithmic: time = -0.38 + 0.16*n*log(n)                    (res: 1.8)
Exponential: time = -7 * 1^n                                  (res: 9.7)
--------------------------------------------------------------------------------

- helcode

@MadPhysicist，您能详细说明一下“小数量订单的开销”吗？ - helcode

2

另一个答案做得非常好。 - Mad Physicist

O(n)是线性的，即列表搜索，O(n^c)是多项式的，即嵌套循环和递归循环。它们是两个不同的概念。 - helcode

1

你只是对有限数量的点进行拟合，而低端误差使你偏离了轨道。另一个答案实际上是在看大O符号。我严重怀疑pandas中的任何基本操作都不是O(n^3)。 - Mad Physicist

Big O需要考虑最坏情况，但我认为你说得对。我会重新配置测试参数，看看是否能得出不同的答案。 - helcode

显示剩余2条评论

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Shovalt · Accepted Answer

我不知道Pandas如何实现这个功能，但我已经进行了实验测试。我在Jupyter笔记本中运行了以下代码以测试操作速度：

def get_dummy_df(n):
    return pd.DataFrame({'a': [1,2]*n, 'b': [4,5]*n, 'c': [7,8]*n})

df = get_dummy_df(100)
print df.shape
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(1000)
print df.shape
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(10000)
print df.shape
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(100000)
print df.shape
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(1000000)
print df.shape
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(10000000)
print df.shape
%timeit df_r = df[df.columns[::-1]]

输出结果为：

(200, 3)
1000 loops, best of 3: 419 µs per loop
(2000, 3)
1000 loops, best of 3: 425 µs per loop
(20000, 3)
1000 loops, best of 3: 498 µs per loop
(200000, 3)
100 loops, best of 3: 2.66 ms per loop
(2000000, 3)
10 loops, best of 3: 25.2 ms per loop
(20000000, 3)
1 loop, best of 3: 207 ms per loop

如您所见，前三个案例中，操作的开销占据了大部分时间（400-500微秒），但从第四个案例开始，所需时间开始与数据量成正比，每次增加一个数量级。

因此，假设也存在与 n 成比例的比例，似乎我们正在处理 O(m*n)。