Pandas Groupby Agg Function Does Not Reduce

Question

24

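I have been using an aggregation function that I have relied on in my work for a long time. The idea is that if the Series passed to the function has length 1 (i.e. the group contains only one observation), that observation is returned. If the Series has length greater than 1, the observations are returned as a list.

This may seem odd to some, but it is not an XY problem. I have good reasons for wanting to do this, and they are not relevant to this question.

This is the function I have been using: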

def MakeList(x):
    """ This function is used to aggregate data that needs to be kept distinct within multi day
        observations for later use and transformation. It makes a list of the data and if the list is of length 1
        then there is only one line/day observation in that group so the single element of the list is returned. 
        If the list is longer than one then there are multiple line/day observations and the list itself is 
        returned."""
    L = x.tolist()
    if len(L) > 1:
        return L
    else:
        return L[0]

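For some reason, with the data set I am currently working on, I get a ValueError stating that the function does not reduce. Here is some test data and the remaining steps I am using: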

import pandas as pd
DF = pd.DataFrame({'date': ['2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02'],
                    'line_code':   ['401101',
                                    '401101',
                                    '401102',
                                    '401103',
                                    '401104',
                                    '401105',
                                    '401105',
                                    '401106',
                                    '401106',
                                    '401107'],
                    's.m.v.': [ 7.760,
                                25.564,
                                25.564,
                                9.550,
                                4.870,
                                7.760,
                                25.564,
                                5.282,
                                25.564,
                                5.282]})
DFGrouped = DF.groupby(['date', 'line_code'], as_index = False)
DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})

To try to debug this, I put in some print statements, print L and print x.index, and the output was as follows:

[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')
[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')

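For some reason it appears that agg is passing the Series to the function twice. As far as I know this is not normal at all, and it is presumably the reason why my function does not reduce.

For example, if I write a function like this: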

def test_func(x):
    print(x.index)
    return x.iloc[0]

This runs without a problem. The call and the output of the print statement are as follows:

DF_Agg = DFGrouped.agg({'s.m.v.' : test_func})

Int64Index([0, 1], dtype='int64')
Int64Index([2], dtype='int64')
Int64Index([3], dtype='int64')
Int64Index([4], dtype='int64')
Int64Index([5, 6], dtype='int64')
Int64Index([7, 8], dtype='int64')
Int64Index([9], dtype='int64')

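This shows that each group is passed to the function as a Series only once.

Can anyone help me understand why this fails? I have used this function successfully on many data sets...

Thanks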

- Woody Pride

2

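If your function sometimes returns a list and sometimes a single value, pandas may get confused, because the two cases lead to different dtypes. It is probably best not to do that. The double call may be related to the behavior described here: for "apply", the function is called twice on the first group to check whether it mutates the existing data. Reference: https://dev59.com/gWEi5IYBdhLWcg3wUa86 - BrenBarn

Hmm... maybe I should try setting it to object dtype. - Woody Pride

The strangest thing is that I have been reusing this code for a long time without any problem. I know that apply and transform pass different chunks of data, so it is hard to tell from print statements what is going on, but agg is fairly straightforward. Can you reproduce the error? - Woody Pride

I can reproduce the error, but I cannot reproduce it working. Your test_func reduces because it returns only a single value. Do you have a working example where the aggregation function returns a list? Did it ever work for you? - BrenBarn

Yes, it has worked for more than a year, ever since I wrote the damn thing, which is why I am so confused. I will try to generate some data on which it works. - Woody Pride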

1

An interesting workaround is to return tuple(L) instead of L. - Woody Pride

2 Answers

17

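This is a misfeature in DataFrame. If the aggregator returns a list for the first group, it fails with the error you mention; if it returns a non-list (non-Series) value for the first group, it works fine. The offending code is in groupby.py: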

def _aggregate_series_pure_python(self, obj, func):

    group_index, _, ngroups = self.group_info

    counts = np.zeros(ngroups, dtype=int)
    result = None

    splitter = get_splitter(obj, group_index, ngroups, axis=self.axis)

    for label, group in splitter:
        res = func(group)
        if result is None:
            if (isinstance(res, (Series, Index, np.ndarray)) or
                    isinstance(res, list)):
                raise ValueError('Function does not reduce')
            result = np.empty(ngroups, dtype='O')

        counts[label] = group.shape[0]
        result[label] = res

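Note the combination of if result is None and isinstance(res, list): the test is only applied to the result of the first group. You have two options:

1. Fool groupby().agg() so that it does not see a list for the first group, or
2. Do the aggregation yourself, using code like the above but without the erroneous test.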
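
As a rough illustration of the second option (doing the aggregation yourself), here is a minimal sketch that iterates over the groups directly instead of going through agg(), so no "does it reduce" test is ever applied. The names make_list, rows and result are made up for this example, and the data is a shortened copy of the test data from the question; this is not how pandas aggregates internally.

```python
import pandas as pd


def make_list(values):
    """Return the lone value for a one-row group, otherwise a list of all values."""
    items = list(values)
    return items if len(items) > 1 else items[0]


# Shortened version of the test data from the question.
df = pd.DataFrame({
    'date': ['2013-04-02'] * 4,
    'line_code': ['401101', '401101', '401102', '401103'],
    's.m.v.': [7.760, 25.564, 25.564, 9.550],
})

# Walk over the groups by hand and build one output row per group.
rows = []
for (date, line_code), group in df.groupby(['date', 'line_code']):
    rows.append({
        'date': date,
        'line_code': line_code,
        's.m.v.': make_list(group['s.m.v.']),
    })

result = pd.DataFrame(rows)
print(result)
```

The price is an explicit Python loop, which is slower than a built-in aggregation on large data, but the function is then free to return whatever it likes.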

- Nik Bates-Haus

2

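As the other answer explains, a tuple works perfectly. That is precisely because the function above does not check whether the object is a tuple. Bug or feature, you decide! - Ufos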
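
To make that point concrete, here is a small sketch of the type test in isolation. The helper rejected_by_first_group_check is a hypothetical, simplified stand-in written for this example: it only looks at list, whereas the excerpt above also rejects Series, Index and np.ndarray.

```python
def rejected_by_first_group_check(res):
    # Simplified stand-in for the isinstance test in the groupby.py excerpt.
    # Assumption: only the list case is modelled here.
    return isinstance(res, list)


values = [7.76, 25.564]

list_rejected = rejected_by_first_group_check(values)          # a list is caught
tuple_rejected = rejected_by_first_group_check(tuple(values))  # a tuple slips through

print(list_rejected, tuple_rejected)
```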

- paulo.filip3 · Accepted Answer

I cannot really explain why, but in my experience a list inside a pandas.DataFrame does not work all that well.

I usually use a tuple instead. This will work:

def MakeList(x):
    T = tuple(x)
    if len(T) > 1:
        return T
    else:
        return T[0]

DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})

     date line_code           s.m.v.
0  2013-04-02    401101   (7.76, 25.564)
1  2013-04-02    401102           25.564
2  2013-04-02    401103             9.55
3  2013-04-02    401104             4.87
4  2013-04-02    401105   (7.76, 25.564)
5  2013-04-02    401106  (5.282, 25.564)
6  2013-04-02    401107            5.282
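
If real lists are needed later on, one possible follow-up is to convert the tuples back after the aggregation has finished. This is only a sketch: the Series below is built by hand to stand in for the aggregated 's.m.v.' column, and the name as_lists is invented for the example.

```python
import pandas as pd

# Hand-built stand-in for the aggregated column shown above.
aggregated = pd.Series([(7.76, 25.564), 25.564, 9.55], dtype=object)

# Turn tuples back into lists and leave single values untouched.
as_lists = aggregated.map(lambda v: list(v) if isinstance(v, tuple) else v)

print(as_lists.tolist())
```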