GroupBy pandas DataFrame and select the most common value

191

I have a DataFrame with three string columns. I know that for each combination of the first two columns, exactly one value of the third column is valid. To clean the data, I have to group the DataFrame by the first two columns and select the most common value of the third column for each combination.

My code:

import pandas as pd
from scipy import stats

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'], 
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY']})

source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

The last line of code doesn't work; it raises KeyError: 'Short name', and if I try grouping only by City, I get an AssertionError. What can I do to fix it?

13 Answers

254

Pandas >= 0.16

pd.Series.mode is available!

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

If you need it as a DataFrame, use

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

What's useful about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.
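A quick illustration of the always-a-Series behaviour, run on the source frame above:

pd.Series.mode(source['Short name'])

0    NY
dtype: object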

# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dealing with multiple modes

Series.mode also does a good job when there are multiple modes:

# Note: DataFrame.append was removed in pandas 2.0;
# pd.concat is the equivalent on modern versions.
source2 = pd.concat(
    [source, pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
    ignore_index=True)

# Now `source2` has two modes for the 
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New
source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object

Alternatively, if you want a separate row for each mode, you can use GroupBy.apply:

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City               
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object

If you don't care which mode is returned as long as it's one of them, then you will need a lambda that calls mode and extracts the first result.

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

Alternatives to consider

You can also use statistics.mode from Python, but...

import statistics
source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

...it doesn't deal well with multiple modes; a StatisticsError is raised. This is mentioned in the docs:

StatisticsError is raised if data is empty, or if there is not exactly one most common value.

But you can see for yourself...

statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values
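Note that on Python 3.8 and later this demo no longer raises: statistics.mode now returns the first mode encountered, and statistics.multimode was added for the multi-mode case.

statistics.mode([1, 2])       # 1 on Python >= 3.8 (first mode encountered)
statistics.multimode([1, 2])  # [1, 2]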

1
This solution is much slower than a regular df.group_by. - seeker_after_truth
1
If your series contains np.nan, you may want to pass dropna=False to pd.Series.mode. I had some series that were all np.nan, which raised this error when aggregating: ValueError: Must produce aggregated value - 0not
1
@seeker Sorry, what do you mean by "regular" df.groupby? - wjandrea

200

You can use value_counts() to get a count Series, and select the first row:

source.groupby(['Country','City']).agg(lambda x: x.value_counts().index[0])

In case you are wondering how to perform other aggregation functions within the .agg(), try this:

# Let's add a new col, "account"
source['account'] = [1, 2, 3, 3]

source.groupby(['Country','City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'))
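Run on the frame above, this should give roughly:

                          mod  avg
Country City
Russia  Sankt-Petersburg  Spb  3.0
USA     New-York           NY  2.0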

26

Coming a little late to the game here, but I was running into some performance problems, so I had to come up with another solution.

It works by finding the frequency of each key-value pair, and then, for each key, keeping only the value that appears with it most often.

There is also an additional solution that supports multiple modes.

On a scale test representative of the data I am working with, this reduced the runtime from 37.4 s to 0.5 s!

Here is the code for the solution, some example usage, and the scale test:

import numpy as np
import pandas as pd
import random
import time

test_input = pd.DataFrame(columns=[ 'key',          'value'],
                          data=  [[ 1,              'A'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              np.nan ],
                                  [ 2,              np.nan ],
                                  [ 3,              'C'    ],
                                  [ 3,              'C'    ],
                                  [ 3,              'D'    ],
                                  [ 3,              'D'    ]])

def mode(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the mode.

    The output is a DataFrame with a record per group that has at least one mode
    (null values are not counted). The `key_cols` are included as columns, `value_col`
    contains a mode (ties are broken arbitrarily and deterministically) for each
    group, and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

def modes(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the modes.

    The output is a DataFrame with a record per group that has at least
    one mode (null values are not counted). The `key_cols` are included as
    columns, `value_col` contains lists indicating the modes for each group,
    and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .groupby(key_cols + [count_col])[value_col].unique() \
             .to_frame().reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

print(test_input)
print(mode(test_input, ['key'], 'value', 'count'))
print(modes(test_input, ['key'], 'value', 'count'))

scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)

start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print(time.time() - start)

Running this code prints something like:

   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537

Hope this helps!


19
For agg, the lambda function gets a Series, which does not have a 'Short name' attribute. stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.

With these two simple changes:
source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

returns

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
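For context, a short sketch of what stats.mode actually returns and why the double indexing is needed (the exact shape depends on your SciPy version):

from scipy import stats

stats.mode([1, 1, 2])
# Older SciPy: ModeResult(mode=array([1]), count=array([2])) -> stats.mode(x)[0][0] == 1
# SciPy >= 1.11 (keepdims=False by default): ModeResult(mode=1, count=2),
# where stats.mode(x)[0] alone already gives 1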

14

The two top answers here suggest:

df.groupby(cols).agg(lambda x:x.value_counts().index[0])
or, preferably,
df.groupby(cols).agg(pd.Series.mode)

However, both of these break in simple edge cases, as demonstrated here:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
    'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.NaN]
})

The first:

df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])

fails with an IndexError, because group C returns an empty Series. The second:

df.groupby(['client_id', 'date']).agg(pd.Series.mode)

fails with ValueError: Function does not reduce, since the first group returns two modes. (As noted here, if the first group returned a single mode, this would work!) Both failure modes are sketched below.
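A minimal sketch of both failure modes, using nothing beyond the frame above:

import numpy as np
import pandas as pd

# Group C: its only location is NaN, which value_counts drops,
# so there is no index[0] to take.
pd.Series([np.nan]).value_counts()   # empty Series -> .index[0] raises IndexError

# Group A: two tied modes, so Series.mode returns a length-2 Series,
# which .agg cannot reduce to a single cell.
pd.Series(['NY', 'NY', 'LA', 'LA']).mode()   # 0    LA
                                             # 1    NY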

There are two possible solutions for this case:

import scipy.stats
df.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])

And the solution given to me by cs95 here:

def foo(x):
    m = pd.Series.mode(x)
    return m.values[0] if not m.empty else np.nan
df.groupby(['client_id', 'date']).agg(foo)
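For the edge-case frame above, that call returns something like the following (ties resolve to the lexicographically smallest value, since Series.mode sorts its result):

                     location
client_id date
A         2019-01-01       LA
B         2019-01-01       DC
C         2019-01-01      NaN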

However, all of these methods are slow and unsuited for large datasets. The solution I ended up using, which both handles these cases and is much, much faster, is a lightly modified version of abw333's answer (which deserves to rank higher):

def get_mode_per_column(dataframe, group_cols, col):
    return (dataframe.fillna(-1)  # NaN placeholder to keep group 
            .groupby(group_cols + [col])
            .size()
            .to_frame('count')
            .reset_index()
            .sort_values('count', ascending=False)
            .drop_duplicates(subset=group_cols)
            .drop(columns=['count'])
            .sort_values(group_cols)
            .replace(-1, np.NaN))  # restore NaNs

group_cols = ['client_id', 'date']    
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
    output_df[col] = get_mode_per_column(df, group_cols, col)[col].values

Essentially, the method works on one column at a time and outputs a DataFrame, so instead of a heavy concat, you treat the result for the first column as a DataFrame and then iteratively add the output array (values.flatten()) as a new column of that DataFrame.


1
In pandas 1.4.3, I was able to run df.groupby(['client_id', 'date']).agg(pd.Series.mode) without getting the error ValueError: Function does not reduce. - Benjamin Ziepert

7

Formally, the correct answer is @eumiro's solution. The problem with @HYRY's solution is that when you have a sequence of numbers like [1, 2, 3, 4] there is no mode, and that solution gives a wrong answer. Example:

>>> import pandas as pd
>>> df = pd.DataFrame(
        {
            'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'], 
            'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 
            'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
        }
    )

If you compute the way @HYRY does, you get:

>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
        total  bla
client            
A           4   30
B           4   40
C           1   10
D           3   30
E           2   20

which is clearly wrong (see the A value, which should be 1 and not 4), because it cannot handle unique values; a sketch of why is below.
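A minimal sketch of the problem, in plain pandas: when every value occurs exactly once, value_counts has no meaningful winner, so .index[0] is an arbitrary pick rather than a mode.

import pandas as pd

s = pd.Series([1, 3, 2, 4])   # every value occurs once: there is no mode
s.value_counts()              # all counts are 1; the ordering of ties is an
                              # implementation detail, so .index[0] is arbitrary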

So the other solution is correct:

>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
        total  bla
client            
A           1   10
B           4   40
C           1   10
D           3   30
E           2   20

7

Use DataFrame.value_counts for a fast solution

The top three answers here suggest:

  • source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
  • source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
  • source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0])

But all of these are extremely slow for large datasets.

The solution using collections.Counter is much faster (20 to 40 times faster than the top three methods; it requires from collections import Counter):

  • source.groupby(['Country', 'City'])['Short name'].agg(lambda srs: Counter(list(srs)).most_common(1)[0][0])

but still slow.

The solutions by abw333 and Josh Friedlander are much faster (about 10 times faster than the method using Counter). These solutions can be further optimized by using value_counts instead (DataFrame.value_counts has been available since pandas 1.1.0):

source.value_counts(['Country', 'City', 'Short name']).pipe(lambda x: x[~x.droplevel('Short name').index.duplicated()]).reset_index(name='Count')

To make the function account for NaNs, like in Josh Friedlander's function, just turn off the dropna parameter:

source.value_counts(['Country', 'City', 'Short name'], dropna=False).pipe(lambda x: x[~x.droplevel('Short name').index.duplicated()]).reset_index(name='Count')

Using abw333's setup, if we test the runtime difference for a DataFrame with 1 million rows, value_counts is about 10% faster than abw333's solution:

scale_test_data = [[random.randint(1, 100),
                    str(random.randint(100, 900)), 
                    str(random.randint(0,2))] for i in range(1000000)]
source = pd.DataFrame(data=scale_test_data, columns=['Country', 'City', 'Short name'])
keys = ['Country', 'City']
vals = ['Short name']

%timeit source.value_counts(keys+vals).pipe(lambda x: x[~x.droplevel(vals).index.duplicated()]).reset_index(name='Count')
# 376 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit mode(source, ['Country', 'City'], 'Short name', 'Count')
# 415 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

For ease of use, I wrapped this solution in a function that you can easily copy-paste and use in your own environment. The function can also find group modes of multiple columns.

def get_groupby_modes(source, keys, values, dropna=True, return_counts=False):
    """
    A function that groups a pandas dataframe by some of its columns (keys) and 
    returns the most common value of each group for some of its columns (values).
    The output is sorted by the counts of the first column in values (because it
    uses pd.DataFrame.value_counts internally).
    An equivalent one-liner if values is a singleton list is:
    (
        source
        .value_counts(keys+values)
        .pipe(lambda x: x[~x.droplevel(values).index.duplicated()])
        .reset_index(name=f"{values[0]}_count")
    )
    If there are multiple modes for some group, it returns the value with the 
    lowest Unicode value (because under the hood, it drops duplicate indexes in a 
    sorted dataframe), unlike, e.g. df.groupby(keys)[values].agg(pd.Series.mode).
    Must have Pandas 1.1.0 or later for the function to work and must have 
    Pandas 1.3.0 or later for the dropna parameter to work.
    -----------------------------------------------------------------------------
    Parameters:
    -----------
    source: pandas dataframe.
        A pandas dataframe with at least two columns.
    keys: list.
        A list of column names of the pandas dataframe passed as source. It is 
        used to determine the groups for the groupby.
    values: list.
        A list of column names of the pandas dataframe passed as source. 
        If it is a singleton list, the output contains the mode of each group 
        for this column. If it is a list longer than 1, then the modes of each 
        group for the additional columns are assigned as new columns.
    dropna: bool, default: True.
        Whether to count NaN values as the same or not. If True, NaN values are 
        treated by their default property, NaN != NaN. If False, NaN values in 
        each group are counted as the same values (NaN could potentially be a 
        most common value).
    return_counts: bool, default: False.
        Whether to include the counts of each group's mode. If True, the output 
        contains a column for the counts of each mode for every column in values. 
        If False, the output only contains the modes of each group for each 
        column in values.
    -----------------------------------------------------------------------------
    Returns:
    --------
    a pandas dataframe.
    -----------------------------------------------------------------------------
    Example:
    --------
    get_groupby_modes(source=df, 
                      keys=df.columns[:2].tolist(), 
                      values=df.columns[-2:].tolist(), 
                      dropna=True,
                      return_counts=False)
    """
    
    def _get_counts(df, keys, v, dropna):
        c = df.value_counts(keys+v, dropna=dropna)
        return c[~c.droplevel(v).index.duplicated()]
    
    counts = _get_counts(source, keys, values[:1], dropna)
    
    if len(values) == 1:
        if return_counts:
            final = counts.reset_index(name=f"{values[0]}_count")
        else:
            final = counts.reset_index()[keys+values[:1]]
    else:
        final = counts.reset_index(name=f"{values[0]}_count", level=values[0])
        if not return_counts:
            final = final.drop(columns=f"{values[0]}_count")
        for v in values:
            counts = _get_counts(source, keys, [v], dropna).reset_index(level=v)
            if return_counts:
                final[[v, f"{v}_count"]] = counts
            else:
                final[v] = counts[v]
        final = final.reset_index()
    return final
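For instance, a usage sketch on the question's source frame (output shown approximately; row order follows the counts of the first value column):

get_groupby_modes(source, keys=['Country', 'City'], values=['Short name'])

  Country              City Short name
0     USA          New-York         NY
1  Russia  Sankt-Petersburg        Spb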

5
If you don't want to include NaN values, using Counter is much faster than pd.Series.mode or pd.Series.value_counts()[0]:

from collections import Counter

def get_most_common(srs):
    x = list(srs)
    my_counter = Counter(x)
    return my_counter.most_common(1)[0][0]

df.groupby(col).agg(get_most_common)

should work fine. It fails when you have NaN values, though, since each NaN will be counted separately.
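A quick sketch of that NaN pitfall: distinct NaN objects compare unequal to each other, so Counter gives each one its own key.

from collections import Counter

Counter([float('nan'), float('nan')])
# Counter({nan: 1, nan: 1})  <- two separate keys, because nan != nan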


3

Instead of using .agg, use the faster .apply, which gives results across the columns.

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short name' : ['NY','New','Spb','NY']})
source.groupby(['Country', 'City'])['Short name'].apply(pd.Series.mode).reset_index()

2

If you want another approach for solving it that does not depend on value_counts or scipy.stats, you can use the Counter collection:

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

which can be applied to the example above like this:

src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

src.groupby(['Country','City']).agg(get_most_common)
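For the example frame, this should produce something like:

                         Short_name
Country City
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY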
