Pandas: Find the most frequently occurring value in each row


I have a dataset of binary values and I want to find the most frequently occurring value in each row. The dataset has several million records. What is the most efficient way to do this? Here is a sample of the dataset:

import pandas as pd
data = pd.read_csv('myData.csv', sep = ',')
data.head()
bit1    bit2    bit2    bit4    bit5    frequent    freq_count
0       0       0       1       1       0           3
1       1       1       0       0       1           3
1       0       1       1       1       1           4

I want to create the frequent and freq_count columns shown in the example above. They are not part of the original dataset and should be created after looking at all the rows.
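For illustration, the desired two columns can also be sketched with pandas' built-in row-wise mode. This is only a minimal sketch, assuming data holds nothing but the bit columns; ties resolve toward the smaller value:

mode_vals = data.mode(axis=1)[0]                     # most frequent value per row
counts = data.eq(mode_vals, axis=0).sum(axis=1)      # how often it occurs in that row
data['frequent'] = mode_vals.astype(int)
data['freq_count'] = counts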

2 Answers

You can solve this with the scipy.stats.mode function:
from scipy import stats

a = df.values.T
b = stats.mode(a)
print(b)
ModeResult(mode=array([[0, 1, 1]], dtype=int64), count=array([[3, 3, 4]]))

df['frequent'] = b[0][0]
df['freq_count'] = b[1][0]
print (df)
   bit1  bit2  bit2.1  bit4  bit5  frequent  freq_count
0     0     0       0     1     1         0           3
1     1     1       1     0     0         1           3
2     1     0       1     1     1         1           4
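Note that the [0][0] indexing relies on older SciPy behaviour. On newer SciPy releases (1.11+, where keepdims defaults to False), stats.mode drops the reduced axis, so a minimal sketch of the same idea would look like this (again assuming df holds only the bit columns):

from scipy import stats

b = stats.mode(df.values, axis=1)   # newer SciPy: mode and count are 1-D, one entry per row
df['frequent'] = b.mode
df['freq_count'] = b.count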

Using the Counter.most_common method:
from collections import Counter

def f(x):
    a, b = Counter(x).most_common(1)[0]
    return pd.Series([a, b])

df[['frequent','freq_count']] = df.apply(f, axis=1)

Another solution:

import numpy as np

def f(x):
    # np.bincount expects non-negative integers, which holds for 0/1 data
    counts = np.bincount(x)
    a = np.argmax(counts)   # most frequent value
    b = np.max(counts)      # its count
    return pd.Series([a,b])

df[['frequent','freq_count']] = df.apply(f, axis=1)

An alternative solution:

from collections import defaultdict

def f(x):
    d = defaultdict(int)
    for i in x:
        d[i] += 1
    return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])


df[['frequent','freq_count']] = df.apply(f, axis=1)

Timings:

np.random.seed(100)
N = 10000
#[10000 rows x 20 columns]
df = pd.DataFrame(np.random.randint(2, size=(N,20)))

In [140]: %timeit df.apply(f1, axis=1)
1 loop, best of 3: 1.78 s per loop

In [141]: %timeit df.apply(f2, axis=1)
1 loop, best of 3: 1.66 s per loop

In [142]: %timeit df.apply(f3, axis=1)
1 loop, best of 3: 1.7 s per loop

In [143]: %timeit mod(df)
100 loops, best of 3: 8.37 ms per loop

In [144]: %timeit mod1(df)
100 loops, best of 3: 8.88 ms per loop

import numpy as np
from collections import Counter
from collections import defaultdict
from scipy import stats

def f1(x):
    a, b = Counter(x).most_common(1)[0]
    return pd.Series([a, b])

def f2(x):
    counts = np.bincount(x)
    a = np.argmax(counts)
    b = np.max(counts)
    return pd.Series([a,b])

def f3(x):
    d = defaultdict(int)
    for i in x:
        d[i] += 1
    return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])

def mod(df):
    # mode down each column of the transposed array == mode across each original row
    a = df.values.T
    b = stats.mode(a)

    df['a'] = b[0][0]
    df['b'] = b[1][0]
    return df

def mod1(df):
    # same idea without the transpose: take the mode along axis=1 directly
    a = df.values
    b = stats.mode(a, axis=1)

    df['a'] = b[0][:, 0]
    df['b'] = b[1][:, 0]
    return df

Here's one approach, exploiting the fact that the values are binary -
import numpy as np

def freq_stat(df):
    a = df.values
    zero_c = (a==0).sum(1)           # zeros per row
    one_c = a.shape[1] - zero_c      # ones per row
    df['frequent'] = (zero_c<=one_c).astype(int)
    df['freq_count'] = np.maximum(zero_c, one_c)
    return df

Sample run -

In [305]: df
Out[305]: 
   bit1  bit2  bit2.1  bit4  bit5
0     0     0       0     1     1
1     1     1       1     0     0
2     1     0       1     1     1

In [308]: freq_stat(df)
Out[308]: 
   bit1  bit2  bit2.1  bit4  bit5  frequent  freq_count
0     0     0       0     1     1         0           3
1     1     1       1     0     0         1           3
2     1     0       1     1     1         1           4

Benchmarking

Let's test this against the fastest approach from @jezrael's solution:

from scipy import stats

def mod(df): # @jezrael's best soln 
    a = df.values.T
    b = stats.mode(a)

    df['a'] = b[0][0]
    df['b'] = b[1][0]
    return df

Also, let's use the same setup as in the other post and get the timings -

In [323]: np.random.seed(100)
     ...: N = 10000
     ...: #[10000 rows x 20 columns]
     ...: df = pd.DataFrame(np.random.randint(2, size=(N,20)))
     ...: 

# @jezrael's soln 
In [324]: %timeit mod(df)
100 loops, best of 3: 5.92 ms per loop

# Proposed in this post
In [325]: %timeit freq_stat(df)
1000 loops, best of 3: 496 µs per loop
