高效地在 pandas DataFrame 列中找出连续的序列？

Question

高效地在 pandas DataFrame 列中找出连续的序列？

9

我有一个类似下面的DataFrame，并且想要在它上面添加一个“连胜”列（参见下面的示例）:

Date         Home_Team    Away_Team    Winner      Streak

2005-08-06       A            G           A           0
2005-08-06       B            H           H           0
2005-08-06       C            I           C           0
2005-08-06       D            J           J           0
2005-08-06       E            K           K           0
2005-08-06       F            L           F           0
2005-08-13       A            B           A           1           
2005-08-13       C            D           D           1           
2005-08-13       E            F           F           0        
2005-08-13       G            H           H           0
2005-08-13       I            J           J           0
2005-08-13       K            L           K           1
2005-08-20       B            C           B           0
2005-08-20       A            D           A           2
2005-08-20       G            K           K           0
2005-08-20       I            E           E           0
2005-08-20       F            H           F           2
2005-08-20       J            L           J           2
2005-08-27       A            H           A           3
2005-08-27       B            F           B           1
2005-08-27       J            C           C           3           
2005-08-27       D            E           D           0
2005-08-27       I            K           K           0
2005-08-27       L            G           G           0
2005-09-05       B            A           A           2
2005-09-05       D            C           D           1
2005-09-05       F            E           F           0
2005-09-05       H            G           H           0
2005-09-05       J            I           I           0
2005-09-05       K            L           K           4

这个DataFrame有大约20万行数据，时间跨度从2005年到2020年。

现在，我想要做的是，在DataFrame中找到在“Date”列中日期之前主队连续获胜的场次。我有一个解决方案，但速度太慢，如下所示：

df["Streak"] = 0
def home_streak(x): # x is a row of the DataFrame
    """Keep track of a team's winstreak"""
    home_team = x["Home_Team"]
    date = x["Date"]
    
    # all previous matches for the home team 
    home_df = df[(df["Home_Team"] == home_team) | (df["Away_Team"] == home_team)]
    home_df = home_df[home_df["Date"] <  date].sort_values(by="Date", ascending=False).reset_index()
    if len(home_df.index) == 0: # no previous matches for that team, so start streak at 0
        return 0
    elif home_df.iloc[0]["Winner"] != home_team: # lost the last match
        return 0
    else: # they won the last game
        winners = home_df["Winner"]
        streak = 0
        for i in winners.index:
            if home_df.iloc[i]["Winner"] == home_team:
                streak += 1
            else: # they lost, return the streak
                return streak

df["Streak"] = df.apply(lambda x: home_streak(x), axis = 1)

我该如何加快这个过程的速度？

- the man

所以你想按主队分组并计算连续胜利？ - Mad Physicist

如果A作为客队获胜会发生什么？如果它输了呢？这是否会继续/结束连胜纪录？还是信息会丢失？ - Mad Physicist

它会延续/结束连胜纪录。与连胜相关的是主队而不是客队，我只想要主队的连胜纪录。 - the man

1

明白了，我只是想确保在回答之前完全理解这个问题。最后一个问题：您是否接受使用 numpy 而不是 pandas 解决方案？ - Mad Physicist

嘿，请更正您的问题，它很令人困惑。这并不是主场连胜，而是连胜纪录。 - Suryaveer Singh

显示剩余15条评论

4个回答

3

优雅的方式：

new_df = (df.reset_index()
            .melt(['index', 'Date', 'Winner'])
            .assign(win=lambda x: x['value'].eq(x.Winner))
            .sort_values('Date')
            .assign(cum_wins=lambda x: x.groupby('value')['win'].cumsum())
            .assign(cum_wins_prev=lambda x: x.groupby('value')['cum_wins'].shift(fill_value=0))
            .pivot_table(index='index', values='cum_wins_prev', columns='variable')
            .add_prefix('Streak_')
         )
print(new_df)

variable  Streak_Away_Team  Streak_Home_Team
index                                       
0                      0.0               0.0
1                      0.0               0.0
2                      0.0               0.0
3                      0.0               0.0
4                      0.0               0.0
5                      0.0               0.0
6                      0.0               1.0
7                      0.0               1.0
8                      1.0               0.0
9                      1.0               0.0
10                     1.0               0.0
11                     0.0               1.0
12                     1.0               0.0
13                     1.0               2.0
14                     2.0               0.0
15                     0.0               0.0
16                     2.0               2.0
17                     0.0               2.0

#new_df = df.assign(**new_df) #you could use join or assign 
new_df = df.join(new_df) 
print(new_df)



          Date Home_Team Away_Team Winner  Streak_Away_Team  Streak_Home_Team
0   2005-08-06         A         G      A               0.0               0.0
1   2005-08-06         B         H      H               0.0               0.0
2   2005-08-06         C         I      C               0.0               0.0
3   2005-08-06         D         J      J               0.0               0.0
4   2005-08-06         E         K      K               0.0               0.0
5   2005-08-06         F         L      F               0.0               0.0
6   2005-08-13         A         B      A               0.0               1.0
7   2005-08-13         C         D      D               0.0               1.0
8   2005-08-13         E         F      F               1.0               0.0
9   2005-08-13         G         H      H               1.0               0.0
10  2005-08-13         I         J      J               1.0               0.0
11  2005-08-13         K         L      K               0.0               1.0
12  2005-08-20         B         C      B               1.0               0.0
13  2005-08-20         A         D      A               1.0               2.0
14  2005-08-20         G         K      K               2.0               0.0
15  2005-08-20         I         E      E               0.0               0.0
16  2005-08-20         F         H      F               2.0               2.0
17  2005-08-20         J         L      J               0.0               2.0

据了解，一个团队每天只能参与一次比赛。

时间

%%timeit
df["Streak"] = 0
def home_streak(x): # x is a row of the DataFrame
    """Keep track of a team's winstreak"""
    home_team = x["Home_Team"]
    date = x["Date"]
    
    # all previous matches for the home team 
    home_df = df[(df["Home_Team"] == home_team) | (df["Away_Team"] == home_team)]
    home_df = home_df[home_df["Date"] <  date].sort_values(by="Date", ascending=False).reset_index()
    if len(home_df.index) == 0: # no previous matches for that team, so start streak at 0
        return 0
    elif home_df.iloc[0]["Winner"] != home_team: # lost the last match
        return 0
    else: # they won the last game
        winners = home_df["Winner"]
        streak = 0
        for i in winners.index:
            if home_df.iloc[i]["Winner"] == home_team:
                streak += 1
            else: # they lost, return the streak
                return streak

df["Streak"] = df.apply(lambda x: home_streak(x), axis = 1)

66.2 ms ± 9.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit

new_df = (df.reset_index()
            .melt(['index', 'Date', 'Winner'])
            .assign(win=lambda x: x['value'].eq(x.Winner))
            .sort_values('Date')
            .assign(cum_wins=lambda x: x.groupby('value')['win'].cumsum())
            .assign(cum_wins_prev=lambda x: x.groupby('value')['cum_wins'].shift(fill_value=0))
            .pivot_table(index='index', values='cum_wins_prev', columns='variable')
            .add_prefix('Streak_')
         )
new_df=df.assign(**new_df)

29.5 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

- ansev

这个对我也不起作用，我的连胜值（100+）是错误的。 - the man

一个团队一天内可以参加多次比赛吗？ - ansev

不，每天最多一次。 - the man

1

我想不出一个用 pandas 的解决方案，但你可以使用ngroup分配组号，然后使用defaultdict创建组，这样你就可以查找累积结果：

from collections import defaultdict

d = defaultdict(lambda: defaultdict(int))

df["group"] = df.groupby("Date").ngroup()

for a, b in zip(df["Winner"], df["group"]):
    d[b][a] = 1+d.get(b-1,{}).get(a, 0)

df["Streak"] = [d.get(y-1, {}).get(x, 0) for x, y in zip(df["Home_Team"], df["group"])]

print (df.drop("group", 1))

          Date Home_Team Away_Team Winner  Streak
0   2005-08-06         A         G      A       0
1   2005-08-06         B         H      H       0
2   2005-08-06         C         I      C       0
3   2005-08-06         D         J      J       0
4   2005-08-06         E         K      K       0
5   2005-08-06         F         L      F       0
6   2005-08-13         A         B      A       1
7   2005-08-13         C         D      D       1
8   2005-08-13         E         F      F       0
9   2005-08-13         G         H      H       0
10  2005-08-13         I         J      J       0
11  2005-08-13         K         L      K       1
12  2005-08-20         B         C      B       0
13  2005-08-20         A         D      A       2
14  2005-08-20         G         K      K       0
15  2005-08-20         I         E      E       0
16  2005-08-20         F         H      F       2
17  2005-08-20         J         L      J       2
18  2005-08-27         A         H      A       3
19  2005-08-27         B         F      B       1
20  2005-08-27         J         C      C       3
21  2005-08-27         D         E      D       0
22  2005-08-27         I         K      K       0
23  2005-08-27         L         G      G       0
24  2005-09-05         B         A      A       2
25  2005-09-05         D         C      D       1
26  2005-09-05         F         E      F       0
27  2005-09-05         H         G      H       0
28  2005-09-05         J         I      I       0
29  2005-09-05         K         L      K       4

- Henry Yik

我现在添加了更多的数据。 - the man

它运行得非常好。你的问题在别处，可能是数据不一致。 - Henry Yik

它不应该改变任何东西。将会有一个名为“tie”的新组，但它不会影响其他组。您确定团队名称在您的数据中始终保持不变，没有拼写错误吗？ - Henry Yik

是的，它们是一致的，因为这也会破坏我的解决方案（它运行良好）。 - the man

方案1对我无效；似乎没有明显的连胜模式，所以我不确定发生了什么。它绝对不会在平局/失败后重置为零。 - the man

显示剩余8条评论

0

正在修复中！

这里可能是最简单的方法 -

def get_streak(l,m,n):
    wins = np.roll(np.cumsum([1 if i==n else 0 for i in l]),1)
    wins[0]=0
    filts = np.array([1 if i==n else 0 for i in m])
    mul = np.multiply(wins, filts)
    return mul


streaks = np.zeros((30,)).astype(int)
l = list(df['Winner'])
m = list(df['Home_Team'])

for i in df['Winner'].unique():
    streaks += get_streak(l,m,i)
    
df['streaks'] = streaks

          Date Home_Team Away_Team Winner  streaks
0   2005-08-06         A         G      A        0
1   2005-08-06         B         H      H        0
2   2005-08-06         C         I      C        0
3   2005-08-06         D         J      J        0
4   2005-08-06         E         K      K        0
5   2005-08-06         F         L      F        0
6   2005-08-13         A         B      A        1
7   2005-08-13         C         D      D        1
8   2005-08-13         E         F      F        0
9   2005-08-13         G         H      H        0
10  2005-08-13         I         J      J        0
11  2005-08-13         K         L      K        1
12  2005-08-20         B         C      B        0
13  2005-08-20         A         D      A        2
14  2005-08-20         G         K      K        0
15  2005-08-20         I         E      E        0
16  2005-08-20         F         H      F        2
17  2005-08-20         J         L      J        2
18  2005-08-27         A         H      A        3
19  2005-08-27         B         F      B        1
20  2005-08-27         J         C      C        3
21  2005-08-27         D         E      D        1
22  2005-08-27         I         K      K        0
23  2005-08-27         L         G      G        0
24  2005-09-05         B         A      A        2
25  2005-09-05         D         C      D        2
26  2005-09-05         F         E      F        3
27  2005-09-05         H         G      H        2
28  2005-09-05         J         I      I        3
29  2005-09-05         K         L      K        4

这很简单 -

首先，您需要计算给定球队的胜利次数的累积和，并将它们向右移一位。
然后，您需要将它们与它们是主场球队时的实例进行点积。将结果保存在名为“streak”的向量中
然后，您需要循环遍历所有唯一的主场球队并计算他们的连胜总和。
完成!

通过一些打印语句可以更直观地了解函数的工作原理 -

def get_streak(l,m,n):
    wins = np.roll(np.cumsum([1 if i==n else 0 for i in l]),1)
    wins[0]=0
    print('wins:',wins)
    filts = np.array([1 if i==n else 0 for i in m])
    print('home:',filts)
    mul = np.multiply(wins, filts)
    print('strk:', mul)
    return mul

streak_A = get_streak(l,m,'A')

wins: [0 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5]
home: [1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
strk: [0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0]

你要找的是所有连续序列的逐元素求和。

基准测试（似乎是所有其他答案中最快的）-

529 µs ± 20.6 µs每个循环（平均值±7次运行的标准偏差，每个1000个循环）

- Akshay Sehgal

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Mad Physicist · Accepted Answer

我将在此提供一个基于numpy的解决方案。首先，因为我不是很熟悉pandas并且不想做研究，其次，无论如何numpy解决方案都应该可以正常工作。

让我们首先来看看一个给定团队发生了什么。您的目标是根据团队参与的比赛序列找到连续胜利的次数。我会删除日期列并将您的数据首先转换为numpy数组:

x = np.array([
    ['A', 'G', 'A'],
    ['B', 'H', 'H'],
    ['C', 'I', 'C'],
    ['D', 'J', 'J'],
    ['E', 'K', 'K'],
    ['F', 'L', 'F'],
    ['A', 'B', 'A'],
    ['C', 'D', 'D'],
    ['E', 'F', 'F'],
    ['G', 'H', 'H'],
    ['I', 'J', 'J'],
    ['K', 'L', 'K'],
    ['B', 'C', 'B'],
    ['A', 'D', 'A'],
    ['G', 'K', 'K'],
    ['I', 'E', 'E'],
    ['F', 'H', 'F'],
    ['J', 'L', 'J']])

你不需要日期，因为你关心的只是谁参加比赛了，即使他们在同一天进行了多次比赛。所以让我们只看一下 A 队：

A_played = np.flatnonzero((x[:, :2] == 'A').any(axis=1))
A_won = x[A_played, -1] == 'A'

A_played是一个索引数组，它的元素数量与x中的行数相同。 A_won是一个掩码，其元素数量与np.count_nonzero(A_played)相同；也就是说，A参与的比赛数量。

查找连胜/连败的长度是一个相当成熟的问题：

streaks = np.diff(np.flatnonzero(np.diff(np.r_[False, A_won, False])))[::2]

你需要计算掩码转换值的每一对索引之间的差异。使用 False 进行额外填充，确保你知道掩码的切换方向。所寻求的结果基于此计算，但需要更多细节，因为需要计算累积和，但在每次运行后重置。你可以通过将数据的值设置为负的运行长度来实现：

wins = np.r_[0, A_won, 0]  # Notice the int dtype here
switch_indices = np.flatnonzero(np.diff(wins)) + 1
streaks = np.diff(switch_indices)[::2]
wins[switch_indices[1::2]] = -streaks

现在您拥有一个可修剪的数组，其累积总和可以直接分配给输出列：

streak_counts = np.cumsum(wins[:-2])
output = np.zeros((x.shape[0], 2), dtype=int)

# Home streak
home_mask = x[A_played, 0] == 'A'
output[A_played[home_mask], 0] = streak_counts[home_mask]

# Away streak
away_mask = ~home_mask
output[A_played[away_mask], 1] = streak_counts[away_mask]

现在，您可以循环遍历所有团队（相对于所有比赛的总数来说应该是相当少的）:

def process_team(data, team, output):
    played = np.flatnonzero((data[:, :2] == team).any(axis=1))
    won = data[played, -1] == team
    wins = np.r_[0, won, 0]
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]
    wins[switch_indices[1::2]] = -streaks
    streak_counts = np.cumsum(wins[:-2])

    home_mask = data[played, 0] == team
    away_mask = ~home_mask

    output[played[home_mask], 0] = streak_counts[home_mask]
    output[played[away_mask], 1] = streak_counts[away_mask]

output = np.empty((x.shape[0], 2), dtype=int)

# Assume every team has been home team at least once.
# If not, x[:, :2].ravel() copies the data and np.unique(x[:, :2]) does too
for team in set(x[:, 0]):
    process_team(x, team, output)