Pandas: how to apply two different operators with apply and lambda?

Question

6

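This question is very similar to one I posted earlier, with one change. Instead of only taking the absolute difference for all columns, I also want a magnitude difference for the 'z' column: if the current z is more than 1.1 times the previous value, keep the row.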

(More background on the problem)

import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.25, 3, 6],
})
print(df)
#    rank  x  y     z
# 0     1  0  0  1.00
# 1     1  3  4  3.00
# 2     2  0  0  1.20
# 3     2  3  4  3.25
# 4     3  4  5  3.00
# 5     3  2  5  6.00

This is the output I want:

output = pd.DataFrame({
    'rank': [1, 1, 2, 3],
    'x': [0, 3, 0, 2],
    'y': [0, 4, 0, 5],
    'z': [1, 3, 1.2, 6],
})
print(output)
#    rank  x  y    z
# 0     1  0  0  1.0
# 1     1  3  4  3.0
# 2     2  0  0  1.2
# 5     3  2  5  6.0

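Basically, what I want is for a row to be removed if the previous rank has any row that matches it in x and y (±1 both ways) and in z (< 1.1z).

So, given the rows in rank 1, any row in rank 2 with a combination of x = (-1 to 1), y = (-1 to 1), z = (< 1.1) or x = (2 to 5), y = (3 to 5), z = (< 3.3) should be removed.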

- mike_gundy123

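Can you state the filtering condition more formally? - n49o7

Is the number of rows per rank always the same? - onepan

@onepan No, different ranks can have different numbers of rows. - mike_gundy123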
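
One possible formal reading of the rule, written as a plain-Python predicate. This is only a sketch: the function name `is_dropped` and its `tol` and `ratio` parameters are hypothetical, and it assumes z is compared against the z of the previous-rank row.

```python
def is_dropped(row, prev_rows, tol=1, ratio=1.1):
    """True if some previous-rank row is within ±tol in x and y
    and the current z is below ratio * that row's z."""
    return any(
        abs(row['x'] - p['x']) <= tol
        and abs(row['y'] - p['y']) <= tol
        and row['z'] < ratio * p['z']
        for p in prev_rows
    )

# The two rank-1 rows from the question
prev_rows = [{'x': 0, 'y': 0, 'z': 1}, {'x': 3, 'y': 4, 'z': 3}]

print(is_dropped({'x': 3, 'y': 4, 'z': 3.25}, prev_rows))  # True  (3.25 < 1.1 * 3)
print(is_dropped({'x': 0, 'y': 0, 'z': 1.2}, prev_rows))   # False (1.2 >= 1.1 * 1)
```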

5 Answers

2

You need to slightly modify my previous code:

def check_previous_group(rank, d, groups):
    if rank-1 not in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset
        # of the previous group: abs(d_prev-s)
        # if all differences are within 1/1/0.1*z for x/y/z
        # for at least one row of the previous group
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*s['z']]).all(axis=1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]

Output:

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0

- mozway

1

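Thanks for replying again! But this isn't quite what I had in mind. If I change z at index 3 to 3.31, the row doesn't show up in the output even though 3.31 > 3.00*1.1. - mike_gundy123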

1

You just need to adjust the z term of the lambda expression from the linked post:

return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*d_prev['z']]).all(1).any(), axis=1)

Here's the full code that works for me:

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 3, 4, 2],
    'y': [0, 4, 0, 4, 4, 5, 5],
    'z': [1, 3, 1.2, 3.3, 3.31, 3, 6],
})


def check_previous_group(rank, d, groups):
    if rank-1 not in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset
        # of the previous group: abs(d_prev-s)
        # if all differences are within 1/1/0.1*z for x/y/z
        # for at least one row of the previous group
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*d_prev['z']]).all(1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]

- cmay

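This doesn't work for me; it says "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." - mike_gundy123

OK, I think I fixed it: instead of .1*d_prev['z'], I need to use .1*s['z']. - mike_gundy123

You may want to double-check that. I believe it compares against 10% of the second z value (denoted s) rather than 10% of the previous value (denoted d_prev). I think just using the earlier code with d_prev, as I did, should give you 10% of the previous row. - cmay

Hmm, you're right, but d_prev is treated as a whole DataFrame, whereas s (as I understand it) is each row of d, so I can't just do d_prev[['z']]. Hope that makes sense. - mike_gundy123

Just added the full implementation; it works for me, with 3.3 excluded and 3.31 included. - cmay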

1

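Maybe I'm just being dumb, but this doesn't work for me (I copied and pasted your code). I get an error saying "ValueError: Unable to coerce list of <class 'int'> to Series/DataFrame". I think it comes from here: "le([1,1,.1*d_prev['z']])", so I changed it to "le([1,1,.1*d_prev[['z']]])". That gave me the error I mentioned in my first reply. - mike_gundy123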
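
One way to sidestep the coercion error, as a sketch only: compare x and y against a scalar tolerance, and compare z separately against a Series built from the previous group's z, so that no Series ends up inside a list passed to `le`. The helper name `flag_rows_to_drop` is hypothetical. It keeps the two-sided absolute-difference test used in this answer and assumes the tolerance should be 10% of the previous row's z.

```python
import pandas as pd

def flag_rows_to_drop(df):
    """Boolean Series: True where a row is close to some row of the previous rank."""
    groups = df.groupby('rank')
    flags = []
    for rank, d in groups:
        if rank - 1 not in groups.groups:
            # No previous rank: nothing to compare against, keep every row
            flags.append(pd.Series(False, index=d.index))
            continue
        d_prev = groups.get_group(rank - 1)

        def near_any(s):
            # Absolute differences between row s and every row of the previous rank
            diff = (d_prev[['x', 'y', 'z']] - s[['x', 'y', 'z']]).abs()
            close_xy = diff[['x', 'y']].le(1).all(axis=1)
            # Series-to-Series comparison, aligned on the index of d_prev
            close_z = diff['z'].le(0.1 * d_prev['z'])
            return (close_xy & close_z).any()

        flags.append(d.apply(near_any, axis=1))
    return pd.concat(flags).astype(bool)

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 3, 4, 2],
    'y': [0, 4, 0, 4, 4, 5, 5],
    'z': [1, 3, 1.2, 3.3, 3.31, 3, 6],
})
mask = flag_rows_to_drop(df)
result = df[~mask]
print(result)
```

On this data the rows at index 3 (z = 3.3) and 5 are flagged, while index 4 (z = 3.31) is kept.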

1

This works for me on Python 3.8.6.

import pandas as pd

dfg = df.groupby("rank")

def filter_func(dfg):
    for g in dfg.groups.keys():
        if g-1 in dfg.groups.keys():
            yield (
                pd.merge(
                    dfg.get_group(g).assign(id = lambda df: df.index), 
                    dfg.get_group(g-1),
                    how="cross", suffixes=("", "_prev")
                ).assign(
                    cond = lambda df: ~(
                        (df.x - df.x_prev).abs().le(1) & (df.y - df.y_prev).abs().le(1) & df.z.divide(df.z_prev).lt(1.1)
                    )
                )
            ).groupby("id").agg(
                {
                    **{"cond": "all"},
                    **{k: "first" for k in df.columns}
                }).loc[lambda df: df.cond].drop(columns = ["cond"])
        else:
            yield dfg.get_group(g)

pd.concat(
    filter_func(dfg), ignore_index=True
)

The output appears to match what you expect:

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     3  2  5  6.0

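Minor edit: in your question it looks like you care about the row index. The solution I posted simply ignores it, but if you want to keep it, just save it as an additional column in the DataFrame.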

- Shffl
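
A sketch of how the original row index could be carried through a cross merge. This is not the code above: it is a simplified variant that cross-merges the whole frame once instead of group by group. The function name `rows_to_drop` is hypothetical, and it assumes the index is unnamed (so `reset_index` creates a column called `index`) and that pandas is 1.2 or later for `how="cross"`.

```python
import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.25, 3, 6],
})

def rows_to_drop(df):
    # Keep the original index as a regular column named 'index'
    cur = df.reset_index()
    # Pair every row with every other row; right-hand columns get the '_prev' suffix
    pairs = cur.merge(df, how='cross', suffixes=('', '_prev'))
    hit = (
        (pairs['rank'] == pairs['rank_prev'] + 1)
        & (pairs['x'] - pairs['x_prev']).abs().le(1)
        & (pairs['y'] - pairs['y_prev']).abs().le(1)
        & pairs['z'].div(pairs['z_prev']).lt(1.1)
    )
    return pairs.loc[hit, 'index'].unique()

result = df.drop(index=rows_to_drop(df))
print(result)  # rows 0, 1, 2 and 5 remain, with their original index labels
```

The full cross merge builds n² pairs, so this is only practical for small frames.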

1

I've modified mozway's function so that it does what you asked.

# comparing nearly equal float values may go wrong, which is why I am using this constant
DELTA=0.1**12

def check_previous_group(rank, d, groups):
    if rank-1 not in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset 
        # of the previous group: abs(d_prev-s)
        # if differences in x and y are within 1 and z < 1.1*x
        # for at least one row of the previous group
        # then flag the row to be dropped (True)
        
        return d.apply(lambda s: (abs(d_prev-s)[['x', 'y']].le([1,1]).all(1)&
                                  (s['z']<1.1*d_prev['x']-DELTA)).any(), axis=1)

Testing:

>>> df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.25, 3, 6],
})

>>> df

   rank  x  y     z
0     1  0  0  1.00
1     1  3  4  3.00
2     2  0  0  1.20
3     2  3  4  3.25
4     3  4  5  3.00
5     3  2  5  6.00

>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
>>> df[~mask]

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0

>>> df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.3, 3, 6],
})

>>> df

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     2  3  4  3.3
4     3  4  5  3.0
5     3  2  5  6.0


>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
>>> df[~mask]

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     2  3  4  3.3
5     3  2  5  6.0

- Arislan Makhmudov

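@mike_gundy123, thanks for the appreciation! I use DELTA because floating-point values are always approximate, so 1.1*3 is not exactly 3.3; it can come out as 1.1*3 = 3.3000000000000003, which causes confusion. Anyway, have a look at this answer, where you'll find a better explanation: https://dev59.com/tG035IYBdhLWcg3wJcjT - Arislan Makhmudov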
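
A small demonstration of the effect described in this comment, using the same `DELTA` constant as the answer. The variable names `z` and `threshold` are only illustrative.

```python
DELTA = 0.1 ** 12  # same tolerance constant as in the answer above

z = 3.3
threshold = 1.1 * 3  # mathematically 3.3, but not exactly 3.3 in binary floating point

print(threshold == z)         # False
print(z < threshold)          # True: without a tolerance, z = 3.3 would be flagged
print(z < threshold - DELTA)  # False: with the tolerance, z = 3.3 is kept
```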

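Oh no, that's not what I meant. I meant that on that line you compare 'z' with 'x', but you should be comparing 'z' with 'z'. - mike_gundy123

@mike_gundy123 But isn't that what you asked for? I quote your question:

Basically what I want to happen is if the previous rank has any rows with x, y (+- 1 both ways) and z (<1.1x) to be removed

Doesn't z (<1.1x) here mean comparing z with 1.1*x? - Arislan Makhmudov

Anyway, you can change x to z and I believe the code will work. - Arislan Makhmudov

lmaoooooo, that one's on me. - mike_gundy123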

- Code Different · Accepted Answer

Here is a solution using numpy broadcasting:

# Initially, no row is dropped
df['drop'] = False

for r in range(df['rank'].min(), df['rank'].max()):
    # Find the x_min, x_max, y_min, y_max, z_max of the current rank
    cond = df['rank'] == r
    x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T
    x_min, x_max = x + [[-1], [1]] # use numpy broadcasting to ±1 in one command
    y_min, y_max = y + [[-1], [1]]
    z_max        = z * 1.1

    # Find the x, y, z of the next rank. Raise them one dimension
    # so that we can make a comparison matrix against x_min, x_max, ...
    cond = df['rank'] == r + 1
    if not cond.any():
        continue
    x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T[:, :, None]

    # Condition to drop a row
    drop = (
        (x_min <= x) & (x <= x_max) &
        (y_min <= y) & (y <= y_max) &
        (z <= z_max)
    ).any(axis=1)
    df.loc[cond, 'drop'] = drop

# Result
df[~df['drop']]

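Condensed version

An even more condensed version (and very likely faster). It is also a great way to puzzle future teammates who read your code: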

r, x, y, z = df[['rank', 'x', 'y', 'z']].T.to_numpy()
rr, xx, yy, zz = [col[:,None] for col in [r, x, y, z]]

drop = (
    (rr == r + 1) &
    (x-1 <= xx) & (xx <= x+1) &
    (y-1 <= yy) & (yy <= y+1) &
    (zz <= z*1.1)
).any(axis=1)

# Result
df[~drop]

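This compares every row in df against every row (including itself) and returns True (i.e. drop) when:

the current row's rank == the other row's rank + 1; and
the current row's x and y are within ±1 of the other row's x and y, and its z is at most 1.1 times the other row's z.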
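
To make the pairwise comparison concrete, here is a toy sketch with three hypothetical rows and only the rank and x conditions. Turning a column into shape (n, 1) and comparing it with the original shape (n,) array yields an (n, n) matrix in which entry [i, j] compares row i with row j.

```python
import numpy as np

# Toy values, not taken from the question
rank = np.array([1, 2, 2])
x = np.array([3, 3, 0])

# Column vectors of shape (3, 1)
rr, xx = rank[:, None], x[:, None]

# pair[i, j] is True when row i is exactly one rank above row j
# and row i's x is within ±1 of row j's x
pair = (rr == rank + 1) & (x - 1 <= xx) & (xx <= x + 1)

print(pair.shape)                 # (3, 3)
print(pair.any(axis=1).tolist())  # [False, True, False]
```

Only the second row is flagged: it is in rank 2 and its x matches the rank-1 row, whereas the third row's x = 0 is too far from 3.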