How to apply a user-defined function to a pandas DataFrame

Question


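I have defined the following function, which works on a 2D array. The angle function computes the angles between consecutive vectors.

When called, it takes "directions" as its argument, a 2D array with two columns: one holding the x values and the other the y values.

Here, directions is obtained by applying np.diff() to a 2D array of points.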

import matplotlib.pyplot as plt
import numpy as np
import os
import rdp

def angle(dir):
    """
    Returns the angles between vectors.

    Parameters:
    dir is a 2D-array of shape (N,M) representing N vectors in M-dimensional space.

    The return value is a 1D-array of values of shape (N-1,), with each value between 0 and pi.

    0 implies the vectors point in the same direction
    pi/2 implies the vectors are orthogonal
    pi implies the vectors point in opposite directions
    """
    dir2 = dir[1:]
    dir1 = dir[:-1]
    return np.arccos((dir1*dir2).sum(axis=1)/(np.sqrt((dir1**2).sum(axis=1)*(dir2**2).sum(axis=1))))

tolerance = 70
min_angle = np.pi*0.22

filename = os.path.expanduser('~/tmp/bla.data')
points = np.genfromtxt(filename).T
print(len(points))
x, y = points.T

# Use the Ramer-Douglas-Peucker algorithm to simplify the path
# http://en.wikipedia.org/wiki/Ramer-Douglas-Peucker_algorithm
# Python implementation: https://github.com/sebleier/RDP/
simplified = np.array(rdp.rdp(points.tolist(), tolerance))

print(len(simplified))
sx, sy = simplified.T

# compute the direction vectors on the simplified curve
directions = np.diff(simplified, axis=0)
theta = angle(directions)

# Select the index of the points with the greatest theta
# Large theta is associated with greatest change in direction.
idx = np.where(theta>min_angle)[0]+1
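
As a small sanity check of the steps above, here is a minimal sketch on a hypothetical L-shaped path (the points are made up for illustration). The function mirrors angle() above; the np.clip call is an added guard against rounding pushing the cosine slightly outside [-1, 1], not part of the original code.

```python
import numpy as np

def angle(vectors):
    """Angles between consecutive rows of an (N, M) array, each in [0, pi]."""
    v1, v2 = vectors[:-1], vectors[1:]
    cos = (v1 * v2).sum(axis=1) / np.sqrt((v1**2).sum(axis=1) * (v2**2).sum(axis=1))
    # Added safeguard (assumption): keep the cosine inside arccos's domain.
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Hypothetical simplified path: straight along x, then a right-angle turn up.
simplified = np.array([[0, 0], [2, 0], [4, 0], [4, 3], [4, 6]], dtype=float)

# axis=0 subtracts each point from the next one: N points give N-1 directions.
directions = np.diff(simplified, axis=0)
theta = angle(directions)

min_angle = np.pi * 0.22
# +1 maps an angle between directions k and k+1 back to point k+1.
idx = np.where(theta > min_angle)[0] + 1
print(simplified[idx])
```

Only the corner point (4, 0) has a turn larger than min_angle, so idx holds the single position 2.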

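I want to apply the code above to a pandas.DataFrame of trajectory data.

Below is a sample df. Rows that share the same id and subid form one trajectory: rows 0-3 (id 11, subid 2) are points on one trajectory, and rows 4-6 (id 11, subid 3) are another. A new trajectory starts whenever subid or id changes.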

  id      subid     simplified_points     sx       sy
0 11      2         (3,4)                 3        4
1 11      2         (5,6)                 5        6
2 11      2         (7,8)                 7        8
3 11      2         (9,9)                 9        9
4 11      3         (10,12)               10       12
5 11      3         (12,14)               12       14
6 11      3         (13,15)               13       15
7 12      9         (18,20)               18       20
8 12      9         (22,24)               22       24
9 12      9         (25,27)               25       27

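The rdp algorithm has already been applied to the DataFrame above; simplified_points is its result, unpacked further into the two columns sx and sy.

The problem is how to get directions for each trajectory, and then theta and idx. The code above handles only a single trajectory, and on a 2D array at that, so I have not been able to apply it to the pandas DataFrame above.

Please suggest a way to apply the code above to each trajectory in df.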
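
To make the trajectory definition concrete, here is a minimal sketch that rebuilds the sample rows (leaving out simplified_points, which only repeats sx and sy) and counts the points in each (id, subid) pair. The name sizes is just for illustration.

```python
import pandas as pd

# Sample rows from the table, without the redundant simplified_points column.
df = pd.DataFrame({
    "id":    [11, 11, 11, 11, 11, 11, 11, 12, 12, 12],
    "subid": [2, 2, 2, 2, 3, 3, 3, 9, 9, 9],
    "sx":    [3, 5, 7, 9, 10, 12, 13, 18, 22, 25],
    "sy":    [4, 6, 8, 9, 12, 14, 15, 20, 24, 27],
})

# Each distinct (id, subid) pair is one trajectory.
sizes = df.groupby(["id", "subid"]).size()
print(sizes)
```

This gives three trajectories with 4, 3 and 3 points.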

- Liza

1 Answer


- Stephen Rauch · Accepted Answer

You can use pandas.DataFrame.groupby.apply() to process each (id, subid) group, for example:

Code:

import numpy as np
import pandas as pd

def theta(group):
    dx = pd.Series(group.sx.diff(), name='dx')
    dy = pd.Series(group.sy.diff(), name='dy')
    theta = pd.Series(np.arctan2(dy, dx), name='theta')
    return pd.concat([dx, dy, theta], axis=1)

df2 = df.groupby(['id', 'subid']).apply(theta)

Test code:

from io import StringIO

df = pd.read_fwf(StringIO(u"""
    id      subid     simplified_points     sx       sy
    11      2         (3,4)                 3        4
    11      2         (5,6)                 5        6
    11      2         (7,8)                 7        8
    11      2         (9,9)                 9        9
    11      3         (10,12)               10       12
    11      3         (12,14)               12       14
    11      3         (13,15)               13       15
    12      9         (18,20)               18       20
    12      9         (22,24)               22       24
    12      9         (25,27)               25       27"""),
                 header=1)

df2 = df.groupby(['id', 'subid']).apply(theta)
df = pd.concat([df, pd.DataFrame(df2.values, columns=df2.columns)], axis=1)
print(df)

Results:

   id  subid simplified_points  sx  sy   dx   dy     theta
0  11      2             (3,4)   3   4  NaN  NaN       NaN
1  11      2             (5,6)   5   6  2.0  2.0  0.785398
2  11      2             (7,8)   7   8  2.0  2.0  0.785398
3  11      2             (9,9)   9   9  2.0  1.0  0.463648
4  11      3           (10,12)  10  12  NaN  NaN       NaN
5  11      3           (12,14)  12  14  2.0  2.0  0.785398
6  11      3           (13,15)  13  15  1.0  1.0  0.785398
7  12      9           (18,20)  18  20  NaN  NaN       NaN
8  12      9           (22,24)  22  24  4.0  4.0  0.785398
9  12      9           (25,27)  25  27  3.0  3.0  0.785398
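
Note that theta in the result above is the heading of each segment from np.arctan2, whereas the question's angle() gives the angle between consecutive segments. Below is a minimal sketch of the latter, computed per trajectory and assuming the sample columns sx and sy. The names turn_angles and turn_angle, the np.clip guard, and the 0.3 threshold are illustrative choices, not part of the original post (the question uses np.pi*0.22, which no row of this small sample exceeds).

```python
import numpy as np
import pandas as pd

def angle(vectors):
    """Angles between consecutive rows of an (N, 2) array, each in [0, pi]."""
    v1, v2 = vectors[:-1], vectors[1:]
    cos = (v1 * v2).sum(axis=1) / np.sqrt((v1**2).sum(axis=1) * (v2**2).sum(axis=1))
    # Added safeguard (assumption): keep the cosine inside arccos's domain.
    return np.arccos(np.clip(cos, -1.0, 1.0))

def turn_angles(group):
    """Turn angle at each point of one trajectory; NaN at its first and last point."""
    pts = group[["sx", "sy"]].to_numpy(dtype=float)
    out = np.full(len(pts), np.nan)
    if len(pts) >= 3:  # at least two direction vectors are needed for one angle
        out[1:-1] = angle(np.diff(pts, axis=0))
    return pd.Series(out, index=group.index)

df = pd.DataFrame({
    "id":    [11, 11, 11, 11, 11, 11, 11, 12, 12, 12],
    "subid": [2, 2, 2, 2, 3, 3, 3, 9, 9, 9],
    "sx":    [3, 5, 7, 9, 10, 12, 13, 18, 22, 25],
    "sy":    [4, 6, 8, 9, 12, 14, 15, 20, 24, 27],
})

# Iterate over the groups and rely on index alignment when assigning back.
df["turn_angle"] = pd.concat(
    [turn_angles(g) for _, g in df.groupby(["id", "subid"])]
)

min_angle = 0.3  # hypothetical threshold for this sample
turning = df[df["turn_angle"] > min_angle]
print(turning)
```

On the sample data only row 2, the point (7, 8), is selected, with a turn of about 0.32 rad. Iterating over the groups instead of calling groupby.apply is a design choice in this sketch that keeps the result independent of how apply treats the grouping columns.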