How to calculate a line of best fit for each row with NaN values?

Question


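I have a dataset that stores marathon split times (5K, 10K, etc.) and identifiers (age, gender, country) as columns, with individuals as rows. Each cell in the marathon split columns may contain a float (the number of seconds taken to reach that split) or "NaN". A row may contain up to 4 NaN values. Here is some sample data: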

      Age M/F Country      5K     10K     15K     20K    Half  Official Time
2323   38   M     CHI  1300.0  2568.0  3834.0  5107.0  5383.0        10727.0
2324   23   M     USA  1286.0  2503.0  3729.0  4937.0  5194.0        10727.0
2325   36   M     USA  1268.0  2519.0  3775.0  5036.0  5310.0        10727.0
2326   37   M     POL  1234.0  2484.0  3723.0  4972.0  5244.0        10727.0
2327   32   M     DEN     NaN  2520.0  3782.0  5046.0  5319.0        10728.0

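I want to compute a line of best fit for the marathon split times (using only the columns from "5K" through "Half") and, for each row with at least one NaN value, predict a data point from that row's line of best fit to replace the NaN.

In the sample data, I would only compute a line of best fit for row 2327 (using the values 2520.0, 3782.0, 5046.0 and 5319.0). Using that line, I want to replace the NaN 5K time with the predicted 5K time.

How can I compute a line of best fit for each row that has NaN values?

Thanks in advance.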

- user15936471

1 Answer


- David Erickson · Accepted Answer

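I extrapolated a solution (pun intended) from here: https://dev59.com/dl0Z5IYBdhLWcg3wiw0v#31340344. I'm not sure whether pandas had a reliable extrapolation method as of 2021, so you may need to use scipy or another library. When extrapolating, I excluded the "Half" column, because the 5K, 10K, 15K and 20K distances are evenly spaced: plotted against column position they fall on a straight line, whereas the half-marathon mark (about 21.1 km) does not. That does not mean expected split times are linear in distance. Runners typically slow down over longer distances, so the average time per kilometer tends to rise. Still, this "gets the job done" without an overly complex calculation.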
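To make the spacing argument concrete, here is a small sketch that maps each split column to its distance in kilometers and looks at the gaps between consecutive splits. The `distance_km` mapping is an assumption added for illustration (it is not part of the original data); 21.0975 km is the standard half-marathon distance.

```python
import numpy as np

# Assumed mapping from split column to distance in km (illustrative only)
distance_km = {'5K': 5.0, '10K': 10.0, '15K': 15.0, '20K': 20.0, 'Half': 21.0975}

# Gap between each pair of consecutive splits
gaps = np.diff(list(distance_km.values()))
print(gaps)

# 5K..20K are 5 km apart, so column position is a valid stand-in for distance
evenly_spaced = bool(np.allclose(gaps[:3], 5.0))
# The 20K -> Half gap is much shorter, which breaks the equal-spacing assumption
half_breaks_spacing = not bool(np.isclose(gaps[3], 5.0))
print(evenly_spaced, half_breaks_spacing)
```

Because the code in this answer uses the positions 0, 1, 2, 3 as x-values, it implicitly relies on this equal spacing.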

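It is also worth noting that if the first column were 1K instead of 5K, this approach would fail, since it only works when the distances are evenly spaced. In that case you would have to either compute with the kilometer values taken from the column names or also use other runners' rows. Either way, this is an imperfect solution, but much better than pandas' interpolate. I linked another potential solution in a comment on tdy's answer.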
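For unevenly spaced splits, one possible approach is to fit each row against the actual distances rather than column positions. The sketch below is not the method used in this answer: it swaps in an ordinary least-squares line via `numpy.polyfit`, includes the "Half" column, and fills only the NaN cells. The `distance_km` mapping and the helper name `fill_row_with_fit` are hypothetical, and the sample rows are copied from the question.

```python
import numpy as np
import pandas as pd

# Sample rows from the question (split times in seconds)
df = pd.DataFrame(
    {
        '5K': [1300.0, 1286.0, 1268.0, 1234.0, np.nan],
        '10K': [2568.0, 2503.0, 2519.0, 2484.0, 2520.0],
        '15K': [3834.0, 3729.0, 3775.0, 3723.0, 3782.0],
        '20K': [5107.0, 4937.0, 5036.0, 4972.0, 5046.0],
        'Half': [5383.0, 5194.0, 5310.0, 5244.0, 5319.0],
    },
    index=[2323, 2324, 2325, 2326, 2327],
)

# Assumed distance in km for each split column (hypothetical mapping)
distance_km = {'5K': 5.0, '10K': 10.0, '15K': 15.0, '20K': 20.0, 'Half': 21.0975}
cols = list(distance_km)
x = np.array([distance_km[c] for c in cols])


def fill_row_with_fit(row):
    """Fill the NaNs of one row from a least-squares line through its known splits."""
    y = row.to_numpy(dtype=float)
    known = ~np.isnan(y)
    # Nothing to fill, or fewer than two points to define a line: leave the row as is
    if known.all() or known.sum() < 2:
        return row
    slope, intercept = np.polyfit(x[known], y[known], deg=1)
    # Keep observed values; use the fitted line only where the value is missing
    filled = np.where(known, y, slope * x + intercept)
    return pd.Series(filled, index=row.index)


df_filled = df.copy()
df_filled[cols] = df[cols].apply(fill_row_with_fit, axis=1)
print(df_filled.loc[2327])
```

Rows with fewer than two known splits are left unchanged here; handling those would require information from other runners, as noted above.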

import pandas as pd
from scipy.interpolate import interp1d

cols = ['5K', '10K', '15K', '20K']

# Focus on the four split columns from 5K to 20K and transpose the dataframe, since we work horizontally across columns.
# The index must be numeric (0, 1, 2, 3) to serve as x-values, so the column names are dropped here and restored at the end.
df_extrap = df[cols].T.reset_index(drop=True)

# Create a scipy interpolation function to be called by the custom extrapolation function below
def scipy_interpolate_func(s):
    s_no_nan = s.dropna()
    return interp1d(s_no_nan.index.values, s_no_nan.values, kind='linear', bounds_error=False)


# Evaluate the straight line through the first and last non-NaN points at new_x
def my_extrapolate_func(interp_func, new_x):
    x1, x2 = interp_func.x[0], interp_func.x[-1]
    y1, y2 = interp_func.y[0], interp_func.y[-1]
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (new_x - x1)

# Concatenate the extrapolated columns and transpose back to the original shape.
# Note: every cell in the four columns is replaced by a point on that line, not only the NaN cells.
s_extrapolated = pd.concat([pd.Series(my_extrapolate_func(scipy_interpolate_func(df_extrap[s]),
                                                          df_extrap[s].index.values),
                                      index=df_extrap[s].index) for s in df_extrap.columns], axis=1).T
s_extrapolated.columns = cols  # restore the column names dropped above
df[cols] = s_extrapolated
df

Out[1]: 
   index  Age M/F Country      5K     10K     15K     20K    Half  \
0   2323   38   M     CHI  1300.0  2569.0  3838.0  5107.0  5383.0   
1   2324   23   M     USA  1286.0  2503.0  3720.0  4937.0  5194.0   
2   2325   36   M     USA  1268.0  2524.0  3780.0  5036.0  5310.0   
3   2326   37   M     POL  1234.0  2480.0  3726.0  4972.0  5244.0   
4   2327   32   M     DEN  1257.0  2520.0  3783.0  5046.0  5319.0   

   Official Time  
0        10727.0  
1        10727.0  
2        10727.0  
3        10727.0  
4        10728.0
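Note that in the output above, observed values also changed (for example, the 10K time in the first row went from 2568.0 to 2569.0), because every cell is replaced by a point on the line through the first and last known splits. If only the NaN cells should change, one option is to compute the same line and pass it to `fillna`. This is a minimal sketch on two rows from the question; the helper name `two_point_line` is hypothetical, and it assumes each row has at least two known, evenly spaced splits.

```python
import numpy as np
import pandas as pd

cols = ['5K', '10K', '15K', '20K']
df = pd.DataFrame(
    [[1300.0, 2568.0, 3834.0, 5107.0],
     [np.nan, 2520.0, 3782.0, 5046.0]],
    columns=cols, index=[2323, 2327],
)


def two_point_line(row):
    """Values on the line through the first and last non-NaN points of a row."""
    y = row.to_numpy(dtype=float)
    x = np.arange(len(y))  # column positions 0..3 stand in for distance
    known = np.flatnonzero(~np.isnan(y))
    first, last = known[0], known[-1]  # assumes at least two known points
    slope = (y[last] - y[first]) / (last - first)
    return pd.Series(y[first] + slope * (x - first), index=row.index)


line_values = df.apply(two_point_line, axis=1)
# Keep observed values; take the line's value only where the original is NaN
df_filled = df.fillna(line_values)
print(df_filled)
```

With this variant, row 2327 gets a 5K value of 1257.0, as in the output above, while the observed splits in row 2323 stay as recorded.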