Python pandas linear regression groupby

21

I am trying to run a linear regression on a pandas DataFrame that is grouped by a column:

This is the DataFrame df:

  group      date      value
    A     01-02-2016     16 
    A     01-03-2016     15 
    A     01-04-2016     14 
    A     01-05-2016     17 
    A     01-06-2016     19 
    A     01-07-2016     20 
    B     01-02-2016     16 
    B     01-03-2016     13 
    B     01-04-2016     13 
    C     01-02-2016     16 
    C     01-03-2016     16 

#import standard packages
import pandas as pd
import numpy as np

#import ML packages
from sklearn.linear_model import LinearRegression

#First, let's group the data by group
df_group = df.groupby('group')

#Then, we need to change the date to integer
df['date'] = pd.to_datetime(df['date'])  
df['date_delta'] = (df['date'] - df['date'].min())  / np.timedelta64(1,'D')

Now I want to predict the value for each group on 01-10-2016.

I want to end up with a new DataFrame like this:

group      01-10-2016
  A      predicted value
  B      predicted value
  C      predicted value

The approach from "How to apply OLS from statsmodels to groupby" does not work.

for group in df_group.groups.keys():
      df= df_group.get_group(group)
      X = df['date_delta'] 
      y = df['value']
      model = LinearRegression(y, X)
      results = model.fit(X, y)
      print results.summary()

I get the following error:

ValueError: Found arrays with inconsistent numbers of samples: [ 1 52]

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

UPDATE:

I changed it to

for group in df_group.groups.keys():
      df= df_group.get_group(group)
      X = df[['date_delta']]
      y = df.value
      model = LinearRegression(y, X)
      results = model.fit(X, y)
      print results.summary()

Now I get this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

1
@ayhan - Done! Thank you. - jeangelj
1
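You are clobbering your df variable inside the loop. - piRSquared
I think the problem is with your constructor call. You are passing y, an array of values, to the constructor's fit_intercept parameter, which expects a boolean, and you are passing another array, X, to the copy_X boolean. - Oliver Dain
Do you want date_delta to be calculated relative to the minimum date of the entire df? - piRSquared
I want to compute the date difference to the future date (01-10-2016) for each group. - jeangelj
3 Answers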

14

New answer

def model(df, delta):
    y = df[['value']].values
    X = df[['date_delta']].values
    return np.squeeze(LinearRegression().fit(X, y).predict(delta))

def group_predictions(df, date):
    date = pd.to_datetime(date)
    df.date = pd.to_datetime(df.date)

    day = np.timedelta64(1, 'D')
    mn = df.date.min()
    df['date_delta'] = df.date.sub(mn).div(day)

    dd = (date - mn) / day

    return df.groupby('group').apply(model, delta=dd)

demo

group_predictions(df, '01-10-2016')

group
A    22.333333333333332
B     3.500000000000007
C                  16.0
dtype: object

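Old answer

You are using LinearRegression incorrectly.

  • You do not call it with the data and then also fit it with the data. Just instantiate the class like this
    • model = LinearRegression()
  • Then fit it with fit
    • model.fit(X, y)

But all this does is set values on the object stored in model. There is no nice summary method. There may be one somewhere, but I know the one in statsmodels, so see below.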
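The two steps above can be sketched as follows. This is a minimal illustration, assuming the group A rows from the question with date_delta already computed as days since 01-02-2016; the variable names slope, intercept and pred are chosen here for illustration. Because there is no summary method, the fitted parameters are read from the coef_ and intercept_ attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Group A from the question: days since 01-02-2016 and the observed values.
X = np.array([0, 1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)  # shape (6, 1)
y = np.array([16, 15, 14, 17, 19, 20], dtype=float)

model = LinearRegression()  # the constructor takes options only, no data
model.fit(X, y)             # fitting stores the parameters on the object

slope = model.coef_[0]
intercept = model.intercept_
# 01-10-2016 is 8 days after the first date; predict expects a 2D array.
pred = model.predict(np.array([[8.0]]))[0]
print(slope, intercept, pred)
```

The slope and intercept should agree with the date_delta and Intercept coefficients that statsmodels reports for the first group.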


Option 1
Use statsmodels instead

from statsmodels.formula.api import ols

for k, g in df_group:
    model = ols('value ~ date_delta', g)
    results = model.fit()
    print(results.summary())

                        OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                       0.652
Model:                            OLS   Adj. R-squared:                  0.565
Method:                 Least Squares   F-statistic:                     7.500
Date:                Fri, 06 Jan 2017   Prob (F-statistic):             0.0520
Time:                        10:48:17   Log-Likelihood:                -9.8391
No. Observations:                   6   AIC:                             23.68
Df Residuals:                       4   BIC:                             23.26
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.3333      1.106     12.965      0.000        11.264    17.403
date_delta     1.0000      0.365      2.739      0.052        -0.014     2.014
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.393
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.461
Skew:                          -0.649   Prob(JB):                        0.794
Kurtosis:                       2.602   Cond. No.                         5.78
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.500
Method:                 Least Squares   F-statistic:                     3.000
Date:                Fri, 06 Jan 2017   Prob (F-statistic):              0.333
Time:                        10:48:17   Log-Likelihood:                -3.2171
No. Observations:                   3   AIC:                             10.43
Df Residuals:                       1   BIC:                             8.631
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     15.5000      1.118     13.864      0.046         1.294    29.706
date_delta    -1.5000      0.866     -1.732      0.333       -12.504     9.504
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   3.000
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.531
Skew:                          -0.707   Prob(JB):                        0.767
Kurtosis:                       1.500   Cond. No.                         2.92
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -0.000
Date:                Fri, 06 Jan 2017   Prob (F-statistic):                nan
Time:                        10:48:17   Log-Likelihood:                 63.481
No. Observations:                   2   AIC:                            -123.0
Df Residuals:                       0   BIC:                            -125.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     16.0000        inf          0        nan           nan       nan
date_delta -3.553e-15        inf         -0        nan           nan       nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.400
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.333
Skew:                           0.000   Prob(JB):                        0.846
Kurtosis:                       1.000   Cond. No.                         2.62
==============================================================================
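If the goal is a prediction rather than a summary, the fitted statsmodels results object can also be asked for a value directly. The following is a sketch, assuming date_delta has already been computed as in the question and that the target date 01-10-2016 corresponds to a date_delta of 8; the names new_point and predictions are illustrative.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Data from the question, with date_delta already expressed in days.
df = pd.DataFrame({
    'group': list('AAAAAABBBCC'),
    'date_delta': [0, 1, 2, 3, 4, 5, 0, 1, 2, 0, 1],
    'value': [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})

# The point to predict at: 8 days after the earliest date.
new_point = pd.DataFrame({'date_delta': [8.0]})

predictions = {}
for k, g in df.groupby('group'):
    results = ols('value ~ date_delta', g).fit()
    # predict returns a Series with one entry per row of new_point
    predictions[k] = float(results.predict(new_point).iloc[0])

print(pd.Series(predictions, name='01-10-2016'))
```

This avoids reading the coefficients off each summary table by hand.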

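Thank you @piRSquared; is there a way to do this with LinearRegression? I am trying to create a DataFrame containing the predicted value for each group at a future date. With the OLS summary approach, I would have to read off the formula for each group manually and compute its value for 01-10-2016. - jeangelj
Thank you very much; I have one more question so that I fully understand how you solved this. If the dates in my dataset were formatted like "2016-01-10", how would the code change? - jeangelj
I get this error: TypeError: ufunc subtract cannot use operands with types dtype('<M8[ns]') and dtype('O') - jeangelj
@jeangelj, I have updated the post to make sure df.date is in datetime format in case it was not passed in that way. Also, look at the documentation for pd.to_datetime; you may want to use the dayfirst parameter: pd.to_datetime(dayfirst=True) - piRSquared
Thank you @piRSquared - unfortunately, when I run group_predictions(df, '01-10-2016'), I still get the same error; it points to this line: df['date_delta'] = df.date.sub(mn).div(day). - jeangelj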
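The date handling discussed in these comments can be sketched as follows. The TypeError above suggests that one operand of the subtraction was still of object dtype, which is an assumption here since the failing data is not shown. Parsing with an explicit format keeps the column as datetime64 and removes any ambiguity between month-first and day-first dates; the sample values are illustrative.

```python
import pandas as pd

# Dates in ISO format, as asked about in the comments.
raw = pd.Series(['2016-01-02', '2016-01-03', '2016-01-10'])
dates = pd.to_datetime(raw, format='%Y-%m-%d')  # dtype becomes datetime64[ns]

mn = dates.min()
day = pd.Timedelta(days=1)
date_delta = (dates - mn) / day  # float days since the earliest date

# The target date must be parsed as well before subtracting.
target_delta = (pd.to_datetime('2016-01-10', format='%Y-%m-%d') - mn) / day

# Without a format, '01-10-2016' is read month first unless dayfirst=True.
month_first = pd.to_datetime('01-10-2016')
day_first = pd.to_datetime('01-10-2016', dayfirst=True)
print(date_delta.tolist(), target_delta, month_first.date(), day_first.date())
```

Checking df.date.dtype before the subtraction is a quick way to confirm whether the conversion actually happened.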

2

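This may be a bit late, but I am answering anyway in case someone runs into the same problem. Everything is actually correct except for the regression block. There are two issues with that implementation:

  • Note that model.fit(X, y) expects X to be an {array-like, sparse matrix} of shape (n_samples, n_features), so X must be 2D (y may be 1D or 2D). You can easily convert a 1D series to 2D with reshape(-1, 1).

  • The second issue is the fitting step itself: y and X are not inputs to model = LinearRegression(y, X) but to model.fit(X, y).

Here is the modified regression block: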

for group in df_group.groups.keys():
      g = df_group.get_group(group)  # use a new name so the original df is not overwritten
      X = np.array(g[['date_delta']]).reshape(-1, 1) # note that series does not have reshape function, thus you need to convert to array
      y = np.array(g.value).reshape(-1, 1)
      model = LinearRegression()  # <--- this does not accept (X, y)
      results = model.fit(X, y)
      # LinearRegression has no summary(); print the fitted parameters instead
      print(group, results.coef_, results.intercept_)
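The shape requirement can be checked directly. This is a small sketch using the group B rows from the question; the names as_series, as_frame and reshaped are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

g = pd.DataFrame({'date_delta': [0.0, 1.0, 2.0], 'value': [16, 13, 13]})

as_series = g['date_delta']                           # 1D, shape (3,)
as_frame = g[['date_delta']]                          # 2D, shape (3, 1)
reshaped = g['date_delta'].to_numpy().reshape(-1, 1)  # 2D, shape (3, 1)
print(as_series.shape, as_frame.shape, reshaped.shape)

# Either 2D form is accepted by fit; y is left 1D here.
model = LinearRegression().fit(reshaped, g['value'].to_numpy())
print(model.coef_[0], model.intercept_)
```

Selecting with double brackets, df[['date_delta']], already yields a 2D object, so the extra reshape is only needed when starting from a Series.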

1
As a new user I cannot comment, so I am writing this as a new answer. To resolve the error:
Runtime Error: ValueError : Expected 2D array, got scalar array instead

you need to reshape the delta value in this line:
return np.squeeze(LinearRegression().fit(X, y).predict(np.array(delta).reshape(1, -1)))

Credit goes to piRSquared.
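Putting the pieces of this thread together, one possible end-to-end version looks like the sketch below. It is not piRSquared's code: it uses an explicit loop over the groups instead of groupby.apply, copies the input rather than modifying it, and assumes the dates are month first. The function name predict_per_group and the fmt parameter are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'group': list('AAAAAABBBCC'),
    'date': ['01-02-2016', '01-03-2016', '01-04-2016', '01-05-2016',
             '01-06-2016', '01-07-2016', '01-02-2016', '01-03-2016',
             '01-04-2016', '01-02-2016', '01-03-2016'],
    'value': [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})

def predict_per_group(df, date, fmt='%m-%d-%Y'):
    """Fit one line per group and predict the value at `date`."""
    df = df.copy()  # leave the caller's DataFrame untouched
    df['date'] = pd.to_datetime(df['date'], format=fmt)
    day = pd.Timedelta(days=1)
    mn = df['date'].min()
    df['date_delta'] = (df['date'] - mn) / day
    # predict needs a 2D array, hence the nested brackets
    target = np.array([[(pd.to_datetime(date, format=fmt) - mn) / day]])

    preds = {}
    for name, g in df.groupby('group'):
        X = g[['date_delta']].to_numpy()
        y = g['value'].to_numpy()
        preds[name] = float(LinearRegression().fit(X, y).predict(target)[0])
    return pd.Series(preds, name=date).rename_axis('group')

predictions = predict_per_group(df, '01-10-2016')
print(predictions)
```

The result is a Series indexed by group and named after the target date; calling predictions.reset_index() turns it into the two-column DataFrame the question asks for.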

