多项式回归使用statsmodels.formula.api

Question

多项式回归使用statsmodels.formula.api

8

请原谅我的无知。我只是想在回归中添加一个平方项，而不必麻烦地在数据框中定义一个新的列。我正在使用statsmodels.formula.api（作为stats），因为其格式类似于R语言，而我更熟悉R。

hours_model = stats.ols(formula='act_hours ~ h_hours + C(month) + trend', data = df).fit()

以上按预期工作。

hours_model = stats.ols(formula='act_hours ~ h_hours + h_hours**2 + C(month) + trend', data = df).fit()

这个省略了h_hours**2并返回与上一行相同的输出。

我还尝试过: h_hours^2，math.pow(h_hours,2)和poly(h_hours,2) 所有这些都会引发错误。

任何帮助将不胜感激。

- Matthew Withrow

h_hours 的数据类型是什么？如果它被视为分类变量，那么你需要将其转换为浮点数或其他被 patsy 视为数值型的类型。 - Josef

请查看sklearn.preprocessing.PolynomialFeatures，它会有所帮助。 - GIRISH kuniyal

@Josef，感谢您的回复。df ['h_hours']的数据类型为float64。 - Matthew Withrow

@GIRISHkuniyal，谢谢。我研究了一下，但我认为它不适合我想做的事情。我只是在寻找一个没有任何交互作用的平方项。 - Matthew Withrow

3个回答

4

目前，虽然 statsmodels 公式 API（实际上是 Patsy 库）不支持R中的 poly(variable, degree) 函数，但是NumPy的 vander(variable, degree+1) 可以完成此任务。请注意，np.vander() 会生成范德蒙矩阵，这意味着您也会得到拦截列！让我们看一个示例：

>> x = np.array([1, 2, 3, 5])
>> np.vander(x, 4, increasing=True)

array([[  1,   1,   1,   1],
       [  1,   2,   4,   8],
       [  1,   3,   9,  27],
       [  1,   5,  25, 125]])

所以，您需要通过在公式中添加-1来删除Patsy的内部截距：

hours_model = stats.ols(formula='act_hours ~ np.vander(h_hours, 3, increasing=True) - 1', data = df).fit()

请注意，需要传递 您所需的度数 + 1，因为第一列为 x^0=1。

- Mohammad-Reza Malekpour

0

这是基于Mohammad-Reza Malekpour的答案，但从Vandermonde矩阵中去除了截距列。

import numpy as np


def poly(x, degree):
    return np.vander(x, degree + 1, increasing=True)[:, 1:]

例子：

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

n = 30
np.random.seed(42)
x = np.random.uniform(0, 1, size=n)
y = 1 + 2 * x + 3 * x**2 + 4 * x**3 + np.random.normal(0, 0.001, size=n)
df = pd.DataFrame({"x": x, "y": y})

results = smf.ols(formula="y ~ poly(x, 3)", data=df).fit()
print(results.summary().tables[1])

=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         1.0009      0.001   1599.564      0.000       1.000       1.002
poly(x, 3)[0]     1.9935      0.006    361.673      0.000       1.982       2.005
poly(x, 3)[1]     3.0124      0.013    232.711      0.000       2.986       3.039
poly(x, 3)[2]     3.9919      0.009    463.711      0.000       3.974       4.010
=================================================================================

与之相比：

results = smf.ols(formula="y ~ x + I(x**2) + I(x**3)", data=df).fit()
print(results.summary().tables[1])

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0009      0.001   1599.564      0.000       1.000       1.002
x              1.9935      0.006    361.673      0.000       1.982       2.005
I(x ** 2)      3.0124      0.013    232.711      0.000       2.986       3.039
I(x ** 3)      3.9919      0.009    463.711      0.000       3.974       4.010
==============================================================================

- tueda

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- StupidWolf · Accepted Answer

你可以尝试在Python中使用类似于R中的I()函数：

import statsmodels.formula.api as smf

np.random.seed(0)

df = pd.DataFrame({'act_hours':np.random.uniform(1,4,100),'h_hours':np.random.uniform(1,4,100),
                  'month':np.random.randint(0,3,100),'trend':np.random.uniform(0,2,100)})

model = 'act_hours ~ h_hours + I(h_hours**2)'
hours_model = smf.ols(formula = model, data = df)

hours_model.exog[:5,]

array([[ 1.        ,  3.03344961,  9.20181654],
       [ 1.        ,  1.81002392,  3.27618659],
       [ 1.        ,  3.20558207, 10.27575638],
       [ 1.        ,  3.88656564, 15.10539244],
       [ 1.        ,  1.74625943,  3.049422  ]])