Efficient online linear regression algorithm in Python


I have a 2D dataset with two columns, x and y. I'd like to obtain the linear-regression coefficient and intercept on the fly as new data comes in. With scikit-learn, I can fit on all currently available data like this:

import numpy as np
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
x = np.arange(100).reshape(-1, 1)  # sklearn expects a 2-D feature array
y = np.arange(100) + 10 * np.random.random_sample((100,))
regr.fit(x, y)
print(regr.coef_)
print(regr.intercept_)

However, my dataset is large (over 10k rows in total), and I want to recompute the coefficient and intercept as fast as possible whenever a new row arrives. Currently it takes about 600 microseconds to compute 10k rows, and I'd like to speed that up. Scikit-learn does not appear to offer an online-update facility for its linear regression module. Is there a better way?

In sklearn, only the estimators noted here have online-learning capability. - Vivek Kumar
@VivekKumar Is there another formula or package that solves this problem? - Kevin Fang
sklearn.linear_model.SGDRegressor is a linear regression, but instead of least squares it uses gradient descent. You should try it and see whether the output is close enough (or at least whether the loss is the same). Also, SGD (stochastic gradient descent) is much faster on large datasets with high feature dimensionality. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor - Eran Moshe
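
A minimal sketch of that suggestion, using SGDRegressor's partial_fit for incremental updates (feature scaling assumed, since SGD is scale-sensitive):

import numpy as np
from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor()
x = np.arange(100).reshape(-1, 1) / 100.0  # scale features into [0, 1)
y = np.arange(100) + 10 * np.random.random_sample((100,))

# Feed the data in chunks as it "arrives"; each call refines the model.
for start in range(0, 100, 10):
    sgd.partial_fit(x[start:start + 10], y[start:start + 10])

print(sgd.coef_, sgd.intercept_)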
4 Answers


I found a solution in this paper: Updating Simple Linear Regression. The implementation is as follows:

import numpy as np

def lr(x_avg, y_avg, Sxy, Sx, n, new_x, new_y):
    """
    x_avg: average of previous x, 0 if no previous sample
    y_avg: average of previous y, 0 if no previous sample
    Sxy: sum of cross-deviations of previous x and y (n times their covariance), 0 if no previous sample
    Sx: sum of squared deviations of previous x (n times its variance), 0 if no previous sample
    n: number of previous samples
    new_x: new incoming 1-D numpy array x
    new_y: new incoming 1-D numpy array y
    """
    new_n = n + len(new_x)

    new_x_avg = (x_avg*n + np.sum(new_x))/new_n
    new_y_avg = (y_avg*n + np.sum(new_y))/new_n

    if n > 0:
        x_star = (x_avg*np.sqrt(n) + new_x_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
        y_star = (y_avg*np.sqrt(n) + new_y_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
    elif n == 0:
        x_star = new_x_avg
        y_star = new_y_avg
    else:
        raise ValueError

    new_Sx = Sx + np.sum((new_x-x_star)**2)
    new_Sxy = Sxy + np.sum((new_x-x_star).reshape(-1) * (new_y-y_star).reshape(-1))

    beta = new_Sxy/new_Sx
    alpha = new_y_avg - beta * new_x_avg
    return new_Sxy, new_Sx, new_n, alpha, beta, new_x_avg, new_y_avg
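
For example, the state can be seeded with an initial batch and then updated one row at a time (a quick sanity check, assuming the lr function above and numpy are in scope):

np.random.seed(0)
x = np.arange(1000)
y = 3.0 * x + 5.0 + np.random.random_sample(1000)

# Seed the state with a small batch, then stream the remaining rows.
Sxy, Sx, n, alpha, beta, x_avg, y_avg = lr(0, 0, 0, 0, 0, x[:10], y[:10])
for xi, yi in zip(x[10:], y[10:]):
    Sxy, Sx, n, alpha, beta, x_avg, y_avg = lr(
        x_avg, y_avg, Sxy, Sx, n, np.array([xi]), np.array([yi]))

print(beta, alpha)  # approximately 3.0 and 5.5 (intercept absorbs the noise mean)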

Performance comparison:

The scikit-learn version computes all 10k samples at once:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10000).reshape(-1, 1)
y = np.arange(10000) + 100 * np.random.random_sample((10000,))
regr = LinearRegression()
%timeit regr.fit(x,y)
# 419 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

My version, assuming 9k samples have already been processed:

Sxy, Sx, n, alpha, beta, new_x_avg, new_y_avg = lr(0, 0, 0, 0, 0, x[:9000], y[:9000])
new_x, new_y = x[9000:], y[9000:]
%timeit lr(new_x_avg, new_y_avg, Sxy,Sx,n,new_x, new_y)
# 38.7 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Roughly a 10x speed-up, as expected.


Do you get similar predictions/coefficients to what sklearn gives? - Vivek Kumar
@VivekKumar They give identical coefficients and intercepts. - Kevin Fang
Isn't Sxy actually n times the covariance, and Sx actually n times the variance of x? - Jose Solorzano

I adapted Kevin Fang's answer into a class:
import numpy as np

class OnlineLinearRegression:
    """
    Online linear regression in O(1) mem & compute
    """
    def __init__(self):
        # Average of all x seen
        self.x_avg = 0.
        # Average of all y seen
        self.y_avg = 0.
        # Covariance of all x and y seen
        self.xy_covar = 0.
        # Variance of all x seen
        self.x_var = 0.
        # Number of observations seen
        self.n = 0

    @property
    def parameters(self):
        """
        :return: the parameters of the linear regression (beta, alpha) such that y = beta * x + alpha. If there are
        less than 2 observations, returns (None, None).
        """
        if self.n < 2:
            return None, None
        else:
            beta = self.xy_covar / self.x_var
            alpha = self.y_avg - beta * self.x_avg
            return beta, alpha

    def update_multiple(self, new_x: np.ndarray, new_y: np.ndarray):
        assert len(new_x) == len(new_y)
        new_n = self.n + len(new_x)

        new_x_avg = (self.x_avg * self.n + np.sum(new_x)) / new_n
        new_y_avg = (self.y_avg * self.n + np.sum(new_y)) / new_n

        if self.n:
            x_star = (self.x_avg * np.sqrt(self.n) + new_x_avg * np.sqrt(new_n)) / (np.sqrt(self.n) + np.sqrt(new_n))
            y_star = (self.y_avg * np.sqrt(self.n) + new_y_avg * np.sqrt(new_n)) / (np.sqrt(self.n) + np.sqrt(new_n))
        else:
            x_star = new_x_avg
            y_star = new_y_avg

        self.n = new_n
        self.x_avg = new_x_avg
        self.y_avg = new_y_avg

        self.x_var = self.x_var + np.sum((new_x - x_star) ** 2)
        self.xy_covar = self.xy_covar + np.sum((new_x - x_star).reshape(-1) * (new_y - y_star).reshape(-1))

    def update(self, x: float, y: float):
        self.update_multiple(np.array([x]), np.array([y]))
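
A brief usage example with made-up data (y = 2x + 1, so the recovered parameters should be exactly beta = 2 and alpha = 1):

olr = OnlineLinearRegression()
olr.update_multiple(np.arange(100, dtype=float), 2.0 * np.arange(100) + 1.0)
olr.update(100.0, 201.0)  # a single new observation, consistent with the line
beta, alpha = olr.parameters
print(beta, alpha)  # 2.0, 1.0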


You can use accelerated libraries that implement faster algorithms - in particular https://github.com/intel/scikit-learn-intelex

For linear regression, you will get much better performance.

First, install the package:

pip install scikit-learn-intelex

Then add the following to your Python script:

from sklearnex import patch_sklearn
patch_sklearn()
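
Putting it together, a minimal sketch (note that patch_sklearn() must be called before the sklearn estimators are imported, or they must be re-imported afterwards):

from sklearnex import patch_sklearn
patch_sklearn()

# Import *after* patching so the accelerated implementation is used.
from sklearn.linear_model import LinearRegression
import numpy as np

x = np.arange(10000).reshape(-1, 1)
y = np.arange(10000) + 100 * np.random.random_sample((10000,))
regr = LinearRegression().fit(x, y)
print(regr.coef_, regr.intercept_)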

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review] - Simas Joneliunas


Great! Thanks for sharing your findings :) Here is an equivalent implementation of the solution, written in terms of dot products:

import numpy as np

class SimpleLinearRegressor(object):
    def __init__(self):
        self.dots = np.zeros(5)
        self.intercept = None
        self.slope = None

    def update(self, x: np.ndarray, y: np.ndarray):
        self.dots += np.array(
            [
                x.shape[0],
                x.sum(),
                y.sum(),
                np.dot(x, x),
                np.dot(x, y),
            ]
        )
        size, sum_x, sum_y, sum_xx, sum_xy = self.dots
        det = size * sum_xx - sum_x ** 2
        if det > 1e-10:  # determinant may be zero initially
            self.intercept = (sum_xx * sum_y - sum_xy * sum_x) / det
            self.slope = (sum_xy * size - sum_x * sum_y) / det

When working with time-series data, this idea can be extended to a sliding-window regression with a soft (EMA-like) window, as sketched below.
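
A hedged sketch of that extension: decaying the accumulated sums by a forgetting factor before each update yields an exponentially weighted ("soft window") fit. The decay value is illustrative; 1.0 recovers the unwindowed regressor above.

class SoftWindowLinearRegressor(SimpleLinearRegressor):
    def __init__(self, decay: float = 0.99):
        super().__init__()
        self.decay = decay  # forgetting factor, applied once per update call

    def update(self, x: np.ndarray, y: np.ndarray):
        self.dots *= self.decay  # down-weight all previously seen observations
        super().update(x, y)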

