最优的双变量线性回归计算

12

问题

我想将 y = mx + b 方程(其中 m 为 SLOPE,b 为 INTERCEPT)应用于一个数据集,该数据集如 SQL 代码所示获取。查询(MySQL)结果的值为:

SLOPE = 0.0276653965651912
INTERCEPT = -57.2338357550468

SQL 代码

SELECT
  ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
  (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,

  ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
  (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
  (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
FROM
(SELECT
  D.AMOUNT,
  Y.YEAR
FROM
  CITY C, STATION S, YEAR_REF Y, MONTH_REF M, DAILY D
WHERE
  -- For a specific city ...
  --
  C.ID = 8590 AND
  -- Find all the stations within a 15 unit radius ...
  --
  SQRT( POW( C.LATITUDE - S.LATITUDE, 2 ) + POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < 15 AND
  -- Gather all known years for that station ...
  --
  S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
  -- The data before 1900 is shaky; insufficient after 2009.
  --
  Y.YEAR BETWEEN 1900 AND 2009 AND
  -- Filtered by all known months ...
  --
  M.YEAR_REF_ID = Y.ID AND
  -- Whittled down by category ...
  --
  M.CATEGORY_ID = '001' AND
  -- Into the valid daily climate data.
  --
  M.ID = D.MONTH_REF_ID AND
  D.DAILY_FLAG_ID <> 'M'
  GROUP BY Y.YEAR
  ORDER BY Y.YEAR
) t

问题

以下结果(用于计算线的起始点和结束点)似乎不正确。为什么结果偏差约10度(例如,异常值扭曲了数据)?

(1900 * 0.0276653965651912) + (-57.2338357550468) = -4.66958228

(2009 * 0.0276653965651912) + (-57.2338357550468) = -1.65405406

(请注意,数据不再与图片匹配;代码。)

我本来期望1900年的结果大约是10(而不是-4.67),而2009年的结果大约是11.50(而不是-1.65)。

相关网站

2个回答

1

现在已经验证为正确:

SELECT
  ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
  (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,

  ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
  (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
  (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,

  ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
  (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
FROM (
  SELECT
    AVG(D.AMOUNT) as AMOUNT,
    Y.YEAR as YEAR
  FROM
    CITY C,
    STATION S,
    YEAR_REF Y,
    MONTH_REF M,
    DAILY D
  WHERE
    C.ID = 8590 AND

    SQRT(
      POW( C.LATITUDE - S.LATITUDE, 2 ) +
      POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < 15 AND

    S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND

    Y.YEAR BETWEEN 1900 AND 2009 AND

    M.YEAR_REF_ID = Y.ID AND

    M.CATEGORY_ID = '001' AND

    M.ID = D.MONTH_REF_ID AND
    D.DAILY_FLAG_ID <> 'M'
  GROUP BY
    Y.YEAR
) t

请查看图片以获取有关斜率、截距和(皮尔逊)相关性的详细信息。

4
恭喜你。也许有人应该和你谈谈 R(http://www.r-project.org/)。你可以用一行代码完成。 - gd047
请记住,回归中的响应变量被假定为彼此独立。但是,如果数据是随时间收集的(这在您的情况下是正确的),并且相互靠近的观测值是相关的,则该假设有时不适用。缺乏独立性不会对最小二乘系数估计产生偏差,但标准误差会受到严重影响。 - gd047
在Google上搜索“线性回归假设”独立性“串行相关性”,例如http://www.basic.northwestern.edu/statguidefiles/mulreg_ass_viol.html#Lack%20of%20independence。 - gd047
如果你使用R语言,尝试更复杂的模型将会变得容易。自回归模型在这里可能会很有用。 - Jonathan Chang

1

尝试拆分函数,您计算参数时出现了错误。请参考这里

我会做类似以下的事情(请原谅我对SQL语法和临时变量记忆不太清楚,所以代码可能实际上是错误的):

SELECT

sum(t.YEAR) / count(1) AS avgX,

sum(t.AMOUNT) / count(1) AS avgY,

sum(t.AMOUNT*t.YEAR) / count(1) AS avgXY,

sum(power(t.YEAR, 2)) / count(1) AS avgXsq,

( avgXY - avgX * avgY ) / ( avgXsq - power(avgX, 2) )  as SLOPE,

avgY - SLOPE * avgX as INTERCEPT,

数字现在没问题了;之前可能有数据冲突。 - Dave Jarvis

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接