scipy.curve_fit与numpy.polyfit有不同的协方差矩阵。

Question

scipy.curve_fit与numpy.polyfit有不同的协方差矩阵。

7

我使用Python 3.6进行数据拟合。最近，我遇到了以下问题，由于缺乏经验，不确定如何处理。

如果我在同一组数据点上使用numpy.polyfit(x, y, 1, cov=True)和scipy.curve_fit(lambda: x, a, b: a*x+b, x, y)，得到的系数a和b几乎相同。但是，scipy.curve_fit的协方差矩阵的值大约是numpy.polyfit的值的一半。

由于我想使用协方差矩阵的对角线来估计系数的不确定度（u = numpy.sqrt(numpy.diag(cov))），我有三个问题：

1. 哪个协方差矩阵才是正确的（我应该使用哪个）？ 2. 为什么会有差异？ 3. 需要什么才能使它们相等？

谢谢！编辑：

import numpy as np
import scipy.optimize as sc

data = np.array([[1,2,3,4,5,6,7],[1.1,1.9,3.2,4.3,4.8,6.0,7.3]]).T

x=data[:,0]
y=data[:,1]

A=np.polyfit(x,y,1, cov=True)
print('Polyfit:', np.diag(A[1]))

B=sc.curve_fit(lambda x,a,b: a*x+b, x, y)
print('Curve_Fit:', np.diag(B[1]))

如果我使用 statsmodels.api，结果将对应于 curve_fit 的结果。

- user168000

1

欢迎来到 Stack Overflow！您能为我们准备两个最小化的示例以便我们能够看到这一点吗？请查看 [mcve]。（适合3或4个点即可） - kvantour

2个回答

2

这两种方法以不同的方式计算协方差。我不确定polyfit使用的方法，但是curve_fit通过求模型雅可比矩阵J的转置乘以J的逆来估计协方差矩阵。观察polyfit的代码，似乎他们对lhs.T.dot(lhs)进行了求逆操作，其中lhs被定义为范德蒙矩阵，尽管我必须承认我不知道第二种方法的数学背景。

至于你关于哪个方法是正确的问题，polyfit的代码有以下注释：

# Some literature ignores the extra -2.0 factor in the denominator, but
#  it is included here because the covariance of Multivariate Student-T
#  (which is implied by a Bayesian uncertainty analysis) includes it.
#  Plus, it gives a slightly more conservative estimate of uncertainty.

基于此，以及您的观察结果，似乎polyfit始终比curve_fit给出更大的估计值。这是有道理的，因为J.T.dot(J)是协方差矩阵的一阶近似。因此，在怀疑时，过高估计误差总是更好的选择。

但是，如果您知道数据中的测量误差，请建议同时提供它们，并使用absolute_sigma=True调用curve_fit。从我的tests中得出的结论是，这样做可以匹配预期的分析结果，因此我很想知道在提供测量误差时哪个方法更好。

- tBuLi

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Daniel F · Accepted Answer

我猜想这与这里有关，它涉及到IT技术。

593          # Some literature ignores the extra -2.0 factor in the denominator, but 
594          #  it is included here because the covariance of Multivariate Student-T 
595          #  (which is implied by a Bayesian uncertainty analysis) includes it. 
596          #  Plus, it gives a slightly more conservative estimate of uncertainty. 
597          if len(x) <= order + 2: 
598              raise ValueError("the number of data points must exceed order + 2 " 
599                               "for Bayesian estimate the covariance matrix") 
600          fac = resids / (len(x) - order - 2.0) 
601          if y.ndim == 1: 
602              return c, Vbase * fac 
603          else: 
604              return c, Vbase[:,:, NX.newaxis] * fac

在这种情况下，len(x) - order为4，(len(x) - order - 2.0)为2，这就解释了为什么您的值相差2倍。

这回答了问题2。对于更大的len(x)，问题3的答案可能是“获取更多数据”，因为差异可能会变得可以忽略不计。

哪个公式是正确的（问题1）可能是一个问题，需要请教Cross Validated，但我认为应该使用curve_fit，因为它明确旨在计算您所述的不确定性。从文档中可以看出：

pcov：2d数组

popt的估计协方差。对角线提供参数估计的方差。要计算参数的标准偏差，请使用perr = np.sqrt(np.diag(pcov))。

虽然上面polyfit代码的注释表明其意图更多是进行学生T分析。