理解高斯混合模型

17

我正在尝试理解scikit-learn高斯混合模型实现的结果。请看以下示例:

#!/opt/local/bin/python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Define simple gaussian
def gauss_function(x, amp, x0, sigma):
    return amp * np.exp(-(x - x0) ** 2. / (2. * sigma ** 2.))

# Generate sample from three gaussian distributions
samples = np.random.normal(-0.5, 0.2, 2000)
samples = np.append(samples, np.random.normal(-0.1, 0.07, 5000))
samples = np.append(samples, np.random.normal(0.2, 0.13, 10000))

# Fit GMM
gmm = GaussianMixture(n_components=3, covariance_type="full", tol=0.001)
gmm = gmm.fit(X=np.expand_dims(samples, 1))

# Evaluate GMM
gmm_x = np.linspace(-2, 1.5, 5000)
gmm_y = np.exp(gmm.score_samples(gmm_x.reshape(-1, 1)))

# Construct function manually as sum of gaussians
gmm_y_sum = np.full_like(gmm_x, fill_value=0, dtype=np.float32)
for m, c, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), 
               gmm.weights_.ravel()):
    gmm_y_sum += gauss_function(x=gmm_x, amp=w, x0=m, sigma=np.sqrt(c))

# Normalize so that integral is 1    
gmm_y_sum /= np.trapz(gmm_y_sum, gmm_x)

# Make regular histogram
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=[8, 5])
ax.hist(samples, bins=50, normed=True, alpha=0.5, color="#0070FF")
ax.plot(gmm_x, gmm_y, color="crimson", lw=4, label="GMM")
ax.plot(gmm_x, gmm_y_sum, color="black", lw=4, label="Gauss_sum")

# Annotate diagram
ax.set_ylabel("Probability density")
ax.set_xlabel("Arbitrary units")

# Draw legend
plt.legend()
plt.show()

代码生成的图形

首先我生成了由高斯分布构成的样本分布,然后对这些数据进行高斯混合模型拟合。接下来,我想计算给定输入的概率。方便的是,scikit实现提供了score_samples方法来完成这一任务。现在我正在尝试理解这些结果。我一直以为,我可以从GMM拟合中取出高斯分布的参数,通过对它们求和并将积分标准化到1来构建完全相同的分布。但是,正如您在图中所看到的那样,从score_samples方法中绘制的样本(红线)完全适合于原始数据(蓝色直方图),而手动构建的分布(黑线)却不适合。我希望了解我的想法错在哪里以及为什么不能通过对GMM拟合给出的高斯分布进行求和来构建分布!?!非常感谢任何意见!

1个回答

23

以防未来有人对同样的问题感到困惑:需要归一化各个分量,而不是总和:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Define simple gaussian
def gauss_function(x, amp, x0, sigma):
    return amp * np.exp(-(x - x0) ** 2. / (2. * sigma ** 2.))

# Generate sample from three gaussian distributions
samples = np.random.normal(-0.5, 0.2, 2000)
samples = np.append(samples, np.random.normal(-0.1, 0.07, 5000))
samples = np.append(samples, np.random.normal(0.2, 0.13, 10000))

# Fit GMM
gmm = GaussianMixture(n_components=3, covariance_type="full", tol=0.001)
gmm = gmm.fit(X=np.expand_dims(samples, 1))

# Evaluate GMM
gmm_x = np.linspace(-2, 1.5, 5000)
gmm_y = np.exp(gmm.score_samples(gmm_x.reshape(-1, 1)))

# Construct function manually as sum of gaussians
gmm_y_sum = np.full_like(gmm_x, fill_value=0, dtype=np.float32)
for m, c, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_.ravel()):
    gauss = gauss_function(x=gmm_x, amp=1, x0=m, sigma=np.sqrt(c))
    gmm_y_sum += gauss / np.trapz(gauss, gmm_x) * w

# Make regular histogram
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=[8, 5])
ax.hist(samples, bins=50, normed=True, alpha=0.5, color="#0070FF")
ax.plot(gmm_x, gmm_y, color="crimson", lw=4, label="GMM")
ax.plot(gmm_x, gmm_y_sum, color="black", lw=4, label="Gauss_sum", linestyle="dashed")

# Annotate diagram
ax.set_ylabel("Probability density")
ax.set_xlabel("Arbitrary units")

# Make legend
plt.legend()

plt.show()

输入图像描述


感谢您发布答案。 - Jeru Luke
非常简洁明了,谢谢。我之前在将数据传递给GaussianMixture.fit时遇到了很多问题,因为我没有意识到需要使用np.expand_dims(samples, 1).shape的形式而不是samples.shape - FriskyGrub
1
现在你该如何计算新测试样本X的概率(以便估计数据点是否是新奇的)?据我所知,np.exp(gmm.score_samples(X))给出的是X处概率密度函数的值,而不是X的概率。 - felipeduque

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接