如何绘制正态分布曲线并结合中心极限定理？

Question

如何绘制正态分布曲线并结合中心极限定理？

3

我正在尝试获得一条正态分布曲线，以便表示我的中心极限数据分布。

下面是我尝试过的实现方式。

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import math

# 1000 simulations of die roll
n = 10000

avg = []
for i in range(1,n):#roll dice 10 times for n times
    a = np.random.randint(1,7,10)#roll dice 10 times from 1 to 6 & capturing each event
    avg.append(np.average(a))#find average of those 10 times each time

plt.hist(avg[0:])

zscore = stats.zscore(avg[0:])

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, normed=True)

# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2)))

我得到了以下图形：

您可以在底部看到红色的正常曲线。

有人能告诉我为什么曲线不适合吗？

- penta

你可能需要对其中一个进行缩放。正态分布的最大值在1以上，如果我没记错的话，而你的图形高达2500。 - Uwe Ziegenhagen

要么将正态分布缩放到最大值（约为2700），要么使用ax.twinx()。 - f.wue

你能用代码展示一下吗？ - penta

4个回答

1

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import math

# 1000 simulations of die roll
n = 10000

avg = []
for i in range(1,n):#roll dice 10 times for n times
    a = np.random.randint(1,7,10)#roll dice 10 times from 1 to 6 & capturing each event
    avg.append(np.average(a))#find average of those 10 times each time

plt.hist(avg[0:],20,normed=True)

zscore = stats.zscore(avg[0:])

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, normed=True)

# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2)))

我刚刚缩小了平均列表直方图。

图表：

- Justice_Lords

能否请您再详细解释一下？因为我无法理解即使我注释掉最后一个绘制直方图的代码行，为什么还是会出现两个直方图。 - penta

正如注释所示，它绘制的是曲线而不是直方图。那个绿色的曲线在图中。注释掉最后一行只会移除绿色曲线。 - Justice_Lords

0

逻辑看起来是正确的。

问题在于数据的显示。

尝试使用normed=true对第一个直方图进行归一化，并为两个直方图设置相等的区间，例如20个区间。

- Amin Taheri

0

掷骰子是一种均匀分布的情况。任何从1到6的数字出现的概率都是1/6。因此，平均值和标准偏差由以下公式给出：

现在，中心极限定理表明，对于足够大的n值（代码中为10），n次投掷的平均数的概率密度函数将逐渐接近于均值为3.5，标准差为1.7078/sqrt(10)的正态分布。

n_bins=50
pdf_from_hist, bin_edges=np.histogram(np.array(avg), bins=n_bins, density=True)
bin_mid_pts= np.add(bin_edges[:-1], bin_edges[1:])*0.5
assert(len(list(pdf_from_hist))  == len(list(bin_mid_pts)))
expected_std=1.7078/math.sqrt(10)
expected_mean=3.5
pk_s=[]
qk_s=[]
for i in range(n_bins):
    p=stat.norm.pdf(bin_mid_pts[i], expected_mean, expected_std) 
    q=pdf_from_hist[i]
    if q <= 1.0e-5:
        continue
    pk_s.append(p)
    qk_s.append(q)
#compute the kl divergence
kl_div=stat.entropy(pk_s, qk_s)
print('the pdf of the mean of the 10 throws differ from the corresponding normal dist with a kl divergence of %r' % kl_div)

- koshy george

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- jwalton · Accepted Answer

你差点就做对了！首先，要看到你正在同一坐标轴上绘制两个直方图：

plt.hist(avg[0:])

并且

plt.hist(s, 20, normed=True)

为了在直方图上绘制正态密度，您正确地使用normed=True参数对第二个图进行了标准化。但是，您忘记了对第一个直方图进行标准化（plt.hist(avg[0:]), normed=True）。

我建议您既然已经导入了scipy.stats，最好使用该模块中提供的正态分布，而不是自己编写pdf。

将所有这些放在一起，我们有：

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# 1000 simulations of die roll
n = 10000

avg = []
for i in range(1,n):
    a = np.random.randint(1,7,10)
    avg.append(np.average(a))

# CHANGED: normalise this histogram too
plt.hist(avg[0:], 20, normed=True)

zscore = stats.zscore(avg[0:])

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, normed=True)

# Use scipy.stats implementation of the normal pdf
# Plot the distribution curve
x = np.linspace(1.5, 5.5, num=100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))

这给了我以下情节：

编辑

在评论中，您问：

我如何在 np.linspace 中选择 1.5 和 5.5？
是否可以在未归一化的直方图上绘制正态核密度估计图？

首先回答问题1，我通过观察选择了1.5和5.5。在绘制直方图后，我看到直方图的区间看起来在1.5和5.5之间，因此我们希望在这个范围内绘制正态分布。

更加程序化的选择这个范围的方法是：

x = np.linspace(bins.min(), bins.max(), num=100)

关于第二个问题，是的，我们可以实现您想要的。但是，您应该知道我们将不再绘制概率密度函数。

在绘制直方图时，删除“normed=True”参数后：

x = np.linspace(bins.min(), bins.max(), num=100)

# Find pdf of normal kernel at mu
max_density = stats.norm.pdf(mu, mu, sigma)
# Calculate how to scale pdf
scale = count.max() / max_density

plt.plot(x, scale * stats.norm.pdf(x, mu, sigma))

这给了我以下的情节：