How to correctly plot a histogram with numpy and match it with the density function?

Question

python numpy matplotlib statistics probability-density

3

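TL;DR: How do you correctly plot the result of np.histogram(..., density=True) with Numpy?

Using density=True should make the histogram of the sample match the density function of the underlying random variable, but it doesn't: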

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
y = np.random.randn(10000)
h, bins = np.histogram(y, bins=1000, density=True)
plt.bar(bins[:-1], h)
x = np.linspace(-10, 10, 100)
f = scipy.stats.norm.pdf(x)
plt.plot(x, f, color="green")
plt.show()

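Why aren't the histogram and the probability density function scaled consistently?

In this case, eyeballing the plot suggests that a scaling factor of 1.6 fits better: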

plt.plot(x, 1.6 * f, color="green")

Also, this works as expected:

plt.hist(y, bins=100, density=True)

Why?

- Basj

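The figure isn't very clear; I only see the histogram, not the density function. As for the histogram, it looks OK to me: it should have a total area of 1, and eyeballing it, it averages about 0.2 over the range -2.5 to 2.5, which seems plausible. If you add up all the rectangles, what do you get?

@RobertDodier I updated the screenshot. If you run my example code, do you see the same behavior?

OK, when I run your code I see the same figure as the first one you showed, which, as you pointed out, is wrong. However, I notice that h.sum()*0.0075762 gives 1.0000002056823407, where 0.0075762 is the bin spacing found by inspecting bins. This suggests the histogram is constructed correctly, so the problem is in the plotting.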

1

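Looking more closely, it appears that plt.bar(bins[:-1], h) draws bars that are much wider than the histogram bins: the bins are about 0.007 wide, but the bars displayed are much wider, maybe 0.5 or more. Does plt.bar have a notion of a minimum allowed width? In the second figure I see that the bars are much narrower.

@RobertDodier Oh right, maybe we should use plt.bar with np.histogram in a different way?

Maybe there is a way to tell plt.bar to forget about a minimum bar width, or whatever else is preventing it from displaying the histogram accurately.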
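
The area check described in the comments above can be reproduced without reading the bin spacing off by hand. This is only a minimal sketch: the fixed seed and the variable names `widths` and `area` are assumptions added for illustration, not part of the original question.

```python
import numpy as np

# Fixed seed is an assumption, used only to make the sketch reproducible.
rng = np.random.default_rng(0)
y = rng.standard_normal(10000)

h, bins = np.histogram(y, bins=1000, density=True)

widths = np.diff(bins)             # actual width of every bin
area = float((h * widths).sum())   # total area under the density histogram

print(area)        # should be 1 up to floating-point rounding
print(widths[0])   # far narrower than the bars that plt.bar draws by default
```

If the area comes out as 1, the histogram values themselves are fine and only the drawing step is suspect.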

3 Answers

1

How about using automatic bins?

h, bins = np.histogram(y, bins='auto', density=True)

- nisakova

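Yes, that does work, but why doesn't it work with bins=1000 or 100? (In that case the scaling is wrong. Why?)

In that case, I think it depends on the data size. If you increase the size of the input data and use bins=1000 or bins=100, it works fine.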

0

import scipy.stats  # import the submodule explicitly; a bare "import scipy" may not load scipy.stats
import numpy as np
import matplotlib.pyplot as plt
y = np.random.randn(10000)
h, bins = np.histogram(y, bins=100, density=True)
x = np.linspace(bins.min(), bins.max(), 100)
f = scipy.stats.norm.pdf(x)
plt.bar(bins[:-1], h, width=np.diff(bins), align="edge")  # anchor each bar at its left bin edge
plt.plot(x, f, color="green")
plt.xlabel("x")
plt.ylabel("Density")
plt.title("Histogram and Density Function")
plt.show()

- Harsh Chitaliya

Thanks @HarshChitaliya! In the future, please post a screenshot of the result; that would be more helpful.

- cottontail · Accepted Answer

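In short:

The default bar width of .bar() is too large (.hist() adjusts the width automatically).
The default figure size is too small for the number of bars (which is why 100 bars look fine but 1000 do not).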

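In `Axes.hist`, the bar widths are computed with `np.diff(bins)` (source code). Because it accepts multi-dimensional arrays, there is a lot of validation and reshaping behind the scenes, but setting all of that aside, for a 1D array `Axes.hist` is just a wrapper around `np.histogram` and `Axes.bar`, whose (abridged) code looks like this: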

height, bins = np.histogram(x, bins)
width = np.diff(bins)
boffset = 0.5 * width
Axes.bar(bins[:-1]+boffset, height, width)
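
To check that equivalence for a 1D sample, one could rebuild the bars by hand and compare the resulting rectangles with those returned by `Axes.hist`. This is a rough sketch, not matplotlib's actual implementation: the seed, the choice of 100 bins, the Agg backend, and the names `same_heights`, `same_lefts` and `same_widths` are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

y = np.random.default_rng(0).standard_normal(10000)  # seed chosen arbitrarily

fig, (ax_bar, ax_hist) = plt.subplots(1, 2)

# Manual route: np.histogram followed by Axes.bar with bin-sized widths
height, bins = np.histogram(y, bins=100, density=True)
width = np.diff(bins)
bars = ax_bar.bar(bins[:-1] + 0.5 * width, height, width)

# Direct route: Axes.hist
h_hist, bins_hist, patches = ax_hist.hist(y, bins=100, density=True)

# Compare the geometry of the Rectangle patches drawn by each route
same_heights = np.allclose([b.get_height() for b in bars], [p.get_height() for p in patches])
same_lefts = np.allclose([b.get_x() for b in bars], [p.get_x() for p in patches])
same_widths = np.allclose([b.get_width() for b in bars], [p.get_width() for p in patches])

print(same_heights, same_lefts, same_widths)
plt.close(fig)
```

If all three comparisons agree, the only thing the plain `plt.bar(bins[:-1], h)` call is missing is the width (and the half-bin offset).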

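Axes.bar, on the other hand, iteratively adds matplotlib.patches.Rectangle objects to the Axes using a default width of 0.8 (source implementation), so if one bar is particularly tall and the bars that follow it are shorter, the shorter ones are hidden behind the tall one.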
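
To get a feel for how much the default width overshoots here, the drawn rectangle width can be compared with the bin width. A minimal sketch, assuming the same kind of sample as in the question; the seed, the Agg backend, and the name `bins_per_rectangle` are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

y = np.random.default_rng(0).standard_normal(10000)  # seed chosen arbitrarily
h, bins = np.histogram(y, bins=1000, density=True)

fig, ax = plt.subplots()
bars = ax.bar(bins[:-1], h)              # no width given, so the default is used

default_width = bars[0].get_width()      # width of each drawn Rectangle, in data units
bin_width = float(np.diff(bins).mean())  # width of each histogram bin
bins_per_rectangle = default_width / bin_width

print(default_width, bin_width, bins_per_rectangle)
plt.close(fig)
```

Each rectangle spans many neighbouring bins, so what remains visible at any x is roughly the tallest of the overlapping bars rather than the bar belonging to that bin.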

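The code below (sort of) illustrates the points above. The histograms are identical; for example, the tallest bar is the same in each. Note that in the figure below the figsize is 12"x5" and each Axes is roughly 3" wide, so with the default dpi of 100 it can only show about 300 points horizontally, which means it cannot display all 1000 bars correctly. We need a suitably wide figure to display all the bars correctly.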
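
The pixel arithmetic in that paragraph can be checked roughly as follows. This is a sketch under stated assumptions: default subplot spacing, dpi fixed at 100, and a hypothetical helper `axes_width_px` that is not part of matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so the sketch runs without a display
import matplotlib.pyplot as plt

N_BARS = 1000

def axes_width_px(figsize, dpi=100):
    """Hypothetical helper: pixel width of one of three side-by-side Axes."""
    fig, axs = plt.subplots(1, 3, figsize=figsize, dpi=dpi)
    width = float(axs[0].get_window_extent().width)  # Axes bounding box in display pixels
    plt.close(fig)
    return width

narrow_px = axes_width_px((12, 5))
wide_px = axes_width_px((60, 5))

# Horizontal pixels available per bar in each layout
print(narrow_px / N_BARS, wide_px / N_BARS)
```

With the narrow figure there is less than one pixel per bar, whereas the wide one leaves more than one pixel per bar.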

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# sample
y = np.random.default_rng(0).standard_normal(10000)
N = 1000
h, bins = np.histogram(y, bins=N, density=True)
x = np.linspace(-5, 5, 100)
f = stats.norm.pdf(x)

# figure
fig, axs = plt.subplots(1, 3, figsize=(12,5))
a0 = axs[0].bar(bins[:-1], h)
a1 = axs[1].bar(bins[:-1]+0.5*np.diff(bins), h, np.diff(bins))
h_hist, bins_hist, a2 = axs[2].hist(y, bins=N, density=True)
for a, t in zip(axs, ['Axes.bar with default width', 'Axes.bar with width\nrelative to bins', 'Axes.hist']):
    a.plot(x, f, color="green")
    a.set_title(t)

# label tallest bar of each Axes
axs[0].bar_label(a0, [f'{h.max():.2f}' if b.get_height() == h.max() else '' for b in a0], fontsize=7)
axs[1].bar_label(a1, [f'{h.max():.2f}' if b.get_height() == h.max() else '' for b in a1], fontsize=7)
axs[2].bar_label(a2, [f'{h_hist.max():.2f}' if b.get_height() == h_hist.max() else '' for b in a2], fontsize=7)

# the bin edges and heights from np.histogram and Axes.hist are the same
(h == h_hist).all() and (bins == bins_hist).all()     # True

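For example, if we plot the same figure with figsize=(60,5) (everything else being the same), we get the following figure, in which the bars are displayed correctly (in particular, Axes.hist and the width-adjusted Axes.bar are identical).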