Percentile Distribution Graph

Question

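Does anyone know how to change the X axis scale and ticks to display a percentile distribution like the one in the graph below? This image is from MATLAB, but I want to generate it with Python (via Matplotlib or Seaborn).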

Graph of distribution where there is lots of change >99%

Following the pointer from @PaulH, I'm a lot closer now. This code:

import matplotlib
matplotlib.use('Agg')

import numpy as np
import matplotlib.pyplot as plt
import probscale
import seaborn as sns

clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
sns.set(style='ticks', context='notebook', palette="muted", rc=clear_bkgd)

fig, ax = plt.subplots(figsize=(8, 4))

x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)

ax.set_xlim(40, 99.5)
ax.set_xscale('prob')

ax.plot(x, y)
sns.despine(fig=fig)

produces the following graph (note the redistributed X axis):

Graph with non-linear x-axis

I find this much more useful than a standard linear scale:

Graph with normal x-axis

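I contacted the author of the original graph and they gave me some pointers. It is actually a log-scale plot in which the x axis is reversed, the plotted values are [100 - val], and the x axis ticks are labeled by hand. The code below recreates the original image using the same sample data as the other graphs here.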

import matplotlib
matplotlib.use('Agg')

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
sns.set(style='ticks', context='notebook', palette="muted", rc=clear_bkgd)

x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)

# Number of intervals to display.
# Later calculations add 2 to this number to pad it to align with the reversed axis
num_intervals = 3
x_values = 1.0 - 1.0/10**np.arange(0,num_intervals+2)

# Start with hard-coded lengths for 0,90,99
# Rest of array generated to display correct number of decimal places as precision increases
lengths = [1,2,2] + [int(v)+1 for v in list(np.arange(3,num_intervals+2))]

# Build the label string by trimming on the calculated lengths and appending %
labels = [str(100*v)[0:l] + "%" for v,l in zip(x_values, lengths)]


fig, ax = plt.subplots(figsize=(8, 4))

ax.set_xscale('log')
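ax.invert_xaxis()
# Pin the major ticks explicitly so each label lands on its own tick.
# The plotted value is 100 - percentile, so 0%, 90%, 99%, ... sit at 100, 10, 1, ...
ax.set_xticks(100.0 * (1.0 - x_values))
ax.set_xticklabels(labels)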

ax.plot([100.0 - v for v in x], y)

ax.grid(True, linewidth=0.5, zorder=5)
ax.grid(True, which='minor', linewidth=0.5, linestyle=':')

sns.despine(fig=fig)

plt.savefig("test.png", dpi=300, format='png')

This is the resulting graph:

- Chris Osterwood

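Have you written any code or made any attempt at this yourself? If so, please post it here. - Mad Physicist

I do not understand at all why this question was closed as "too broad". It lacks a good problem description, but the issue is clear from looking at the graph. If there is a way to produce this kind of graph, it is surely only a few lines of code, so answers would be neither too long nor too numerous. - ImportanceOfBeingErnest

@Chris Osterwood Please provide the Matlab command that produces this kind of graph, together with a clear problem description in text form, instead of only posting the picture. You can post them as comments so that more experienced users can incorporate them into the question. - ImportanceOfBeingErnest

I think you want to use my library: http://phobson.github.io/mpl-probscale/ - Paul H

@PaulH, that was close. I contacted the author of the original graph and they pointed me in the right direction (a log axis with the x axis reversed and the ticks labeled by hand). I have edited my question to include code showing how to do this in Python. - Chris Osterwood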

2 Answers

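This type of graph is popular in the low-latency community for plotting latency distributions. When dealing with latencies, most of the interesting information tends to be in the higher percentiles, so a logarithmic view tends to work better. I first saw these graphs used in jHiccup (https://github.com/giltene/jHiccup) and HdrHistogram (https://github.com/HdrHistogram/).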
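To make the point about the logarithmic view concrete, here is a small worked sketch. It maps a few percentiles to 1 / (1 - p) and takes the base-10 logarithm, which is the horizontal position on a log axis. The percentile list is just an illustrative choice, not data from the post.

```python
import math

# Illustrative percentiles: each one adds another "nine".
percentiles = [0.0, 90.0, 99.0, 99.9, 99.99]

# Position on a log10 axis after the 1 / (1 - p) transform.
positions = []
for pct in percentiles:
    p = pct / 100.0
    positions.append(round(math.log10(1.0 / (1.0 - p)), 6))

for pct, pos in zip(percentiles, positions):
    print(f"{pct:>6}% -> log10 position {pos}")
```

Each additional nine moves one unit to the right, so the range from 99% to 99.99% gets as much width as the range from 0% to 99%, whereas on a linear percentile axis it would occupy less than 1% of the width.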

The graph referenced in the question was generated by the following code.

n = ceil(log10(length(values)));          
p = 1 - 1./10.^(0:0.01:n);
percentiles = prctile(values, p * 100);
semilogx(1./(1-p), percentiles);
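For readers who want the same computation in Python, the following is a rough standard-library port of the MATLAB lines above, not the answer author's code. The function name `percentile_curve` and the `steps_per_decade` parameter are made up for this sketch, and it uses a simple nearest-rank percentile, which may differ slightly from the interpolation MATLAB's `prctile` applies. It assumes a non-empty input.

```python
import math


def percentile_curve(values, steps_per_decade=100):
    """Return (xs, ys) points for a log-scaled percentile plot.

    xs holds 1 / (1 - p), so 90% -> 10, 99% -> 100, 99.9% -> 1000, ...
    ys holds the value at percentile p (nearest-rank; an assumption of this sketch).
    """
    ordered = sorted(values)
    count = len(ordered)
    # Enough decades to resolve the sample count, mirroring ceil(log10(length(values))).
    decades = math.ceil(math.log10(count))
    xs, ys = [], []
    for step in range(decades * steps_per_decade + 1):
        exponent = step / steps_per_decade
        p = 1.0 - 1.0 / 10.0 ** exponent  # fraction in [0, 1)
        # Nearest rank: smallest value with at least p * count samples at or below it.
        rank = max(1, math.ceil(p * count))
        xs.append(1.0 / (1.0 - p))
        ys.append(ordered[rank - 1])
    return xs, ys


# Synthetic example data (not from the original post): the integers 1..1000.
xs, ys = percentile_curve(range(1, 1001))
```

Passing `xs` and `ys` to Matplotlib's `ax.semilogx` should then give the same kind of curve as the MATLAB `semilogx` call.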

The code below labels the x axis:

labels = cell(n+1, 1);
for i = 1:n+1
  labels{i} = getPercentileLabel(i-1);
end
set(gca, 'XTick', 10.^(0:n));
set(gca, 'XTickLabel', labels);

% {'0%' '90%' '99%' '99.9%' '99.99%' '99.999%' '99.9999%' '99.99999%'}
function label = getPercentileLabel(i)
    switch(i)
        case 0
            label = '0%';
        case 1
            label = '90%';
        case 2
            label = '99%';
        otherwise
            label = '99.';
            for k = 1:i-2
                label = [label '9'];
            end
            label = [label '%'];
    end
end
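A Python equivalent of the labeling helper might look like the sketch below. The name `percentile_label` and the example value of `n` are assumptions made for illustration; only the label pattern comes from the MATLAB code above.

```python
def percentile_label(i):
    """Label for the tick at 10**i: 0%, 90%, 99%, 99.9%, 99.99%, ..."""
    if i == 0:
        return "0%"
    if i == 1:
        return "90%"
    # Two leading nines, then one more nine after the decimal point per extra decade.
    decimals = "." + "9" * (i - 2) if i > 2 else ""
    return "99" + decimals + "%"


n = 5  # number of decades to label (example value, an assumption)
tick_positions = [10 ** i for i in range(n + 1)]
tick_labels = [percentile_label(i) for i in range(n + 1)]
```

On a log-scaled Matplotlib axis these could be applied with `ax.set_xticks(tick_positions)` followed by `ax.set_xticklabels(tick_labels)`.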

- Florian Enner

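Florian - thank you for posting the MATLAB code; I'm sure it will be useful to someone in the future. I agree that this scale makes data with a long-tailed distribution much easier to understand. - Chris Osterwood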

- W1M0R · Accepted Answer

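The following Python code uses Pandas to read a csv file of recorded latency values (the code treats them as seconds), records those values in microseconds in an HdrHistogram, and saves the HdrHistogram to an hgrm file. Seaborn then uses that file to plot the latency distribution graph.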

import pandas as pd
from hdrh.histogram import HdrHistogram
from hdrh.dump import dump
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import sys
import argparse

# Parse the command line arguments.

parser = argparse.ArgumentParser()
parser.add_argument('csv_file')
parser.add_argument('hgrm_file')
parser.add_argument('png_file')
args = parser.parse_args()

csv_file = args.csv_file
hgrm_file = args.hgrm_file
png_file = args.png_file

# Read the csv file into a Pandas data frame and generate an hgrm file.

csv_df = pd.read_csv(csv_file, index_col=False)

USECS_PER_SEC = 1000000
MIN_LATENCY_USECS = 1
MAX_LATENCY_USECS = 24 * 60 * 60 * USECS_PER_SEC # 24 hours
# MAX_LATENCY_USECS = int(csv_df['response-time'].max()) * USECS_PER_SEC # largest observed latency
LATENCY_SIGNIFICANT_DIGITS = 5
histogram = HdrHistogram(MIN_LATENCY_USECS, MAX_LATENCY_USECS, LATENCY_SIGNIFICANT_DIGITS)
for latency_sec in csv_df['response-time'].tolist():
    # Convert to a whole number of microseconds before recording.
    histogram.record_value(int(round(latency_sec * USECS_PER_SEC)))
    # histogram.record_corrected_value(int(round(latency_sec * USECS_PER_SEC)), 10)
TICKS_PER_HALF_DISTANCE = 5
# Close the hgrm file before it is read back, so all output is flushed to disk.
with open(hgrm_file, 'wb') as hgrm_out:
    histogram.output_percentile_distribution(hgrm_out, USECS_PER_SEC, TICKS_PER_HALF_DISTANCE)

# Read the generated hgrm file into a Pandas data frame.

hgrm_df = pd.read_csv(hgrm_file, comment='#', skip_blank_lines=True, sep=r"\s+", engine='python', header=0, names=['Latency', 'Percentile'], usecols=[0, 3])

# Plot the latency distribution using Seaborn and save it as a png file.

sns.set_theme()
sns.set_style("dark")
sns.set_context("paper")
sns.set_color_codes("pastel")

fig, ax = plt.subplots(1,1,figsize=(20,15))
fig.suptitle('Latency Results')

sns.lineplot(x='Percentile', y='Latency', data=hgrm_df, ax=ax)
ax.set_title('Latency Distribution')
ax.set_xlabel('Percentile (%)')
ax.set_ylabel('Latency (seconds)')
ax.set_xscale('log')
ax.set_xticks([1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
ax.set_xticklabels(['0', '90', '99', '99.9', '99.99', '99.999', '99.9999', '99.99999'])

fig.tight_layout()
fig.savefig(png_file)