Percentile Distribution Plot

6

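Does anyone know how to change the x-axis scale and ticks to display a percentile distribution like the one in the graph below? This image comes from MATLAB, but I want to generate it with Python (via Matplotlib or Seaborn).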

Graph of distribution where there is lots of change >99%

Following the pointer from @paulh, I am now much closer. This code:

import matplotlib
matplotlib.use('Agg')

import numpy as np
import matplotlib.pyplot as plt
import probscale
import seaborn as sns

clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
sns.set(style='ticks', context='notebook', palette="muted", rc=clear_bkgd)

fig, ax = plt.subplots(figsize=(8, 4))

x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)

ax.set_xlim(40, 99.5)
ax.set_xscale('prob')

ax.plot(x, y)
sns.despine(fig=fig)

produces the following plot (note the redistributed x-axis):

Graph with non-linear x-axis

I find this more useful than the standard linear scale:

Graph with normal x-axis

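I contacted the author of the original chart and they gave me some pointers. It is actually a log-scale plot with the x-axis reversed and the values plotted as [100 - val], with the x-axis ticks labeled by hand. The code below recreates the original image using the same sample data as the other plots here.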
import matplotlib
matplotlib.use('Agg')

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

clear_bkgd = {'axes.facecolor':'none', 'figure.facecolor':'none'}
sns.set(style='ticks', context='notebook', palette="muted", rc=clear_bkgd)

x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)

# Number of intervals to display.
# Later calculations add 2 to this number to pad it to align with the reversed axis
num_intervals = 3
x_values = 1.0 - 1.0/10**np.arange(0,num_intervals+2)

# Start with hard-coded lengths for 0,90,99
# Rest of array generated to display correct number of decimal places as precision increases
lengths = [1,2,2] + [int(v)+1 for v in list(np.arange(3,num_intervals+2))]

# Build the label string by trimming on the calculated lengths and appending %
labels = [str(100*v)[0:l] + "%" for v,l in zip(x_values, lengths)]


fig, ax = plt.subplots(figsize=(8, 4))

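ax.set_xscale('log')
ax.invert_xaxis()

ax.plot([100.0 - v for v in x], y)

# Pin each label to its tick position (in 100 - percentile units);
# set_ticklabels alone relies on whatever ticks the locator happens to pick
xlim = ax.get_xlim()
ax.set_xticks(100.0 * (1.0 - x_values))
ax.set_xticklabels(labels)
ax.set_xlim(xlim)  # set_xticks may widen the view; restore the data-driven limits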

ax.grid(True, linewidth=0.5, zorder=5)
ax.grid(True, which='minor', linewidth=0.5, linestyle=':')

sns.despine(fig=fig)

plt.savefig("test.png", dpi=300, format='png')

Here is the resulting plot: Graph with "inverse log scale"
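
As a hedged alternative to the hand-built label list, the labels can be computed from the tick positions themselves, so they stay correct whichever ticks Matplotlib picks. This is a minimal sketch assuming the same [100 - val] transform described above; `tail_to_percentile_label` is a hypothetical helper name, not part of Matplotlib.

```python
import matplotlib
matplotlib.use('Agg')

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def tail_to_percentile_label(tail, pos=None):
    """Turn an axis position (100 - percentile) into a label such as '99.9%'."""
    percentile = 100.0 - tail
    # The 'g' format drops trailing zeros, so 99.0 becomes '99' and 99.9 stays '99.9'
    return f"{percentile:.10g}%"


x = [30, 60, 80, 90, 95, 97, 98, 98.5, 98.9, 99.1, 99.2, 99.3, 99.4]
y = np.arange(0, 12.1, 1)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot([100.0 - v for v in x], y)
ax.set_xscale('log')
ax.invert_xaxis()
# Matplotlib calls the formatter with (tick_value, tick_index) for each major tick
ax.xaxis.set_major_formatter(FuncFormatter(tail_to_percentile_label))
```

Calling `fig.savefig(...)` afterwards writes the image. The trade-off is that the tick positions are left to the default log locator, so the number of labeled decades follows the data range instead of `num_intervals`.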

3
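Have you written any code or made any attempt at this? If so, please post it here. - Mad Physicist
I completely fail to see why this question was closed as "too broad". It lacks a good problem description, but the problem is obvious from looking at the graph. If there is a way to produce this type of plot, it surely takes only a few lines of code, so an answer would not be too long, nor would there be too many possible answers. - ImportanceOfBeingErnest
@Chris Osterwood Please provide the Matlab command that produces such a graph, and a clear problem description as text rather than only posting a picture. You can post them as comments so that more experienced users can incorporate them into the question. - ImportanceOfBeingErnest
I think you want to use my library: http://phobson.github.io/mpl-probscale/ - Paul H
@PaulH, that was very close. I contacted the author of the original chart and they pointed me in the right direction (log axis, reversed x-axis, manually labeled ticks). I have edited my question with code showing how to do this in Python. - Chris Osterwood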
2 Answers

3
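The following Python code uses Pandas to read a csv file containing recorded latency values (described here as milliseconds, although the code treats each value as seconds), records those latency values (in microseconds) in an HdrHistogram, and saves the HdrHistogram to an hgrm file, which is then read back and plotted with Seaborn as a latency distribution.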
import pandas as pd
from hdrh.histogram import HdrHistogram
from hdrh.dump import dump
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import sys
import argparse

# Parse the command line arguments.

parser = argparse.ArgumentParser()
parser.add_argument('csv_file')
parser.add_argument('hgrm_file')
parser.add_argument('png_file')
args = parser.parse_args()

csv_file = args.csv_file
hgrm_file = args.hgrm_file
png_file = args.png_file

# Read the csv file into a Pandas data frame and generate an hgrm file.

csv_df = pd.read_csv(csv_file, index_col=False)

USECS_PER_SEC=1000000
MIN_LATENCY_USECS = 1
MAX_LATENCY_USECS = 24 * 60 * 60 * USECS_PER_SEC # 24 hours
# MAX_LATENCY_USECS = int(csv_df['response-time'].max()) * USECS_PER_SEC # 1 hour
LATENCY_SIGNIFICANT_DIGITS = 5
histogram = HdrHistogram(MIN_LATENCY_USECS, MAX_LATENCY_USECS, LATENCY_SIGNIFICANT_DIGITS)
for latency_sec in csv_df['response-time'].tolist():
    histogram.record_value(latency_sec*USECS_PER_SEC)
    # histogram.record_corrected_value(latency_sec*USECS_PER_SEC, 10)
TICKS_PER_HALF_DISTANCE=5
# Close the file before it is read back below
with open(hgrm_file, 'wb') as hgrm_out:
    histogram.output_percentile_distribution(hgrm_out, USECS_PER_SEC, TICKS_PER_HALF_DISTANCE)

# Read the generated hgrm file into a Pandas data frame.

hgrm_df = pd.read_csv(hgrm_file, comment='#', skip_blank_lines=True, sep=r"\s+", engine='python', header=0, names=['Latency', 'Percentile'], usecols=[0, 3])

# Plot the latency distribution using Seaborn and save it as a png file.

sns.set_theme()
sns.set_style("dark")
sns.set_context("paper")
sns.set_color_codes("pastel")

fig, ax = plt.subplots(1,1,figsize=(20,15))
fig.suptitle('Latency Results')

sns.lineplot(x='Percentile', y='Latency', data=hgrm_df, ax=ax)
ax.set_title('Latency Distribution')
ax.set_xlabel('Percentile (%)')
ax.set_ylabel('Latency (seconds)')
ax.set_xscale('log')
ax.set_xticks([1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
ax.set_xticklabels(['0', '90', '99', '99.9', '99.99', '99.999', '99.9999', '99.99999'])

fig.tight_layout()
fig.savefig(png_file)
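
For readers without an hgrm file at hand, roughly the same curve can be sketched straight from raw samples with NumPy. This is a stand-in under stated assumptions, not the HdrHistogram method used above: the latencies are synthetic (lognormal), and `percentile_curve` is a hypothetical helper.

```python
import matplotlib
matplotlib.use('Agg')

import numpy as np
import matplotlib.pyplot as plt


def percentile_curve(samples, points_per_decade=20):
    """Return (1/(1-p), latency at percentile p), with p evenly spaced in 'number of nines'."""
    samples = np.asarray(samples, dtype=float)
    # How many nines the sample count can resolve, e.g. 10_000 samples -> 4
    decades = int(np.floor(np.log10(samples.size)))
    exponents = np.linspace(0, decades, decades * points_per_decade + 1)
    inverse_tail = 10.0 ** exponents       # 1, ..., 10, ..., 100, ...
    p = 1.0 - 1.0 / inverse_tail           # 0, ..., 0.9, ..., 0.99, ...
    return inverse_tail, np.percentile(samples, 100.0 * p)


# Synthetic latencies in seconds (assumption: stands in for the csv data)
rng = np.random.default_rng(0)
latencies = rng.lognormal(mean=-3.0, sigma=0.5, size=10_000)

inverse_tail, latency = percentile_curve(latencies)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(inverse_tail, latency)
ax.set_xscale('log')
# Same tick scheme as above: position 10**k is labeled with k nines
ax.set_xticks(10.0 ** np.arange(0, 5))
ax.set_xticklabels(['0', '90', '99', '99.9', '99.99'])
ax.set_xlabel('Percentile (%)')
ax.set_ylabel('Latency (seconds)')
```

The x-axis here plays the same role as the fourth column of the hgrm file, which is why ticks at powers of ten carry the labels 0, 90, 99, and so on.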

1
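This type of chart is popular in the low-latency community for plotting latency distributions. When dealing with latency, most of the interesting information tends to be in the higher percentiles, so a logarithmic view tends to work better. I first saw these charts used in https://github.com/giltene/jHiccup and https://github.com/HdrHistogram/.
The chart referenced in the question was generated by the following code.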
n = ceil(log10(length(values)));          
p = 1 - 1./10.^(0:0.01:n);
percentiles = prctile(values, p * 100);
semilogx(1./(1-p), percentiles);

The following code labels the x-axis:

labels = cell(n+1, 1);
for i = 1:n+1
  labels{i} = getPercentileLabel(i-1);
end
set(gca, 'XTick', 10.^(0:n));
set(gca, 'XTickLabel', labels);

% {'0%' '90%' '99%' '99.9%' '99.99%' '99.999%' '99.9999%' '99.99999%'}
function label = getPercentileLabel(i)
    switch(i)
        case 0
            label = '0%';
        case 1
            label = '90%';
        case 2
            label = '99%';
        otherwise
            label = '99.';
            for k = 1:i-2
                label = [label '9'];
            end
            label = [label '%'];
    end
end
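
Since the question asks for Python, here is a rough port of the MATLAB snippets above. It assumes NumPy's `np.percentile` as a stand-in for `prctile` (their interpolation rules may differ slightly, so values need not match exactly), uses synthetic exponential data, and the function names are hypothetical.

```python
import numpy as np


def percentile_label(i):
    """Label for the tick at 10**i: '0%', '90%', '99%', '99.9%', ..."""
    if i == 0:
        return '0%'
    if i == 1:
        return '90%'
    # Two nines before the decimal point, i - 2 nines after it
    decimals = '.' + '9' * (i - 2) if i > 2 else ''
    return '99' + decimals + '%'


def percentile_axis(values, step=0.01):
    """Mirror of the MATLAB code: returns x = 1/(1-p), the percentiles, and n."""
    values = np.asarray(values, dtype=float)
    n = int(np.ceil(np.log10(values.size)))
    exponents = np.linspace(0, n, int(round(n / step)) + 1)  # like 0:0.01:n
    p = 1.0 - 1.0 / 10.0 ** exponents
    x = 10.0 ** exponents                                    # equals 1./(1-p)
    return x, np.percentile(values, 100.0 * p), n


# Synthetic data (assumption: any 1-D array of measurements works here)
values = np.random.default_rng(1).exponential(scale=1.0, size=500)
x, percentiles, n = percentile_axis(values)

ticks = 10.0 ** np.arange(0, n + 1)
labels = [percentile_label(i) for i in range(n + 1)]
print(labels)
```

With Matplotlib, `ax.semilogx(x, percentiles)` followed by `ax.set_xticks(ticks)` and `ax.set_xticklabels(labels)` would then correspond to the `semilogx` and `set(gca, ...)` calls.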

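Florian - thanks for posting the MATLAB code; I am sure it will be useful to someone in the future. I agree that this scale is much easier to read for data whose distribution has a "high tail". - Chris Osterwood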
