Why is Bokeh so much slower than Matplotlib?

7

I plotted a box plot in both Bokeh and matplotlib. For the same data, Bokeh took roughly 100 times longer to plot. Why does Bokeh take so long? Here is the code I ran in a Jupyter notebook:

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib as mpl

from bokeh.charts import BoxPlot, output_notebook, show

from time import time

%matplotlib inline


# Generate data
N = 100000
x1 = 2 + np.random.randn(N)
y1 = ['a'] * N

x2 = -2 + np.random.randn(N)
y2 = ['b'] * N

X = list(x1) + list(x2)
Y = y1 + y2

data = pd.DataFrame()
data['Vals'] = X
data['Class'] = Y

df = data.apply(np.random.permutation)


# Time the bokeh plot
start_time = time()

p = BoxPlot(data, values='Vals', label='Class',
            title="MPG Summary (grouped by CYL, ORIGIN)")
output_notebook()
show(p)

end_time = time()
print("Total time taken for Bokeh is {0}".format(end_time - start_time))


# time the matplotlib plot
start_time = time()

data.boxplot(column='Vals', by='Class', sym = 'o')

end_time = time()
print("Total time taken for matplotlib is {0}".format(end_time - start_time))

The print statements produce the following output:

Total time taken for Bokeh is 11.8056321144104

Total time taken for matplotlib is 0.1586170196533203
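
The single timer above covers chart construction, `output_notebook()`, and `show()` together, so it does not say which stage dominates. A minimal sketch of timing each stage on its own follows; the helper name `timed` and the stand-in workloads are hypothetical and only mark where the Bokeh calls would go:

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        results[label] = perf_counter() - start


timings = {}

with timed("build", timings):
    # Stand-in workload; in the notebook this would be the BoxPlot(...) call.
    squares = [i * i for i in range(100000)]

with timed("render", timings):
    # Stand-in workload; in the notebook this would be show(p).
    total = sum(squares)

for label, seconds in timings.items():
    print("{0}: {1:.4f} s".format(label, seconds))
```

If most of the time lands in the "build" stage, the cost is in the Python-side chart construction rather than in the notebook or the browser.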


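Maybe this has to do with the Jupyter Notebook rather than with the libraries themselves? - Moritz
From a quick look at the source, it seems Bokeh is written purely in Python? matplotlib is built on top of numpy, which is significantly faster. - roganjosh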
1
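From http://bokeh.pydata.org/en/latest/: "with high-performance interactivity over very large or streaming datasets." Since Matplotlib copes with large datasets only to a limited extent, I would expect Bokeh to perform at least as well. - Moritz
@Moritz But they also do different things. From the same link: "targets modern web browsers for presentation". So it may be high-performance in that domain, but it has to deal with all the overhead of drawing something suitable for a web browser and the kinds of interaction you expect there. I don't know at what point it starts relying on JavaScript. - roganjosh
1 Answer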

8

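There is an issue specific to bokeh.charts.BoxPlot. Unfortunately, bokeh.charts currently has no maintainer, so I can't say when it might be fixed or improved.

However, in case it is useful to you, below I demonstrate that you can do things "by hand" with the mature and stable bokeh.plotting API, and then the time is comparable to, if not faster than, MPL: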

from time import time

import pandas as pd
import numpy as np

from bokeh.io import output_notebook, show
from bokeh.plotting import figure

output_notebook()

# Generate data
N = 100000
x1 = 2 + np.random.randn(N)
y1 = ['a'] * N

x2 = -2 + np.random.randn(N)
y2 = ['b'] * N

X = list(x1) + list(x2)
Y = y1 + y2

df = pd.DataFrame()
df['Vals'] = X
df['Class'] = Y

# Time the bokeh plot
start_time = time()

# find the quartiles and IQR for each category
groups = df.groupby('Class')
q1 = groups.quantile(q=0.25)
q2 = groups.quantile(q=0.5)
q3 = groups.quantile(q=0.75)
iqr = q3 - q1
upper = q3 + 1.5*iqr
lower = q1 - 1.5*iqr

cats = ['a', 'b']

p = figure(x_range=cats)

# if no outliers, shrink lengths of stems to be no longer than the minimums or maximums
qmin = groups.quantile(q=0.00)
qmax = groups.quantile(q=1.00)
# clamp the 'Vals' column itself (this frame has no 'score' column, so
# assigning to upper.score / lower.score would leave the stems unchanged)
upper['Vals'] = np.minimum(upper['Vals'], qmax['Vals'])
lower['Vals'] = np.maximum(lower['Vals'], qmin['Vals'])

# stems
p.segment(cats, upper.Vals, cats, q3.Vals, line_color="black")
p.segment(cats, lower.Vals, cats, q1.Vals, line_color="black")

# boxes
p.vbar(cats, 0.7, q2.Vals, q3.Vals, fill_color="#E08E79", line_color="black")
p.vbar(cats, 0.7, q1.Vals, q2.Vals, fill_color="#3B8686", line_color="black")

# whiskers (almost-0 height rects simpler than segments)
p.rect(cats, lower.Vals, 0.2, 0.01, line_color="black")
p.rect(cats, upper.Vals, 0.2, 0.01, line_color="black")

p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = "white"
p.grid.grid_line_width = 2
p.xaxis.major_label_text_font_size="12pt"

show(p)

end_time = time()
print("Total time taken for Bokeh is {0}".format(end_time - start_time))

It's a fair amount of code, but it would be easy to wrap in a reusable function. For me, the code above resulted in:

[image]
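
One possible way to wrap the steps above into reusable functions is sketched below. The names `box_stats` and `boxplot_figure` are hypothetical, not part of Bokeh. The sketch assumes a DataFrame with one numeric column and one category column, as in the question, and uses the same 1.5 * IQR whisker rule.

```python
import numpy as np
import pandas as pd


def box_stats(df, values, label):
    """Return per-category quartiles and whisker ends (1.5 * IQR rule)."""
    groups = df.groupby(label)[values]
    stats = pd.DataFrame({
        "q1": groups.quantile(0.25),
        "q2": groups.quantile(0.50),
        "q3": groups.quantile(0.75),
    })
    iqr = stats["q3"] - stats["q1"]
    # Clamp the whiskers so they never extend past the observed min/max.
    stats["upper"] = np.minimum(stats["q3"] + 1.5 * iqr, groups.max())
    stats["lower"] = np.maximum(stats["q1"] - 1.5 * iqr, groups.min())
    return stats


def boxplot_figure(df, values, label, **figure_kwargs):
    """Build a Bokeh figure with one box per category (no outlier markers)."""
    # Imported lazily so box_stats() stays usable without Bokeh installed.
    from bokeh.plotting import figure

    stats = box_stats(df, values, label)
    cats = [str(c) for c in stats.index]
    p = figure(x_range=cats, **figure_kwargs)

    # stems
    p.segment(x0=cats, y0=stats["upper"], x1=cats, y1=stats["q3"], line_color="black")
    p.segment(x0=cats, y0=stats["lower"], x1=cats, y1=stats["q1"], line_color="black")

    # boxes
    p.vbar(x=cats, width=0.7, top=stats["q3"], bottom=stats["q2"],
           fill_color="#E08E79", line_color="black")
    p.vbar(x=cats, width=0.7, top=stats["q2"], bottom=stats["q1"],
           fill_color="#3B8686", line_color="black")

    # whisker caps (almost-0 height rects)
    p.rect(x=cats, y=stats["lower"], width=0.2, height=0.01, line_color="black")
    p.rect(x=cats, y=stats["upper"], width=0.2, height=0.01, line_color="black")
    return p
```

With the DataFrame from the code above, usage would look like `show(boxplot_figure(df, 'Vals', 'Class'))`. Keeping the statistics in a separate function means only a handful of numbers per category are handed to Bokeh, rather than the raw 200,000 rows.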


OK, that makes my comments under the question irrelevant. +1. - roganjosh
