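A related question and its answers got me thinking about how to efficiently evaluate a single mathematical expression (roughly along the lines of this answer: https://dev59.com/K3RB5IYBdhLWcg3wiHv7#594294), given by a (more or less trusted) user, for 20k to 30k input values coming from a database. I implemented a quick and dirty benchmark so that I could compare different solutions.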
# Runs with Python 3(.4)
import pprint
import time
# This is what I have
userinput_function = '5*(1-(x*0.1))' # String - numbers should be handled as floats
demo_len = 20000 # Parameter for benchmark (20k to 30k in real life)
print_results = False
# Some database, represented by an array of dicts (simplified for this example)
database_xy = []
for a in range(1, demo_len, 1):
    database_xy.append({
        'x': float(a),
        'y_eval': 0,
        'y_sympya': 0,
        'y_sympyb': 0,
        'y_sympyc': 0,
        'y_aevala': 0,
        'y_aevalb': 0,
        'y_aevalc': 0,
        'y_numexpr': 0,
        'y_simpleeval': 0
        })
# Solution #1: eval [yes, totally unsafe]
time_start = time.time()
func = eval("lambda x: " + userinput_function)
for item in database_xy:
    item['y_eval'] = func(item['x'])
time_end = time.time()
if print_results:
    pprint.pprint(database_xy)
print('1 eval: ' + str(round(time_end - time_start, 4)) + ' seconds')
# Solution #2a: sympy - evalf (http://www.sympy.org)
import sympy
time_start = time.time()
x = sympy.symbols('x')
sympy_function = sympy.sympify(userinput_function)
for item in database_xy:
    item['y_sympya'] = float(sympy_function.evalf(subs={x:item['x']}))
time_end = time.time()
if print_results:
    pprint.pprint(database_xy)
print('2a sympy: ' + str(round(time_end - time_start, 4)) + ' seconds')
# Solution #2b: sympy - lambdify (http://www.sympy.org)
# Note: reuses the sympy symbol x defined in solution 2a
from sympy.utilities.lambdify import lambdify
import sympy
import numpy
time_start = time.time()
sympy_functionb = sympy.sympify(userinput_function)
func = lambdify(x, sympy_functionb, 'numpy') # returns a numpy-ready function
xx = numpy.zeros(len(database_xy))
for index, item in enumerate(database_xy):
    xx[index] = item['x']
yy = func(xx)
for index, item in enumerate(database_xy):
    item['y_sympyb'] = yy[index]
time_end = time.time()
if print_results:
    pprint.pprint(database_xy)
print('2b sympy: ' + str(round(time_end - time_start, 4)) + ' seconds')
# Solution #2c: sympy - lambdify with numexpr [and numpy] (http://www.sympy.org)
# Note: reuses the sympy symbol x defined in solution 2a
from sympy.utilities.lambdify import lambdify
import sympy
import numpy
import numexpr
time_start = time.time()
sympy_functionb = sympy.sympify(userinput_function)
func = lambdify(x, sympy_functionb, 'numexpr') # returns a numpy-ready function
xx = numpy.zeros(len(database_xy))
for index, item in enumerate(database_xy):
    xx[index] = item['x']
yy = func(xx)
for index, item in enumerate(database_xy):
    item['y_sympyc'] = yy[index]
time_end = time.time()
if print_results:
    pprint.pprint(database_xy)
print('2c sympy: ' + str(round(time_end - time_start, 4)) + ' seconds')
# Solution #3a: asteval [based on ast] - with string magic (http://newville.github.io/asteval/index.html)
from asteval import Interpreter
aevala = Interpreter()
time_start = time.time()
aevala('def func(x):\n\treturn ' + userinput_function)
for item in database_xy:
    item['y_aevala'] = aevala('func(' + str(item['x']) + ')')
time_end = time.time()
if print_results:
    pprint.pprint(database_xy)
print('3a aeval: ' + str(round(time_end - time_start, 4)) + ' seconds')
# Solution #3b (M Newville): asteval [based on ast] - parse & run (http://newville.github.io/asteval/index.html)
from asteval import Interpreter
aevalb = Interpreter()
time_start = time.time()
exprb = aevalb.parse(userinput_function)
for item in database_xy:
    aevalb.symtable['x'] = item['x']
    item['y_aevalb'] = aevalb.run(exprb)
time_end = time.time()
print('3b aeval: ' + str(round(time_end - time_start, 4)) + ' seconds')
# Solution #3c (M Newville): asteval [based on ast] - parse & run with numpy (http://newville.github.io/asteval/index.html)
from asteval import Interpreter
import numpy
aevalc = Interpreter()
time_start = time.time()
exprc = aevalc.parse(userinput_function)
x = numpy.array([item['x'] for item in database_xy])
aevalc.symtable['x'] = x
y = aevalc.run(exprc)
for index, item in enumerate(database_xy):
    item['y_aevalc'] = y[index]
time_end = time.time()
print('3c aeval: ' + str(round(time_end - time_start, 4)) + ' seconds')
# Solution #4: simpleeval [based on ast] (https://github.com/danthedeckie/simpleeval)
from simpleeval import simple_eval
time_start = time.time()
for item in database_xy:
    item['y_simpleeval'] = simple_eval(userinput_function, names={'x': item['x']})
time_end = time.time()
if print_results:
    pprint.pprint(database_xy)
print('4 simpleeval: ' + str(round(time_end - time_start, 4)) + ' seconds')
# Solution #5: numexpr [and numpy] (https://github.com/pydata/numexpr)
import numpy
import numexpr
time_start = time.time()
x = numpy.zeros(len(database_xy))
for index, item in enumerate(database_xy):
    x[index] = item['x']
y = numexpr.evaluate(userinput_function) # picks up x from the calling scope
for index, item in enumerate(database_xy):
    item['y_numexpr'] = y[index]
time_end = time.time()
if print_results:
    pprint.pprint(database_xy)
print('5 numexpr: ' + str(round(time_end - time_start, 4)) + ' seconds')
On my old test machine (Python 3.4, Linux 3.11 x86_64, two cores, 1.8 GHz), I get the following results:
1 eval: 0.0185 seconds
2a sympy: 10.671 seconds
2b sympy: 0.0315 seconds
2c sympy: 0.0348 seconds
3a aeval: 2.8368 seconds
3b aeval: 0.5827 seconds
3c aeval: 0.0246 seconds
4 simpleeval: 1.2363 seconds
5 numexpr: 0.0312 seconds
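What stands out is the incredible speed of eval, though I do not want to use it in real life. The second-best solution appears to be numexpr, which depends on numpy, a dependency I would like to avoid, although this is not a hard requirement. The next best option is simpleeval, which is built around ast. aeval, another ast-based solution, suffers from the fact that I have to convert every single float input value into a string first, and I could not find a way around that. sympy was initially my favorite because it offers the most flexible and apparently safest solution, but it ended up last, an impressive distance behind the second-to-last solution.
Update 1: There is a much faster approach using sympy. See solution 2b. It is almost as good as numexpr, though I am not sure whether sympy is actually using it internally.
Update 2: The sympy implementations now use sympify instead of simplify (as recommended by its lead developer, asmeurer; thanks). It does not use numexpr unless explicitly asked to do so (see solution 2c). I also added two significantly faster solutions based on asteval (thanks to M Newville).
What options do I have to speed up any of the relatively safer solutions even further? For instance, are there other safe (or safe-ish) approaches that use ast directly?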
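To make the "use ast directly" idea concrete, here is a minimal sketch of one possible approach: parse the expression once, reject every syntax node that is not on a small whitelist, then compile the validated tree and call the resulting code object for each value. This is only an illustration under several assumptions, not a vetted sandbox: it targets Python 3.8+ (where literals parse as `ast.Constant`), the helper name `compile_expression` and its `max_length` parameter are hypothetical, and the whitelist deliberately omits `**` and function calls.

```python
import ast

# Node types the validator accepts; everything else is rejected.
# (Assumption: plain arithmetic on one variable is all that is needed.)
_ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant, ast.Name, ast.Load,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub, ast.UAdd,
)


def compile_expression(source, variable="x", max_length=200):
    """Return a callable f(value) for a whitelisted arithmetic expression."""
    if len(source) > max_length:
        raise ValueError("expression too long")
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED_NODES):
            raise ValueError("disallowed syntax: " + type(node).__name__)
        if isinstance(node, ast.Name) and node.id != variable:
            raise ValueError("unknown name: " + node.id)
        if isinstance(node, ast.Constant):
            # Only plain numbers; strings, bytes, None and booleans are refused.
            if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
                raise ValueError("disallowed constant")
    # The tree is validated, so compile it once and reuse the code object.
    code = compile(tree, "<user expression>", "eval")

    def func(value):
        return eval(code, {"__builtins__": {}}, {variable: float(value)})

    return func


func = compile_expression('5*(1-(x*0.1))')
results = [func(a) for a in range(1, 6)]
```

Because parsing and validation happen only once, the per-item cost should be close to that of the plain eval solution, although I have not benchmarked this sketch against the numbers above.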
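eval/compile? (That does not prevent denial-of-service attacks, though.) - Ry-
lambdify will not use numexpr unless you set modules='numexpr'. - asmeurer
The sympify() function is almost as unsafe as eval(). Could you add a note about that? - user
|a-b| for abs(a-b), as well as deferred-evaluation expressions that correspond to your pre-compiled expressions. plusminus has an online, internet-facing demo that you can try. - PaulMcG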
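On the denial-of-service caveat raised in the comments: one cheap mitigation, if the power operator is allowed at all, is to inspect the parsed tree and refuse any `**` whose exponent is not a small numeric literal, so that inputs like `9**9**9` are rejected before evaluation. The sketch below is an assumption-laden illustration (Python 3.8+, hypothetical helper name `reject_expensive_powers`, arbitrary default limit) and covers only this one class of expensive expression.

```python
import ast


def reject_expensive_powers(source, max_exponent=16):
    """Parse source and raise ValueError on any ** with a risky exponent.

    An exponent is accepted only if it is a non-negative-looking numeric
    literal no larger than max_exponent. Negative exponents parse as a unary
    minus applied to a literal, so this sketch rejects them as well.
    """
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Pow):
            exponent = node.right
            is_small_literal = (
                isinstance(exponent, ast.Constant)
                and isinstance(exponent.value, (int, float))
                and not isinstance(exponent.value, bool)
                and exponent.value <= max_exponent
            )
            if not is_small_literal:
                raise ValueError("exponent must be a literal <= %d" % max_exponent)
    return tree


tree = reject_expensive_powers('5*(1-(x**2*0.1))')  # accepted
```

A check like this would run once per expression, next to whatever other validation is applied, so it adds nothing to the per-item evaluation cost.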