使用Pyparsing以二叉树方式解析复杂的逻辑表达式

19

我正在尝试解析类似下面这样的复杂逻辑表达式;

x > 7 AND x < 8 OR x = 4

并将解析后的字符串作为二叉树获取。对于上述表达式,期望的解析表达式应该如下所示

[['x', '>', 7], 'AND', [['x', '<', 8], 'OR', ['x', '=', 4]]]

'OR'逻辑运算符的优先级高于'AND'运算符。括号可以覆盖默认优先级。为了更加通用,解析后的表达式应该如下所示;

'OR'逻辑运算符比'AND'运算符具有更高的优先级。使用括号可以覆盖默认优先级。为了更通用,解析后的表达式应该像这样;

<left_expr> <logical_operator> <right_expr>

另一个例子是

input_string = x > 7 AND x < 8 AND x = 4
parsed_expr  = [[['x', '>', 7], 'AND', ['x', ',', 8]], 'AND', ['x', '=', 4]]

到目前为止,我想出了这个简单的解决方案,但它无法以二叉树方式生成解析表达式。在存在与先前示例中相同的逻辑运算符连续出现的情况下,operatorPrecedence似乎对我没有帮助。

import pyparsing as pp
complex_expr = pp.Forward()
operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
logical = (pp.Keyword("AND") | pp.Keyword("OR")).setName("logical")
vars = pp.Word(pp.alphas, pp.alphanums + "_") | pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
condition = (vars + operator + vars)
clause = pp.Group(condition ^ (pp.Suppress("(") + complex_expr + pp.Suppress(")") ))

expr = pp.operatorPrecedence(clause,[
                            ("OR", 2, pp.opAssoc.LEFT, ),
                            ("AND", 2, pp.opAssoc.LEFT, ),])

complex_expr << expr
print complex_expr.parseString("x > 7 AND x < 8 AND x = 4")

非常感谢您的意见和指导。

表达式(不包括括号)的BNF如下:

<expr>       -> <expr> | <expr> <logical> <expr>
<expr>       -> <opnd> <relational> <opnd>
<opnd>       -> <variable> | <numeric>
<relational> -> <'>'> | <'='> | <'>='> | <'<='> | <'!='>

你的代码有点难以理解,可以发布一下BNF语法吗? - georg
刚刚加入了BNF语法规则...但我不确定它是否明确无歧义。 - consumer
2个回答

16
注意:pyparsing 的 `operatorPrecedence` 方法已被弃用,推荐使用 `infixNotation` 方法。
尝试进行更改:
expr = pp.operatorPrecedence(clause,[ 
                            ("OR", 2, pp.opAssoc.LEFT, ), 
                            ("AND", 2, pp.opAssoc.LEFT, ),]) 

to:

expr = pp.operatorPrecedence(condition,[ 
                            ("OR", 2, pp.opAssoc.LEFT, ), 
                            ("AND", 2, pp.opAssoc.LEFT, ),]) 

operatorPrecedence的第一个参数是要与运算符一起使用的原始操作数,无需在括号中包含complexExpr - operatorPrecedence会为您完成这项工作。由于你的操作数实际上是另一个更深层次的比较,你可能需要考虑改变:
condition = (expr + operator + expr)

to:

condition = pp.Group(expr + operator + expr)

为了使operatorPrecedence的输出更易处理。通过这些更改,解析x > 7 AND x < 8 OR x = 4得到以下结果:
[[['x', '>', '7'], 'AND', [['x', '<', '8'], 'OR', ['x', '=', '4']]]]

这句话的意思是:“它识别OR的优先级更高,并首先对其进行分组。(您确定要使用这种AND和OR优先级的顺序吗?我认为传统顺序是相反的,如这篇维基百科条目所示。)”
“我认为你还在问为什么pyparsing和operatorPrecedence没有返回嵌套的二进制对结果,也就是说,你希望解析“A and B and C”会返回:”
[['A', 'and', 'B'] 'and', 'C']

但是你得到的是:
['A', 'and', 'B', 'and', 'C']

那是因为operatorPrecedence使用重复而不是递归来解析相同优先级的重复操作。请参见this question,它非常类似于您的问题,并且其答案包括一个解析操作,将您的重复解析树转换为更传统的二进制解析树。您还可以在pyparsing wiki页面上找到使用operatorPrecedence实现的样本布尔表达式解析器编辑: 为了澄清,这是我建议您将解析器缩小到的内容:
import pyparsing as pp

operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
number = pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
identifier = pp.Word(pp.alphas, pp.alphanums + "_")
comparison_term = identifier | number 
condition = pp.Group(comparison_term + operator + comparison_term)

expr = pp.operatorPrecedence(condition,[
                            ("AND", 2, pp.opAssoc.LEFT, ),
                            ("OR", 2, pp.opAssoc.LEFT, ),
                            ])

print expr.parseString("x > 7 AND x < 8 OR x = 4")

如果您也想添加对 NOT 的支持,那么它将如下所示:
expr = pp.operatorPrecedence(condition,[
                            ("NOT", 1, pp.opAssoc.RIGHT, ),
                            ("AND", 2, pp.opAssoc.LEFT, ),
                            ("OR", 2, pp.opAssoc.LEFT, ),
                            ])

在某些时候,您可能想要通过一个更完整的算术表达式来扩展comparison_term的定义,该表达式使用自己的operatorPrecedence定义。我建议这样做,而不是创建一个大型的opPrec定义,因为您已经提到了一些opPrec的性能问题。如果您仍然遇到性能问题,请查看ParserElement.enablePackrat

嗨,保罗, 谢谢你的指导。不过我这里还有一个问题。使用你建议的三级优先级已经增加了解析时间,语法验证时间也大幅增长。你看到有没有可能提高查询性能? - consumer
1
我有点困惑我所建议的第三级是什么。实际上,我建议您删除 complex_exprclause 的添加,只使用 'expr'。我并没有建议您将任何内容添加到运算符列表中。我确实建议您重新审视运算符的顺序,因为传统数学在AND之前评估OR,而不是像您目前所做的那样反过来。 - PaulMcG
@PaulMcG 你好,字符串='response_code!=404 AND content='failed' OR code=abc',你能帮我吗?如何解析这个查询字符串? - Droid
你有一种新的comparison_term,一个带引号的字符串。我认为只需要改成comparison_term = identifier | number | pp.QuotedString("'")就可以了(未经测试)。将来使用expr.runTests(test_string),解析错误会被突出显示给你。 - PaulMcG

7

让我来推荐这种解析方法,它直接来自于Peter Norvig 在优达学城的计算机程序设计课程中介绍的方法(并针对您的需求进行了调整)。

from functools import update_wrapper
from string import split
import re

def grammar(description, whitespace=r'\s*'):
    """Convert a description to a grammar.  Each line is a rule for a
    non-terminal symbol; it looks like this:
        Symbol =>  A1 A2 ... | B1 B2 ... | C1 C2 ...
    where the right-hand side is one or more alternatives, separated by
    the '|' sign.  Each alternative is a sequence of atoms, separated by
    spaces.  An atom is either a symbol on some left-hand side, or it is
    a regular expression that will be passed to re.match to match a token.

    Notation for *, +, or ? not allowed in a rule alternative (but ok
    within a token). Use '\' to continue long lines.  You must include spaces
    or tabs around '=>' and '|'. That's within the grammar description itself.
    The grammar that gets defined allows whitespace between tokens by default;
    specify '' as the second argument to grammar() to disallow this (or supply
    any regular expression to describe allowable whitespace between tokens)."""
    G = {' ': whitespace}
    description = description.replace('\t', ' ') # no tabs!
    for line in split(description, '\n'):
        lhs, rhs = split(line, ' => ', 1)
        alternatives = split(rhs, ' | ')
        G[lhs] = tuple(map(split, alternatives))
    return G

def decorator(d):
    def _d(fn):
        return update_wrapper(d(fn), fn)
    update_wrapper(_d, d)
    return _d

@decorator
def memo(f):
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            # some element of args can't be a dict key
            return f(args)
    return _f

def parse(start_symbol, text, grammar):
    """Example call: parse('Exp', '3*x + b', G).
    Returns a (tree, remainder) pair. If remainder is '', it parsed the whole
    string. Failure iff remainder is None. This is a deterministic PEG parser,
    so rule order (left-to-right) matters. Do 'E => T op E | T', putting the
    longest parse first; don't do 'E => T | T op E'
    Also, no left recursion allowed: don't do 'E => E op T'"""

    tokenizer = grammar[' '] + '(%s)'

    def parse_sequence(sequence, text):
        result = []
        for atom in sequence:
            tree, text = parse_atom(atom, text)
            if text is None: return Fail
            result.append(tree)
        return result, text

    @memo
    def parse_atom(atom, text):
        if atom in grammar:  # Non-Terminal: tuple of alternatives
            for alternative in grammar[atom]:
                tree, rem = parse_sequence(alternative, text)
                if rem is not None: return [atom]+tree, rem  
            return Fail
        else:  # Terminal: match characters against start of text
            m = re.match(tokenizer % atom, text)
            return Fail if (not m) else (m.group(1), text[m.end():])

    # Body of parse:
    return parse_atom(start_symbol, text)

Fail = (None, None)

MyLang = grammar("""expression => block logicalop expression | block
block => variable operator number
variable => [a-z]+
operator => <=|>=|>|<|=
number => [-+]?[0-9]+
logicalop => AND|OR""", whitespace='\s*')

def parse_it(text):
    return parse('expression', text, MyLang)

print parse_it("x > 7 AND x < 8 AND x = 4")

输出:

(['expression', ['block', ['variable', 'x'], ['operator', '>'], ['number', '7']], ['logicalop', 'AND'], ['expression', ['block', ['variable', 'x'], ['operator', '<'], ['number', '8']], ['logicalop', 'AND'], ['expression', ['block', ['variable', 'x'], ['operator', '='], ['number', '4']]]]], '')

1
感谢Luke的建议...但是Paul提出的解决方案才是我正在寻找的。谢谢你花时间 :) - consumer

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接