如何解析语法 `(a | b)* a`

Question

如何解析语法 `(a | b)* a`

3

让我们考虑下面的Backus-Naur形式所描述的语法：

a ::= 'a'
b ::= 'b'
grammar ::= (a | b)* a

我正在尝试使用pyparsing进行解析，并使用以下实现方式

a = Literal('a')
b = Literal('b')
grammar = (a | b)[...] + 'a'

然而，它未能解析语法描述的任何字符串，例如grammar.parseString('aba')会引发错误。

ParseException: Expected "a", found end of text  (at char 3), (line:1, col:4)

似乎是由于表达式 [...] 在解析时会一直消耗标记，直到不再可能解析为止。然后，最后一个字面量没有要解析的标记。

解决方法是使用 FollowedBy 类：

grammar = ((a | b) + FollowedBy(a | b))[...] + a

这种方法可以运作。然而，它非常不优雅，似乎也不高效，并且不是很通用。

有没有更好的方式使用 pyparsing 解析这个语法？

- ldbo

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- PaulMcG · Accepted Answer

不，你说得完全正确，pyparsing 不像正则表达式 "[ab]*a" 那样进行回溯。除非你明确实现它，否则 pyparsing 不会进行任何形式的前瞻。

这是你原始解析器的扩展版本，增加了 setName 和 setDebug 调用：

a = Literal('a').setName("A").setDebug()
b = Literal('b').setName("B").setDebug()
grammar = (a | b)[...] + a().setName("trailing_a").setDebug()

grammar.runTests("""\
    aba
    """)

当解析“aba”时，以下是调试输出：

Match A at loc 0(1,1)
Matched A -> ['a']
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Match A at loc 2(1,3)
Matched A -> ['a']
Match A at loc 3(1,4)
Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
Match B at loc 3(1,4)
Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)
Match trailing_a at loc 3(1,4)
Exception raised:Expected trailing_a, found end of text  (at char 3), (line:1, col:4)

aba
   ^
FAIL: Expected trailing_a, found end of text  (at char 3), (line:1, col:4

你可以看到trailing_a被匹配为初始重复的一部分，而不是作为一个trailing_a。由于现在没有实际的尾随'a'，所以解析失败。

你可以为前导重复定义一个特殊形式的a（就像你在一行中所做的那样），像这样分成两行：

leading_a = a + FollowedBy(a | b)
grammar = (leading_a | b)[...] + 'a'

通过调试输出，我们可以跟踪解析器的逻辑：

Match leading_a at loc 0(1,1)
Match A at loc 0(1,1)
Matched A -> ['a']
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Matched leading_a -> ['a']
Match leading_a at loc 1(1,2)
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Match leading_a at loc 2(1,3)
Match A at loc 2(1,3)
Matched A -> ['a']
Match A at loc 3(1,4)
Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
Match B at loc 3(1,4)
Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)
Exception raised:Expected {A | B}, found end of text  (at char 3), (line:1, col:4)
Match B at loc 2(1,3)
Exception raised:Expected B, found 'a'  (at char 2), (line:1, col:3)

aba
['a', 'b', 'a']

或者定义一个特殊的trailing_a，并使用stopOn参数来使用ZeroOrMore：

trailing_a = a + ~FollowedBy(a | b)
grammar = OneOrMore(a | b, stopOn=trailing_a) + 'a'

获取类似的结果。

编辑将语法更改为(a | b)[...]会显示此调试输出：

Match A at loc 0(1,1)
Matched A -> ['a']
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Match A at loc 2(1,3)
Matched A -> ['a']
Match A at loc 3(1,4)
Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
Match B at loc 3(1,4)
Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)

aba
['a', 'b', 'a']

是的，前瞻会带来性能损失。

pyparsing包括一个内部缓存功能，也称为“packrat解析”。以下是调试输出，其中缓存值用“*”标记：

  Match trailing_a at loc 0(1,1)
  Match A at loc 0(1,1)
  Matched A -> ['a']
  Match A at loc 1(1,2)
  Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
  Match B at loc 1(1,2)
  Matched B -> ['b']
  Exception raised:Found unwanted token, FollowedBy:({A | B}), found 'b'  (at char 1), (line:1, col:2)
* Match A at loc 0(1,1)
* Matched A -> ['a']
  Match trailing_a at loc 1(1,2)
* Match A at loc 1(1,2)
* Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
* Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
* Match A at loc 1(1,2)
* Exception raised:Expected A, found 'b'  (at char 1), (line:1, col:2)
* Match B at loc 1(1,2)
* Matched B -> ['b']
  Match trailing_a at loc 2(1,3)
  Match A at loc 2(1,3)
* Matched A -> ['a']
  Match A at loc 3(1,4)
  Exception raised:Expected A, found end of text  (at char 3), (line:1, col:4)
  Match B at loc 3(1,4)
  Exception raised:Expected B, found end of text  (at char 3), (line:1, col:4)
  Matched trailing_a -> ['a']
* Match trailing_a at loc 2(1,3)
* Match A at loc 2(1,3)
* Matched A -> ['a']
* Matched trailing_a -> ['a']

"Match"操作摘要：

无先行断言：6
有先行断言：12
有先行断言+packrat：9

最后需要注意的是，可以在解析时注入额外的逻辑，使用解析动作（parse actions）。解析动作可以编写为一个方法（可以返回修改后的标记集，或引发异常），或者作为谓词函数（返回True或False，在返回False时，pyparsing将引发异常）。

因此，您可以使用最快的无先行断言形式编写语法，然后在之后添加一个验证条件来运行：

grammar = (a | b)[...]
grammar.addCondition(lambda t: t[-1] == 'a', message="string does not end with 'a'")

很有可能解析时间的节省足以抵消执行单独条件评估的额外成本。