不,你说得完全正确,pyparsing 不像正则表达式 "[ab]*a"
那样进行回溯。除非你明确实现它,否则 pyparsing 不会进行任何形式的前瞻。
这是你原始解析器的扩展版本,增加了 setName
和 setDebug
调用:
a = Literal('a').setName("A").setDebug()
b = Literal('b').setName("B").setDebug()
grammar = (a | b)[...] + a().setName("trailing_a").setDebug()
grammar.runTests("""\
aba
""")
当解析“aba”时,以下是调试输出:
Match A at loc 0(1,1)
Matched A -> ['a']
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b' (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Match A at loc 2(1,3)
Matched A -> ['a']
Match A at loc 3(1,4)
Exception raised:Expected A, found end of text (at char 3), (line:1, col:4)
Match B at loc 3(1,4)
Exception raised:Expected B, found end of text (at char 3), (line:1, col:4)
Match trailing_a at loc 3(1,4)
Exception raised:Expected trailing_a, found end of text (at char 3), (line:1, col:4)
aba
^
FAIL: Expected trailing_a, found end of text (at char 3), (line:1, col:4
你可以看到trailing_a
被匹配为初始重复的一部分,而不是作为一个trailing_a
。由于现在没有实际的尾随'a',所以解析失败。
你可以为前导重复定义一个特殊形式的a
(就像你在一行中所做的那样),像这样分成两行:
leading_a = a + FollowedBy(a | b)
grammar = (leading_a | b)[...] + 'a'
通过调试输出,我们可以跟踪解析器的逻辑:
Match leading_a at loc 0(1,1)
Match A at loc 0(1,1)
Matched A -> ['a']
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b' (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Matched leading_a -> ['a']
Match leading_a at loc 1(1,2)
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b' (at char 1), (line:1, col:2)
Exception raised:Expected A, found 'b' (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Match leading_a at loc 2(1,3)
Match A at loc 2(1,3)
Matched A -> ['a']
Match A at loc 3(1,4)
Exception raised:Expected A, found end of text (at char 3), (line:1, col:4)
Match B at loc 3(1,4)
Exception raised:Expected B, found end of text (at char 3), (line:1, col:4)
Exception raised:Expected {A | B}, found end of text (at char 3), (line:1, col:4)
Match B at loc 2(1,3)
Exception raised:Expected B, found 'a' (at char 2), (line:1, col:3)
aba
['a', 'b', 'a']
或者定义一个特殊的trailing_a
,并使用stopOn
参数来使用ZeroOrMore
:
trailing_a = a + ~FollowedBy(a | b)
grammar = OneOrMore(a | b, stopOn=trailing_a) + 'a'
获取类似的结果。
编辑
将语法更改为(a | b)[...]
会显示此调试输出:
Match A at loc 0(1,1)
Matched A -> ['a']
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b' (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Match A at loc 2(1,3)
Matched A -> ['a']
Match A at loc 3(1,4)
Exception raised:Expected A, found end of text (at char 3), (line:1, col:4)
Match B at loc 3(1,4)
Exception raised:Expected B, found end of text (at char 3), (line:1, col:4)
aba
['a', 'b', 'a']
是的,前瞻会带来性能损失。
pyparsing包括一个内部缓存功能,也称为“packrat解析”。以下是调试输出,其中缓存值用“*”标记:
Match trailing_a at loc 0(1,1)
Match A at loc 0(1,1)
Matched A -> ['a']
Match A at loc 1(1,2)
Exception raised:Expected A, found 'b' (at char 1), (line:1, col:2)
Match B at loc 1(1,2)
Matched B -> ['b']
Exception raised:Found unwanted token, FollowedBy:({A | B}), found 'b' (at char 1), (line:1, col:2)
* Match A at loc 0(1,1)
* Matched A -> ['a']
Match trailing_a at loc 1(1,2)
* Match A at loc 1(1,2)
* Exception raised:Expected A, found 'b' (at char 1), (line:1, col:2)
* Exception raised:Expected A, found 'b' (at char 1), (line:1, col:2)
* Match A at loc 1(1,2)
* Exception raised:Expected A, found 'b' (at char 1), (line:1, col:2)
* Match B at loc 1(1,2)
* Matched B -> ['b']
Match trailing_a at loc 2(1,3)
Match A at loc 2(1,3)
* Matched A -> ['a']
Match A at loc 3(1,4)
Exception raised:Expected A, found end of text (at char 3), (line:1, col:4)
Match B at loc 3(1,4)
Exception raised:Expected B, found end of text (at char 3), (line:1, col:4)
Matched trailing_a -> ['a']
* Match trailing_a at loc 2(1,3)
* Match A at loc 2(1,3)
* Matched A -> ['a']
* Matched trailing_a -> ['a']
"Match"操作摘要:
- 无先行断言:6
- 有先行断言:12
- 有先行断言+packrat:9
最后需要注意的是,可以在解析时注入额外的逻辑,使用解析动作(parse actions)。解析动作可以编写为一个方法(可以返回修改后的标记集,或引发异常),或者作为谓词函数(返回True或False,在返回False时,pyparsing将引发异常)。
因此,您可以使用最快的无先行断言形式编写语法,然后在之后添加一个验证条件来运行:
grammar = (a | b)[...]
grammar.addCondition(lambda t: t[-1] == 'a', message="string does not end with 'a'")
很有可能解析时间的节省足以抵消执行单独条件评估的额外成本。
(a | b)*
语法相比效率如何吗?特别是,每个标记是否被解析两次,还是可以缓存结果?启用packrat解析会有影响吗? - ldbo