Shunting-yard算法中如何处理额外的运算符

6

给定以下输入: 3+4+ 算法将它转换成 3 4 + +

我可以在执行后缀表达式时发现错误。 但是,在转换过程中是否可能发现这个错误呢?

(维基百科文章和我阅读的互联网文章没有处理此情况)

谢谢


这个转换使用了哪个算法? - Bill
http://en.wikipedia.org/wiki/Shunting-yard_algorithm - mikbal
如果您将“3+4+”作为输入进行后缀转换,则您的代码应首先进行验证并返回“无效格式”。 - Bill
1
@Bill 验证中缀表达式与转换为后缀表达式一样困难,特别是当我有像^ ! ~这样的单输入运算符时。就像需要两个独立的解析器来处理表达式一样。 - mikbal
1个回答

28

除了括号不匹配之外,正则表达式可以用来验证有效的表达式。(不匹配的括号会被 shunting-yard 算法捕捉到,如维基百科页面所述,因此我忽略了这些。)

正则表达式如下:

PRE* OP POST* (INF PRE* OP POST*)*

其中:

PRE  is a prefix operator or (
POST is a postfix operator or )
INF  is an infix operator
OP   is an operand (a literal or a variable name)

你提供的维基百科页面包含函数调用但不包括数组下标;如果你想处理这些情况,那么:

PRE  is a prefix operator or (
POST is a postfix operator or ) or ]
INF  is an infix operator or ( or ,
OP   is an operand, including function names

需要注意的一点是上面提到的PREINF处于不相交的上下文中;也就是说,如果PRE是有效的,则INF无效,反之亦然。这意味着,使用相同的符号作为前缀运算符和中缀运算符是明确无误的。此外,PREPOST也处于不相交的上下文中,因此您可以使用相同的符号作为前缀和后缀运算符。但是,不能将相同的符号用于后缀和中缀运算符。例如,考虑C/C++运算符:

-  prefix or infix
+  prefix or infix
-- prefix or postfix
++ prefix or postfix

我之前隐式地利用了这个特性,使得(既可以作为表达式分组符(实际上是前缀),也可以作为函数调用符(中缀,因为它位于函数名和参数列表之间;操作符是“调用”)。

通常最常见的实现方式是将该正则表达式作为状态机,因为只有两个状态:

 +-----+            +-----+
 |State|-----OP---->|State|
 |  1  |<----INF----|  2  |
 |     |---+        |     |---+
 +-----+   |        +-----+   |
    ^     PRE          ^     POST
    |      |           |      |
    +------+           +------+

我们可以将状态1称为“希望操作数”,将状态2称为“拥有操作数”。一个简单的实现只需将维基百科中介绍的Shunting Yard算法分成两个循环,类似于这样(如果您不喜欢goto,可以消除它,但这确实是展示状态机最清晰的方式)。

want_operand:
  read a token. If there are no more tokens, announce an error.
  if the token is an prefix operator or an '(':
    mark it as prefix and push it onto the operator stack
    goto want_operand
  if the token is an operand (identifier or variable):
    add it to the output queue
    goto have_operand
  if the token is anything else, announce an error and stop. (**)

have_operand:
  read a token
  if there are no more tokens:
    pop all operators off the stack, adding each one to the output queue.
    if a `(` is found on the stack, announce an error and stop.
  if the token is a postfix operator:
    mark it as postfix and add it to the output queue
    goto have_operand.
  if the token is a ')':
    while the top of the stack is not '(':
      pop an operator off the stack and add it to the output queue
    if the stack becomes empty, announce an error and stop.
    if the '(' is marked infix, add a "call" operator to the output queue (*)
    pop the '(' off the top of the stack
    goto have_operand
  if the token is a ',':
    while the top of the stack is not '(':
      pop an operator off the stack and add it to the output queue
    if the stack becomes empty, announce an error
    goto want_operand
  if the token is an infix operator:
    (see the wikipeda entry for precedence handling)
    (normally, all prefix operators are considered to have higher precedence
    than infix operators.)
    got to want_operand
  otherwise, token is an operand. Announce an error

(*) The procedure as described above does not deal gracefully with parameter lists;
    when the postfix expression is being evaluated and a "call" operator is found, it's
    not clear how many arguments need to be evaluated. It might be that function names
    are clearly identifiable, so that the evaluator can just attach arguments to the
    "call" until a function name is found. But a cleaner solution is for the "," handler
    to increment the argument count of the "call" operator (that is, the open
    parenthesis marked as "infix"). The ")" handler also needs to increment the
    argument count.

(**) The state machine as presented does not correctly handle function calls with 0
    arguments. This call will show up as a ")" read in the want_operand state and with
    a "call" operator on top of the stack. If the "call" operator is marked with an
    argument count, as above, then the argument count must be 0 when the ")" is read,
    and in this case, unlike the have_operand case, the argument count must not be
    incremented.

1
这是一篇非常棒的文章 :). 正则表达式是一个有趣的想法。可以调用解析器的非起始部分,因为每个函数调用/变量/构造函数可能包含单独的表达式。这使得任何状态机测试基本上都变成了LR表。作为初步的错误系统,我正在计算两个操作数(+ * - / %),并检查if (#of2oper + 1 == #ofValue)。这是可能的,因为值是使用它们自己的解析器(函数调用/变量解析器/构造函数等)单独解析的。所以我在这个问题上采用了混合栈/递归的方法 :)。谢谢! - mikbal
1
@mikbal:你可能可以使用运算符优先级语法来处理所有值。维基百科页面展示了如何进行函数调用,我也对此进行了扩展;这也适用于构造函数。运算符优先级确实是一种LR技术;LALR(1)解析器被证明更强大,但许多语言可以使用op-prec / shunting-yard / Pratt或其他变体同样很好地解析。 - rici
这个答案超越了问题中提到的具体错误。它通过状态机提供了一种验证算法输入是否为有效中缀符号的机制。 - denver
1
感谢您提供如此美丽和清晰的解释。这应该链接到官方维基百科页面,以获取更多详细信息和简洁性。7年过去了,这仍然是最具信息量的帖子。 - rane

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接