Which contemporary computer languages are LL(1)?

35
(I'm spending some vacation time studying a bit of language theory, so forgive me if this is a naive question.)
According to what is stated here:

LL grammars, particularly LL(1) grammars, are of great practical interest, as parsers for these grammars are easy to construct, and many computer languages are designed to be LL(1) for this reason.

Out of curiosity: which contemporary computer languages are LL(1)? Do C, Java, C# or Python fall into that category?

I once read a very interesting article on this topic [using grammars and regular expressions] and thought I had bookmarked it, but unfortunately it seems I didn't. I seem to recall that Perl was one of them. (Sorry for the unhelpful comment.) - Martin
1
@Martin Perl is definitely not LL(1). Parsing Perl is undecidable. - sepp2k
So, conversely, Perl is at the bottom of the scale (LL4?). Interesting, since it is the only language written by a linguist, so relative to machine reading it should be easier to write in a human way. - Martin
1
A better place to ask would be Lambda the Ultimate. See also the related question Good languages with simple grammar. - Guy Coder
2
@martin: The (k) in LL(k) refers to the number of lookahead tokens you may need to examine in order to decide what to do with the current token. Many languages are simply not LL at all, and in practice increasing k rarely helps, although you can increase the power of LL parsing by allowing k to grow without bound (see ANTLR, for example). In that case the parser is no longer linear-time, and you may need a more powerful algorithm such as LR. - rici
Ah, I completely misunderstood. What I had read about was a ranking of computer-language grammars by complexity, on a scale of 1 to 4, which I thought was what "LL" referred to; that was the interesting link I can't find. Sorry for the confusion. - Martin
2 Answers

28
I think that Wikipedia quote deserves a [citation needed], because the assumption behind it is at least questionable. For example, there is a large number of yacc-based compiler-construction tools with which it is actually very easy to build a parser using the more powerful (and just as fast) LALR algorithm, and some of them also implement the even more powerful (and, for most practical grammars, nearly as fast) GLR algorithm. Hand-writing parsers hasn't been necessary for decades. [Note 1]
Trying to answer the question:
  1. Most computer languages are "technically" not LL because they are not even context-free. That would include languages which require identifiers to be declared (C, C++, C#, Java, etc.), as well as languages with preprocessors and/or macro facilities (C and C++, among others), languages with ambiguities which can only be resolved using semantic information (Perl would be the worst offender here, but C and C++ are also right up there). And, just to spread the joy around some more, it also includes languages which have Python-like layout awareness (Python, of course, and also Haskell).

    I put scare quotes around "technically" because there are a lot of people who will say that these exceptions "don't count". That is their opinion, and they are entitled to it (and anyway I've given up arguing about it, although I don't share that opinion). You could, for example, eliminate the preprocessor from C/C++ by preprocessing the text before applying the grammar, or preprocess whitespace-aware languages by inserting INDENT, NEWLINE and DEDENT tokens instead of the whitespace, after which you could make some kind of claim about a mystical "core language". (That is much harder to apply to C++ templates, which can only be eliminated by parsing the text, so you can't claim that the expansion can be done prior to parsing.)
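
    As a rough illustration of that preprocessing step, here is a minimal sketch in Python (nothing like CPython's actual tokenizer; the token names and shapes are invented) that turns leading whitespace into explicit INDENT/NEWLINE/DEDENT tokens, so that a context-free grammar can then be written over the resulting token stream:

    # Minimal layout-preprocessor sketch: replaces significant whitespace with
    # explicit tokens. Token names and shapes are made up for illustration.
    def layout_tokens(lines):
        indent_stack = [0]            # widths of the currently open blocks
        tokens = []
        for line in lines:
            if not line.strip():      # blank lines carry no layout information
                continue
            width = len(line) - len(line.lstrip(" "))
            if width > indent_stack[-1]:
                indent_stack.append(width)
                tokens.append("INDENT")
            while width < indent_stack[-1]:
                indent_stack.pop()
                tokens.append("DEDENT")
            tokens.append(("LINE", line.strip()))
            tokens.append("NEWLINE")
        while len(indent_stack) > 1:  # close any blocks still open at EOF
            indent_stack.pop()
            tokens.append("DEDENT")
        return tokens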

    The claim that a language is not context-free because it requires identifiers to be declared is perhaps a bit more controversial. In some languages (not C/C++, in which the semantic analysis is required to avoid ambiguity), you could argue that a parse tree could be constructed without validating the declaration rule, which makes that rule "not syntactic". But it remains the case that you can enforce the declaration rule syntactically; you just cannot do it with a context-free grammar (you could use a Van Wijngaarden grammar, for example).

    In any case, a common parsing strategy is to recognize a superset of a language, and then reject non-conforming programs through a subsequent analysis of the resulting parse tree; in that case, the question becomes whether or not the superset is LL. In my opinion, in order for that to be meaningful, it is necessary that every valid program can be parsed to the correct syntax tree, which eliminates the usage of semantic analysis to disambiguate.

  2. The goal of parsing is to produce a syntax tree, not solely to recognize whether a text is valid or not. (You might miss this important fact if you skim over formal language textbooks which tend to focus on recognition, possibly because the construction of syntax trees is less interesting.) So it seems to be reasonable to insist that a proposed grammar actually represent the syntactic structure of the language.

    For example, you can recognize algebraic expressions using a simple regular language:

    (PREFIX* OPERAND POSTFIX*) (INFIX PREFIX* OPERAND POSTFIX*)*
    

    where PREFIX is the set of prefix operators as well as (, POSTFIX is the set of postfix operators (if any) as well as ), INFIX is the set of infix operators (addition, subtraction, multiplication, etc.), and OPERAND is an identifier or constant.

    That regular expression will not correctly reject expressions with unbalanced parentheses, but we already agreed that it was OK to recognize a superset of the language, right? :-)
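
    As a concrete (and entirely hypothetical) illustration, the token-level pattern above can be checked with an ordinary regular expression once each token kind is mapped to a single letter; note that it happily accepts the unbalanced cases:

    import re

    # P = PREFIX (including '('), O = OPERAND, S = POSTFIX (including ')'),
    # I = INFIX. The single-letter encoding exists only for this sketch.
    TOKEN_PATTERN = re.compile(r"P*OS*(?:IP*OS*)*")

    def looks_like_expression(token_kinds):
        """token_kinds: a sequence of 'P', 'O', 'S', 'I' codes."""
        return TOKEN_PATTERN.fullmatch("".join(token_kinds)) is not None

    # looks_like_expression("POS")  -> True   e.g.  ( a )
    # looks_like_expression("OS")   -> True   e.g.  a )   unbalanced, but accepted
    # looks_like_expression("OIO")  -> True   e.g.  a + b
    # looks_like_expression("OI")   -> False  e.g.  a +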

    If desired, we could remove the parentheses from the PREFIX and POSTFIX sets and make OPERAND a recursive production. The resulting grammar is trivially LL(1) [Note 2]:

    operand    => IDENTIFIER | CONSTANT | '(' expression ')'
    term       => operand | PREFIX operand | operand POSTFIX
    expression => term | term INFIX expression
    

    The problem, however, is that this grammar does not capture operator precedence. It does not even attempt to recognize the fact that a+b*c and a*b+c are both sums, so that the top-level operator is + in both cases, and not whatever operator happens to come first in the expression. (If you were parsing APL, this wouldn't be an issue. But most languages conform to the usual expectations about operator precedence.)

    This point is important because an LL grammar cannot handle left-recursive productions, and you need a left-recursive production in order to correctly parse a left-associative operator. (That is, to correctly parse a-b-c as ((a-b)-c) rather than (a-(b-c)), which would have a different value.) Again, you could argue that this is a quibble, since you could post-process the parse tree in order to correct the associativity. This is certainly true, but the result is that the grammar you use to parse is different from the grammar you use to explain the syntax of the language. This might or might not bother you.
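
    To make that concrete, another common way around the restriction (not the post-processing mentioned above, but iteration inside the parser) is sketched below; the function name and token shapes are invented. The parser effectively uses the right-factored rule expression => term (INFIX term)*, but the loop folds operands to the left, so the tree still comes out left-associated:

    # Sketch only: every term is assumed to be a bare operand string here.
    INFIX_OPS = {"+", "-", "*", "/"}

    def parse_expression(tokens):
        """tokens: a flat list like ["a", "-", "b", "-", "c"]."""
        pos = 0

        def parse_term():
            nonlocal pos
            operand = tokens[pos]
            pos += 1
            return operand

        tree = parse_term()
        while pos < len(tokens) and tokens[pos] in INFIX_OPS:
            op = tokens[pos]
            pos += 1
            tree = (op, tree, parse_term())   # fold to the left
        return tree

    # parse_expression(["a", "-", "b", "-", "c"])  ->  ('-', ('-', 'a', 'b'), 'c')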

    It's worth adding here that there are languages (Haskell, Prolog, probably many more) in which you can define operators and their precedence in the program text. That obviously makes it impossible to generate a correct syntax tree without semantic analysis, and the usual approach to parsing such languages is to use precisely the simplified "alternating operand and operator" language outlined above. Once the operator precedences are all known, presumably at the end of the parse, all expressions are then reanalyzed using something like the Shunting Yard Algorithm, in order to produce the correct parse.
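
    Here is a minimal shunting-yard-style sketch of that second pass, with an invented precedence table standing in for the operator declarations collected during the first parse:

    PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}   # assumed, all left-associative

    def reparse(tokens):
        """tokens: the flat operand/operator sequence, e.g. ["a", "+", "b", "*", "c"]."""
        operands, operators = [tokens[0]], []

        def reduce_top():
            op = operators.pop()
            right, left = operands.pop(), operands.pop()
            operands.append((op, left, right))

        i = 1
        while i < len(tokens):
            op, operand = tokens[i], tokens[i + 1]
            # pop anything that binds at least as tightly as the incoming operator
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[op]:
                reduce_top()
            operators.append(op)
            operands.append(operand)
            i += 2
        while operators:
            reduce_top()
        return operands[0]

    # reparse(["a", "+", "b", "*", "c"])  ->  ('+', 'a', ('*', 'b', 'c'))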

  3. Let's suppose we discard the above objections, and just ask "for which commonly used programming languages might an LL parser be used?"

    In order to avoid cheating, though, we should really require that the LL parser have fixed lookahead and not require backtracking. If you allow arbitrary lookahead and backtracking, then you considerably expand the domain of parseable languages, but I contend that you are no longer in the realm of LL. That will eliminate, for example, the template- and preprocessor-free subset of C++, even though the common compiler implementations use recursive descent parsers, since backtracking is required in order to resolve the "Most Vexing Parse" ambiguity. (Anyway, you can't really remove templates from C++, and parsing with templates is really hard. [Note 3])

    Java and Python were both designed to be LL(1) "pseudo-parseable". I believe Haskell falls into this category as well. C cannot be correctly parsed without semantic information (classic ambiguity: is (x)*(y) a cast or a multiplication? -- it depends on whether x has been typedef'ed or declared as a variable) but I'm pretty sure it can be captured with a non-backtracking recursive-descent parser. I haven't looked at C# in depth, but this answer by Eric Lippert suggests that it requires backtracking in some cases.

Notes

  1. Of course, people still do it, and in many cases there are good reasons for doing so (producing better error messages, for example). But "building an LALR parser is hard" is not a good reason, because it isn't.

  2. It's not quite LL, because I didn't left-factor it. But that's easy to do; I'll leave it as an exercise. (A left-factored sketch follows these notes.)

  3. See Is C++ context-free or context-sensitive? And also Todd Veldhuizen's classic C++ Templates are Turing Complete.
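
Following up on Note 2, here is one hypothetical left-factored version of the grammar from point 2, with a minimal recursive-descent sketch over an invented (kind, text) token stream; it is an illustration of the technique, not a production parser:

    # Left-factored grammar (each alternative now starts with a distinct set of
    # tokens, so one token of lookahead is enough):
    #   operand    => IDENTIFIER | CONSTANT | '(' expression ')'
    #   term       => PREFIX operand | operand [ POSTFIX ]
    #   expression => term [ INFIX expression ]
    class ExprParser:
        def __init__(self, tokens):
            self.tokens = tokens      # list of (kind, text) pairs, invented shape
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

        def take(self, kind):
            if self.peek() != kind:
                raise SyntaxError(f"expected {kind}, got {self.peek()}")
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok

        def operand(self):
            if self.peek() == "LPAREN":
                self.take("LPAREN")
                inner = self.expression()
                self.take("RPAREN")
                return inner
            if self.peek() in ("IDENTIFIER", "CONSTANT"):
                return self.take(self.peek())[1]
            raise SyntaxError(f"unexpected {self.peek()}")

        def term(self):
            if self.peek() == "PREFIX":
                return (self.take("PREFIX")[1], self.operand())
            node = self.operand()
            if self.peek() == "POSTFIX":
                node = (self.take("POSTFIX")[1], node)
            return node

        def expression(self):
            node = self.term()
            if self.peek() == "INFIX":
                op = self.take("INFIX")[1]
                node = (op, node, self.expression())   # right-leaning, as in the grammar
            return node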


1
To me this is a bit like the fact that physical computers are not universal Turing machines, because they have finite memory. That is obviously true, but it is entirely consistent with Turing machines being a good way to think about physical computers. Similarly, real-life programming languages are almost never context-free, but that obvious truth is entirely consistent with context-free grammars being a good way to think about many of those languages. - John Coleman
3
Your "technically" seems to rest on a failure to distinguish between a "language" and "the grammar of a language". "Time flies like a banana" is a syntactically valid sentence, but it doesn't make much sense semantically. A language's grammar being LL(1) does not require that every syntactically valid program text also be a valid program. Moreover, the fact that some symbols in a language have special lexical representations (such as indentation in Python) does not make the language not LL(1). You mention VW grammars. VW grammars distinguish between arbitrary representations and the grammar's terminal symbols. - LHP
1
Also, VW grammars are rather special, since they are Turing-equivalent and can therefore express the full semantics of a programming language. (As Cleaveland and Uzgalis demonstrated in their book "Grammars for Programming Languages".) - LHP
2
This seems more like an extended comment than an answer. You're right that languages like C and C++ are context-sensitive because they require semantic information to resolve grammatical ambiguities. But claiming that every language that requires identifiers to be declared is context-sensitive is a bit of a stretch. Pascal, for example, requires declarations, but that is a semantic requirement: the declarations are not needed to resolve grammatical ambiguities or to build a parse tree. - Adrian McCarthy
1
In a way, that's what the verse is getting at, and it's what Chomsky was trying to get at with his grammatically correct sentence about colorless green ideas. Likewise, I don't actually need to understand the semantics of a Pascal program to observe that an identifier is not in scope. To me, that is more analogous to subject-verb disagreement (syntax) than to a semantic violation like furious sleep. As I said, I accept that not everyone looks at grammar this way, but it is definitely how I learned the terminology as a mathematical linguist, and I think... - rici

3
If I define "computer language" more broadly than "programming language", I believe there are several declarative languages worth considering, although it may also depend on what you count as contemporary.
Possible candidates include:
- XML
- LLVM IR
- many configuration languages, such as INI files
- some network protocols
- some data formats, such as CSV files
Most (all?) flavours of the language used to describe regular expressions are not themselves regular, but they are LL(1).
As for actual programming languages, these are all fairly dated, and many have common extensions that may violate LL(k).
They include:
- assembly languages
- BASIC
- LISP
- the Wirth languages: Pascal, Modula, Oberon
- Prolog
I am not deeply familiar with all of the languages above, so I only suggest them as potential candidates. If you spot a grammatical ambiguity in any of them, please point it out in the comments.
