或者更准确地说:哪些编程语言是由上下文无关文法定义的?
据我所知,C++不是上下文无关的,因为它包含诸如宏和模板等内容。我的直觉告诉我,函数式语言可能是上下文无关的,但我没有任何硬数据来支持这一点。
对于简洁的示例给予额外的奖励 :-)
或者更准确地说:哪些编程语言是由上下文无关文法定义的?
据我所知,C++不是上下文无关的,因为它包含诸如宏和模板等内容。我的直觉告诉我,函数式语言可能是上下文无关的,但我没有任何硬数据来支持这一点。
对于简洁的示例给予额外的奖励 :-)
哪些编程语言是无上下文的?[...]
我的直觉告诉我,函数式语言可能是无上下文的 [...]
简短版: 实际上几乎没有任何真实世界中的编程语言是以任何意义上的无上下文的。一个语言是否是无上下文与它是否是函数式没有任何关系,这只是语法复杂程度的问题。
这里有一个针对 Brainfuck 的命令式语言的 CFG:
Program → Instr Program | ε
Instr → '+' | '-' | '>' | '<' | ',' | '.' | '[' Program ']'
这里是SKI组合子演算的一个函数式CFG:
Program → E
E → 'S' E E E
E → 'K' E E
E → 'I'
E → '(' E ')'
更详细的版本:通常,无上下文文法(CFG)仅用于粗略指定语言的语法。必须区分语法正确的程序和编译/评估正确的程序。最常见的是,编译器将语言分析分为语法分析,该分析构建并验证代码的一般结构,以及语义分析,该分析验证程序的含义。
如果您所说的“无上下文语言”是指“...所有程序都能编译”,那么答案是:几乎没有。符合此要求的语言几乎没有规则或复杂功能,例如变量的存在,空格敏感性,类型系统或任何其他上下文:在一个地方定义并在另一个地方依赖的信息。
另一方面,如果“无上下文语言”仅意味着“...所有程序都通过语法分析”,那么答案就取决于语法本身的复杂程度。有许多语法特征很难或不可能仅使用CFG描述。其中一些通过向解析器添加附加状态来克服,以跟踪计数器,查找表等。
无法用CFG表达的句法特征示例:
Indentation- and whitespace-sensitive languages like Python and Haskell. Keeping track of arbitrarily nested indentation levels is essentially context-sensitive and requires separate counters for the indentation level; both how many spaces that are used for each level and how many levels there are.
Allowing only a fixed level of indentation using a fixed amount of spaces would work by duplicating the grammar for each level of indentation, but in practice this is inconvenient.
The C Typedef Parsing Problem says that C programs are ambiguous during lexical analysis because it cannot know from the grammar alone if something is a regular identifier or a typedef alias for an existing type.
The example is:
typedef int my_int;
my_int x;
At the semicolon, the type environment needs to be updated with an entry for my_int. But if the lexer has already looked ahead to my_int, it will have lexed it as an identifier rather than a type name.
In context-free grammar terms, the X → ...
rule that would trigger on my_int
is ambiguous: It could be either one that produces an identifier, or one that produces a typedef
'ed type; knowing which one relies on a lookup table (context) beyond the grammar itself.
Macro- and template-based languages like Lisp, C++, Template Haskell, Nim, and so on. Since the syntax changes as it is being parsed, one solution is to make the parser into a self-modifying program. See also Is C++ context-free or context-sensitive?
Often, operator precedence and associativity are not expressed directly in CFGs even though it is possible. For example, a CFG for a small expression grammar where ^ binds tighter than ×, and × binds tighter than +, might look like this:
E → E ^ E
E → E × E
E → E + E
E → (E)
E → num
This CFG is ambiguous, however, and is often accompanied by a precedence / associativity table saying e.g. that ^ binds tightest, × binds tighter than +, that ^ is right-associative, and that × and + are left-associative.
Precedence and associativity can be encoded into a CFG in a mechanical way such that it is unambiguous and only produces syntax trees where the operators behave correctly. An example of this for the grammar above:
E₀ → EA E₁
EA → E₁ + EA
EA → ε
E₁ → EM E₂
EM → E₂ × EM
EM → ε
E₂ → E₃ EP
EP → ^ E₃ EP
E₃ → num
E₃ → (E₀)
But ambiguous CFGs + precedence / associativity tables are common because they're more readable and because various types of LR parser generator libraries can produce more efficient parsers by eliminating shift/reduce conflicts instead of dealing with an unambiguous, transformed grammar of a larger size.
理论上,所有有限字符串集合都是正则语言,因此所有有限大小的合法程序都是正则语言。由于正则语言是上下文无关语言的子集,因而所有有限大小的程序都是上下文无关语言。这个论点继续说:
虽然可以认为允许只有不到一百万行程序的语言是可以接受的限制,但将编程语言描述为正则语言并不实际:描述会太大。
— Torben Morgensen 的 编译器设计基础,第 2.10.2 章
同样也适用于上下文无关文法。为了不同地回答你的子问题,
哪些编程语言是由上下文无关文法定义的?
大多数实际编程语言是由它们的实现定义的,而大多数用于实际编程语言的解析器要么是手写的,要么使用扩展上下文无关文法的解析器生成器。不幸的是,很难找到你最喜欢的语言的确切上下文无关文法。当你找到时,通常是以巴科斯-诺尔范式(BNF)或一个解析器规范的形式,这个规范很可能不是纯粹的上下文无关文法。
来自实际应用的语法规范示例:
对于几乎所有的语言来说,语法正确的程序集合是上下文无关的。
然而,对于几乎所有的语言来说,可编译的程序集合并非上下文无关的。例如,如果所有可编译的C程序组成的集合是上下文无关的,那么与正则表达式(也称为regex)求交集之后得到的仍然是上下文无关的C程序集合。
^int main\(void\) { int a+; a+ = a+; return 0; }$
虽然这个语言看起来是无上下文限制的,但它明显同构于已知的不是无上下文限制的语言a^kba^kba^k。
根据你对问题的理解方式,答案会有所不同。但是在我看来,正确的答案是所有现代编程语言实际上都是上下文敏感的。例如,没有上下文无关语法只接受符合语法规则的C程序。那些指向yacc/bison上下文无关语法的人没有抓住重点。
VHDL有一定的上下文敏感性:
VHDL is context-sensitive in a mean way. Consider this statement inside a process:
jinx := foo(1);
Well, depending on the objects defined in the scope of the process (and its enclosing scopes), this can be either:
- A function call
- Indexing an array
- Indexing an array returned by a parameter-less function call
To parse this correctly, a parser has to carry a hierarchical symbol table (with enclosing scopes), and the current file isn't even enough.
foo
can be a function defined in a package. So the parser should first analyze the packages imported by the file it's parsing, and figure out the symbols defined in them.This is just an example. The VHDL type/subtype system is a similarly context-sensitive mess that's very difficult to parse.
(Eli Bendersky,“解析VHDL[非常]困难”,2009年)
大多数现代编程语言都不是上下文无关的语言。作为证明,如果我深入研究CFL的根源,它对应的PDA机器无法处理像{ww | w是字符串}
这样的字符串匹配。因此,大多数编程语言都需要这个。
例子:
int fa; // w
fa=1; // ww as parser treat it like this