哪些编程语言是无上下文的？

Question

哪些编程语言是无上下文的？

compiler-theorycontext-free-grammar

89

或者更准确地说：哪些编程语言是由上下文无关文法定义的？

据我所知，C++不是上下文无关的，因为它包含诸如宏和模板等内容。我的直觉告诉我，函数式语言可能是上下文无关的，但我没有任何硬数据来支持这一点。

对于简洁的示例给予额外的奖励 :-)

- n3rd

9个回答

49

对于几乎所有的语言来说，语法正确的程序集合是上下文无关的。

然而，对于几乎所有的语言来说，可编译的程序集合并非上下文无关的。例如，如果所有可编译的C程序组成的集合是上下文无关的，那么与正则表达式（也称为regex）求交集之后得到的仍然是上下文无关的C程序集合。

^int main\(void\) { int a+; a+ = a+; return 0; }$

虽然这个语言看起来是无上下文限制的，但它明显同构于已知的不是无上下文限制的语言a^kba^kba^k。

- Dave

2

+1 对于正确答案。被接受的答案是误导性的。 - Derrick Turk

2

如果有人想要证明语言 { a^k b a^k b a^k b | k >= 0 } 不是正则的，可以参考Torben Mogensen的《编译器设计基础》第2.10.2章节：http://www.diku.dk/~torbenm/Basics。 - sshine

7

“合乎语法”和“合乎语义”的区别有些误导人，因为对于编程语言中感兴趣的情况，“合乎语义”通过上下文敏感文法在语法上表达。说一个程序是“合乎语法”的，仅因为上下文无关文法可以接受它，是错误的，因为有效程序的语言需要限制仅可由上下文无关文法接受的字符串集，这意味着它们不能用上下文无关文法来描述。无论如何，支持+1。 - Nikos M.

2

第一句话没有太多意义。什么时候程序在语法上是正确的？通常情况下，当编译器能够接受它而不出现“语法错误”，并且能够生成代码时，程序就是语法上正确的。你正确地指出了大多数语言在这个意义上并不是无上下文的。另一方面，如果你将“语法上正确”定义为“被编译器的解析过程所接受”，那么这句话就是同义反复了，因为解析器使用的是无上下文文法，从而接受编程语言的无上下文超集。 - gernot

1

请查看https://cs.stackexchange.com/a/140082/60775以获取更好的答案。 - gernot

8

根据你对问题的理解方式，答案会有所不同。但是在我看来，正确的答案是所有现代编程语言实际上都是上下文敏感的。例如，没有上下文无关语法只接受符合语法规则的C程序。那些指向yacc/bison上下文无关语法的人没有抓住重点。

- starflyer

6

举一个最具戏剧性的非上下文无关语法的例子，Perl的语法是图灵完备的，据我所知。（参考链接）

- Devin Jeanpierre

3

如果我理解你的问题，你正在寻找可以通过上下文无关文法（CFG）描述的编程语言，以便CFG生成所有有效的程序并且仅生成有效的程序。

我相信大多数（如果不是全部）现代编程语言都不是上下文无关的。例如，一旦你有了用户定义的类型（在现代语言中非常常见），你就自动变成了上下文相关的语言。

验证程序语法和验证程序语义正确性之间存在差别。检查语法是上下文无关的，而检查语义正确性则不是（在大多数语言中）。

然而，这并不意味着这样的语言不存在。例如，未类型化的λ演算可以使用上下文无关文法来描述，并且当然是图灵完备的。

- Ginandi

2

VHDL有一定的上下文敏感性：

VHDL is context-sensitive in a mean way. Consider this statement inside a process:
jinx := foo(1);
Well, depending on the objects defined in the scope of the process (and its enclosing scopes), this can be either:

A function call

Indexing an array

Indexing an array returned by a parameter-less function call

To parse this correctly, a parser has to carry a hierarchical symbol table (with enclosing scopes), and the current file isn't even enough. foo can be a function defined in a package. So the parser should first analyze the packages imported by the file it's parsing, and figure out the symbols defined in them.

This is just an example. The VHDL type/subtype system is a similarly context-sensitive mess that's very difficult to parse.

(Eli Bendersky，“解析VHDL[非常]困难”，2009年)

- Graham Gimbert

1

我是一个新用户，不允许发布链接。 - Graham Gimbert

@Graham：不，您完全可以。新用户在发布链接方面没有限制。 - Konrad Rudolph

2

大多数现代编程语言都不是上下文无关的语言。作为证明，如果我深入研究CFL的根源，它对应的PDA机器无法处理像{ww | w是字符串}这样的字符串匹配。因此，大多数编程语言都需要这个。

例子：

int fa; // w
fa=1; // ww as parser treat it like this

- P.R.

1

让我们以Swift为例，用户可以定义运算符，包括运算符优先级和结合性。例如，操作符+和*实际上是在标准库中定义的。

上下文无关文法和词法分析器可能能够解析a+b-c*d+e，但语义是“五个操作数a、b、c、d和e，由操作符+、-、*和+分隔”。这就是一个不知道运算符的解析器可以实现的。上下文无关文法和词法分析器也可能能够解析a+-+b-+-c，其中有三个操作数a、b和c，由操作符+-+和-+-分隔。

解析器可以根据上下文无关的Swift文法“解析”源文件，但这远远不能完成工作。另一步将收集有关运算符的知识，然后将a+b-c*d+e的语义更改为与operator+(operator-(operator+(a,b),operator*(c,d)),e)相同。

因此，存在（或者可能存在，我没有仔细检查）上下文无关文法，但它只能帮助你解析程序到一定程度。

- gnasher729

-1

我认为Haskell和ML都支持上下文无关文法。请参考这个链接了解Haskell的相关信息。

- RoboAlex

1

没有办法说 Haskell 是上下文无关的，因为你可以定义具有自定义优先级和结合性的自定义运算符。你必须将 Haskell 剥离到非常有限的子集才能使其成为上下文无关的。 - Beefster

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- sshine · Accepted Answer

哪些编程语言是无上下文的？[...]

我的直觉告诉我，函数式语言可能是无上下文的 [...]

简短版： 实际上几乎没有任何真实世界中的编程语言是以任何意义上的无上下文的。一个语言是否是无上下文与它是否是函数式没有任何关系，这只是语法复杂程度的问题。

这里有一个针对 Brainfuck 的命令式语言的 CFG:

Program → Instr Program | ε
Instr → '+' | '-' | '>' | '<' | ',' | '.' | '[' Program ']'

这里是SKI组合子演算的一个函数式CFG：

Program → E
E → 'S' E E E
E → 'K' E E
E → 'I'
E → '(' E ')'

这些CFG可以识别两种语言的所有有效程序，因为它们非常简单。

更详细的版本：通常，无上下文文法（CFG）仅用于粗略指定语言的语法。必须区分语法正确的程序和编译/评估正确的程序。最常见的是，编译器将语言分析分为语法分析，该分析构建并验证代码的一般结构，以及语义分析，该分析验证程序的含义。

如果您所说的“无上下文语言”是指“...所有程序都能编译”，那么答案是：几乎没有。符合此要求的语言几乎没有规则或复杂功能，例如变量的存在，空格敏感性，类型系统或任何其他上下文：在一个地方定义并在另一个地方依赖的信息。

另一方面，如果“无上下文语言”仅意味着“...所有程序都通过语法分析”，那么答案就取决于语法本身的复杂程度。有许多语法特征很难或不可能仅使用CFG描述。其中一些通过向解析器添加附加状态来克服，以跟踪计数器，查找表等。

无法用CFG表达的句法特征示例：

Indentation- and whitespace-sensitive languages like Python and Haskell. Keeping track of arbitrarily nested indentation levels is essentially context-sensitive and requires separate counters for the indentation level; both how many spaces that are used for each level and how many levels there are.

Allowing only a fixed level of indentation using a fixed amount of spaces would work by duplicating the grammar for each level of indentation, but in practice this is inconvenient.
The C Typedef Parsing Problem says that C programs are ambiguous during lexical analysis because it cannot know from the grammar alone if something is a regular identifier or a typedef alias for an existing type.

The example is:
```
  typedef int my_int;
  my_int x;
```
At the semicolon, the type environment needs to be updated with an entry for my_int. But if the lexer has already looked ahead to my_int, it will have lexed it as an identifier rather than a type name.

In context-free grammar terms, the X → ... rule that would trigger on my_int is ambiguous: It could be either one that produces an identifier, or one that produces a typedef'ed type; knowing which one relies on a lookup table (context) beyond the grammar itself.
Macro- and template-based languages like Lisp, C++, Template Haskell, Nim, and so on. Since the syntax changes as it is being parsed, one solution is to make the parser into a self-modifying program. See also Is C++ context-free or context-sensitive?
Often, operator precedence and associativity are not expressed directly in CFGs even though it is possible. For example, a CFG for a small expression grammar where ^ binds tighter than ×, and × binds tighter than +, might look like this:
```
  E → E ^ E
  E → E × E
  E → E + E
  E → (E)
  E → num
```
This CFG is ambiguous, however, and is often accompanied by a precedence / associativity table saying e.g. that ^ binds tightest, × binds tighter than +, that ^ is right-associative, and that × and + are left-associative.

Precedence and associativity can be encoded into a CFG in a mechanical way such that it is unambiguous and only produces syntax trees where the operators behave correctly. An example of this for the grammar above:
```
  E₀ → EA E₁
  EA → E₁ + EA
  EA → ε
  E₁ → EM E₂
  EM → E₂ × EM
  EM → ε
  E₂ → E₃ EP
  EP → ^ E₃ EP
  E₃ → num
  E₃ → (E₀)
```
But ambiguous CFGs + precedence / associativity tables are common because they're more readable and because various types of LR parser generator libraries can produce more efficient parsers by eliminating shift/reduce conflicts instead of dealing with an unambiguous, transformed grammar of a larger size.

理论上，所有有限字符串集合都是正则语言，因此所有有限大小的合法程序都是正则语言。由于正则语言是上下文无关语言的子集，因而所有有限大小的程序都是上下文无关语言。这个论点继续说：

虽然可以认为允许只有不到一百万行程序的语言是可以接受的限制，但将编程语言描述为正则语言并不实际：描述会太大。
— Torben Morgensen 的编译器设计基础，第 2.10.2 章

同样也适用于上下文无关文法。为了不同地回答你的子问题，

哪些编程语言是由上下文无关文法定义的？

大多数实际编程语言是由它们的实现定义的，而大多数用于实际编程语言的解析器要么是手写的，要么使用扩展上下文无关文法的解析器生成器。不幸的是，很难找到你最喜欢的语言的确切上下文无关文法。当你找到时，通常是以巴科斯-诺尔范式（BNF）或一个解析器规范的形式，这个规范很可能不是纯粹的上下文无关文法。

来自实际应用的语法规范示例：