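In short:
How do I model grammar productions in a program so that a single left-hand side can have an infinite number of productions?
I am studying formal language theory and am trying to write a class for building regular grammar objects that can be passed to a finite-state machine. My naive attempt was to create an API that adds one production for each allowed input. Below is a simplified version (based on the formal grammar definition G = (N, Σ, P, S)):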
class ContextFreeGrammar:
    def __init__(self, variables, alphabet, production_rules, start_variable):
        self.variables = variables
        self.alphabet = alphabet
        self.production_rules = production_rules
        self.start_variable = start_variable

    def __repr__(self):
        return '{}({}, {}, {}, {})'.format(
            self.__class__.__name__,
            self.variables,
            self.alphabet,
            self.production_rules,
            self.start_variable
        )


class RegularGrammar(ContextFreeGrammar):
    _regular_expression_grammar = None  # TODO

    @classmethod
    def from_regular_expression(cls, regular_expression):
        raise NotImplementedError()
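I have not yet started writing the finite-state automata or pushdown automata.
The grammar for regular expressions is context-free, so I have included my definition in Wirth syntax notation (WSN):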
syntax = expression .
expression = term "|" expression .
expression = term .
term = factor repetition term .
term = factor term .
term = .
repetition = "*" .
repetition = "+" .
repetition = "?" .
repetition = "{" nonnegative_integer "," nonnegative_integer "}" .
repetition = "{" nonnegative_integer ",}" .
repetition = "{," nonnegative_integer "}" .
nonnegative_integer = nonzero_arabic_numeral arabic_numerals .
nonnegative_integer = arabic_numeral .
nonzero_arabic_numeral = "1" .
nonzero_arabic_numeral = "2" .
nonzero_arabic_numeral = "3" .
nonzero_arabic_numeral = "4" .
nonzero_arabic_numeral = "5" .
nonzero_arabic_numeral = "6" .
nonzero_arabic_numeral = "7" .
nonzero_arabic_numeral = "8" .
nonzero_arabic_numeral = "9" .
arabic_numeral = nonzero_arabic_numeral .
arabic_numeral = "0" .
arabic_numerals = arabic_numeral .
arabic_numerals = arabic_numeral arabic_numerals .
factor = "(" expression ")" .
factor = character_class .
factor = character .
escaped_character = "\\." .
escaped_character = "\\(" .
escaped_character = "\\)" .
escaped_character = "\\+" .
escaped_character = "\\*" .
escaped_character = "\\?" .
escaped_character = "\\[" .
escaped_character = "\\]" .
escaped_character = "\\\\" .
escaped_character = "\\{" .
escaped_character = "\\}" .
escaped_character = "\\|" .
character = TODO .
character_class = TODO .
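As you can see, I have explicitly split alternatives into separate productions. I did this for ease of implementation. However, I do not know how to handle things like character classes. I wanted production_rules to be a mapping from each left-hand side to the set of its corresponding right-hand sides, but that now looks infeasible.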
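To make the intended data structure concrete, here is a minimal sketch of the mapping I had in mind. The helper name build_production_rules is hypothetical, and the sketch assumes that terminals and nonterminals are both plain strings and that each right-hand side is stored as a tuple so that it can live in a set.

```python
from collections import defaultdict


def build_production_rules(productions):
    """Group (left, right) pairs into a mapping: left -> set of right-hand sides.

    Hypothetical helper for illustration; not part of the classes above.
    """
    rules = defaultdict(set)
    for left, right in productions:
        # Tuples are hashable, so each right-hand side can be a set member.
        rules[left].add(tuple(right))
    return dict(rules)


# A few productions copied from the WSN listing.
productions = [
    ("repetition", ["*"]),
    ("repetition", ["+"]),
    ("repetition", ["?"]),
    ("expression", ["term", "|", "expression"]),
    ("expression", ["term"]),
    ("term", []),  # the empty production
]
rules = build_production_rules(productions)
```

This works while every left-hand side has a small, finite set of alternatives. It stops being practical for character, where listing one right-hand side per allowed character would mean one entry per code point.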
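For the "." wildcard, I know that it can be any possible character. But if I assume that I am working with Unicode, that is a lot of possible characters: Unicode 7.0 contains 112,956 characters. To accommodate characters that require multiple code points, I think I am going to drop ranges in character classes, which makes the problem a bit easier. I may subclass set (or something similar) once for normal character classes and once for negated character classes, and turn the period into an empty negated character class. - Tyler Crompton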
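A rough sketch of that idea follows. It is only one possible shape, not a settled design: the class names and the matches method are hypothetical, and it subclasses frozenset rather than set (my assumption) so that instances stay hashable and can appear inside a right-hand side. Equality is overridden because plain frozenset comparison ignores the subclass, which would make a class and its negation compare equal.

```python
class _CharacterSet(frozenset):
    """Shared base: equality and hashing also take the concrete type into account."""

    def __eq__(self, other):
        return type(self) is type(other) and frozenset.__eq__(self, other)

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        return hash((type(self), frozenset.__hash__(self)))


class CharacterClass(_CharacterSet):
    """A finite set of characters; matches any character it contains."""

    def matches(self, character):
        return character in self


class NegatedCharacterClass(_CharacterSet):
    """Stores the excluded characters; matches any character not in the set."""

    def matches(self, character):
        return character not in self


# The "." wildcard as an empty negated class: nothing is excluded.
WILDCARD = NegatedCharacterClass()
```

With this representation a single object stands in for an arbitrarily large set of terminals, so no enumeration of Unicode characters is needed. For example, CharacterClass("abc").matches("a") is true, and WILDCARD.matches(c) is true for any character c.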