Resources for lexing Python

Question

3

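As an educational exercise, I plan to write a Python lexer in Python. Eventually I would like to implement a simple subset of Python that can run itself, so I want this lexer to be written with as few imports as possible.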

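The tutorials I have found that involve lexing, such as Kaleidoscope, look ahead only a single character to decide which token should come next, but I am afraid that this is not sufficient for Python. For one thing, looking at a single character cannot tell a delimiter from an operator, or an identifier from a keyword. Handling indentation also looks like a whole new problem, and there are other issues besides.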
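One common way to handle the identifier/keyword and operator/delimiter concerns is to scan the longest run of characters first and classify it afterwards. The following is a minimal sketch that uses no imports; the names `KEYWORDS`, `OPERATORS`, and `scan` are hypothetical, and the token sets are deliberately tiny rather than the full Python lists.

```python
# Hypothetical, deliberately small token sets; a real lexer would list them all.
KEYWORDS = ('if', 'else', 'while', 'def', 'return')
OPERATORS = ('**', '==', '<=', '>=', '!=',
             '+', '-', '*', '/', '=', '<', '>', '(', ')', ':')


def is_ident_start(ch):
    return ch == '_' or 'a' <= ch <= 'z' or 'A' <= ch <= 'Z'


def is_digit(ch):
    return '0' <= ch <= '9'


def scan(text):
    """Return a list of (type, value) pairs for a single line of text."""
    tokens = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == ' ':
            i += 1
        elif is_ident_start(ch):
            start = i
            while i < n and (is_ident_start(text[i]) or is_digit(text[i])):
                i += 1
            word = text[start:i]
            # A keyword is just an identifier that appears in a lookup table,
            # so the decision is made after the whole word has been read.
            if word in KEYWORDS:
                tokens.append(('KEYWORD', word))
            else:
                tokens.append(('NAME', word))
        elif is_digit(ch):
            start = i
            while i < n and is_digit(text[i]):
                i += 1
            tokens.append(('NUMBER', text[start:i]))
        else:
            # Longest match: try a two-character operator before a one-character one.
            two = text[i:i + 2]
            if two in OPERATORS:
                tokens.append(('OP', two))
                i += 2
            elif ch in OPERATORS:
                tokens.append(('OP', ch))
                i += 1
            else:
                raise ValueError('unexpected character %r at %d' % (ch, i))
    return tokens
```

For example, `scan('iffy')` yields a single `NAME` token rather than the keyword `if` followed by `fy`, because classification happens only after the full word is consumed.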

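I found this link very helpful, but when I tried to implement it, my code quickly became rather ugly, with lots of if statements and special cases, and it did not seem like the "right" way to do it.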

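Are there any good resources that would help or teach me to lex this kind of code? (I would also like to parse it fully, but first things first.)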

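I am not opposed to using a parser generator, but I want the resulting Python code to use only a simple subset of Python and to be reasonably self-contained, so that I can at least dream of having a language that can interpret itself. For example, judging from the example I looked at, if I use PLY, my language would need to interpret the PLY package in order to interpret itself, which would make things more complicated.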

- math4tots

1

Lexers often need a lot of conditional checks. That is why you put them in the lexer, so that the if statements do not show up anywhere else in the code. - Emil Vikström

4 Answers

1

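I have used traditional flex/lex and bison/yacc for similar projects in the past. I have also used ply (python lex yacc), and I found the skills quite transferable from one to the other.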

So if you have never written a parser before, write your first one with ply, and you will learn skills that will be very useful in later projects.

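Once you have your ply parser working, you can write one by hand as an educational exercise. In my experience, hand-written lexers and parsers get messy very quickly, hence the success of parser generators!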

- Nick Craig-Wood

0

This regex-based lexer has served me well several times:

#-------------------------------------------------------------------------------
# lexer.py
#
# A generic regex-based Lexer/tokenizer tool.
# See the if __main__ section in the bottom for an example.
#
# Eli Bendersky (eliben@gmail.com)
# This code is in the public domain
# Last modified: August 2010
#-------------------------------------------------------------------------------
import re
import sys


class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position. 
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)


class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos


class Lexer(object):
    """ A simple regex-based lexer/tokenizer.

        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to 
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        # All the regexes are concatenated into a single one
        # with named groups. Since the group names must be valid
        # Python identifiers, but the token types used by the 
        # user are arbitrary strings, we auto-generate the group
        # names and map them to token types.
        #
        idx = 1
        regex_parts = []
        self.group_type = {}

        for regex, type in rules:
            groupname = 'GROUP%s' % idx
            regex_parts.append('(?P<%s>%s)' % (groupname, regex))
            self.group_type[groupname] = type
            idx += 1

        self.regex = re.compile('|'.join(regex_parts))
        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile(r'\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the 
            input buffer. None is returned if the end of the 
            buffer was reached. 
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf, self.pos)

                if m:
                    self.pos = m.start()
                else:
                    return None

            m = self.regex.match(self.buf, self.pos)
            if m:
                groupname = m.lastgroup
                tok_type = self.group_type[groupname]
                tok = Token(tok_type, m.group(groupname), self.pos)
                self.pos = m.end()
                return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while True:
            tok = self.token()
            if tok is None:
                break
            yield tok


if __name__ == '__main__':
    rules = [
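        # Raw strings keep the backslashes intact for the regex engine.
        (r'\d+',            'NUMBER'),
        # \w* (not \w+) so that single-character names such as "x" match.
        (r'[a-zA-Z_]\w*',   'IDENTIFIER'),
        (r'\+',             'PLUS'),
        (r'\-',             'MINUS'),
        (r'\*',             'MULTIPLY'),
        (r'\/',             'DIVIDE'),
        (r'\(',             'LP'),
        (r'\)',             'RP'),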
        ('=',               'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902)  ')

    try:
        for tok in lx.tokens():
            print(tok)
    except LexerError as err:
        print('LexerError at position %s' % err.pos)
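
The lexer above skips whitespace, which a Python lexer cannot do at the start of a line. One possible companion step, sketched here under stated assumptions and not part of the original lexer.py, is an indentation stack like the one described in the Python language reference. The function name `indent_tokens` and the tuple shapes are hypothetical, and the sketch assumes spaces only (no tabs) and lines without trailing newlines.

```python
def indent_tokens(lines):
    """Yield ('INDENT', width), ('DEDENT', width) and ('LINE', text) tuples.

    Assumes indentation uses spaces only and that each line has already
    had its trailing newline removed.
    """
    stack = [0]  # indentation widths of the currently open blocks
    for line in lines:
        stripped = line.lstrip(' ')
        if stripped == '' or stripped.startswith('#'):
            continue  # blank and comment-only lines do not affect indentation
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(width)
            yield ('INDENT', width)
        else:
            # Shallower: close blocks until the widths line up again.
            while width < stack[-1]:
                stack.pop()
                yield ('DEDENT', stack[-1])
            if width != stack[-1]:
                raise ValueError('dedent does not match any outer level')
        yield ('LINE', stripped)
    # Close any blocks still open at the end of the input.
    while len(stack) > 1:
        stack.pop()
        yield ('DEDENT', stack[-1])
```

Each `LINE` payload could then be handed to a regex lexer such as the one above, so that the indentation logic stays separate from the token rules.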

- Eli Bendersky

0

Consider taking a look at PyPy, a Python implementation written in Python. It obviously has a Python parser as well.

- ThiefMaster

1

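Their lexer is written as a state machine. It is not merely a state machine under the hood (as any sensible lexer is); they describe the tokens as state-machine-like data structures and generate a table-driven lexer from them. I am not sure that is a good starting point for a beginner. - user395760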
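
To illustrate what "table-driven" means in general, here is a minimal sketch in which the token shapes live in a data table and a small generic loop walks it. This is not PyPy's code; the names `char_class`, `TRANSITIONS`, `ACCEPTING`, and `table_lex` are hypothetical, and the table covers only names and integers.

```python
def char_class(ch):
    """Map a character to a coarse class used as a column in the table."""
    if ch.isalpha() or ch == '_':
        return 'letter'
    if ch.isdigit():
        return 'digit'
    return 'other'


# TRANSITIONS[state][character class] -> next state.
# A missing entry means the machine cannot advance on that character.
TRANSITIONS = {
    'start':  {'letter': 'name', 'digit': 'number'},
    'name':   {'letter': 'name', 'digit': 'name'},
    'number': {'digit': 'number'},
}

# States in which stopping produces a token, and the token type produced.
ACCEPTING = {'name': 'NAME', 'number': 'NUMBER'}


def table_lex(text):
    """Return (type, value) pairs by running the table over text."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] == ' ':
            i += 1
            continue
        state = 'start'
        j = i
        # Advance as long as the table has a transition for the next character.
        while j < len(text):
            nxt = TRANSITIONS[state].get(char_class(text[j]))
            if nxt is None:
                break
            state = nxt
            j += 1
        if state not in ACCEPTING:
            raise ValueError('cannot lex at position %d' % i)
        tokens.append((ACCEPTING[state], text[i:j]))
        i = j
    return tokens
```

Adding a new token kind in this style means adding rows to the table rather than new branches to the loop, which is the property the comment above describes.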

- Sergey Miryanov · Accepted Answer

Take a look at http://pyparsing.wikispaces.com/; it may be useful for your task.