How to get the source code corresponding to a Python AST node?

Question

23

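Python AST nodes have lineno and col_offset attributes, which indicate the start of the corresponding code range. Is there an easy way to also get the end of the code range? A third-party library?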
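
For reference, here is a minimal sketch of the start positions the question refers to. The sample source string is made up for illustration; lineno is 1-based and col_offset is 0-based.

```python
import ast

# Hypothetical sample source, used only for illustration.
source = "total = price * (1 + tax)\n"
tree = ast.parse(source)

# Collect the start position of every node that carries one.
# Context nodes such as Load/Store have no position and are skipped.
starts = [
    (type(node).__name__, node.lineno, node.col_offset)
    for node in ast.walk(tree)
    if hasattr(node, "lineno")
]

for name, lineno, col in starts:
    print(name, lineno, col)
```

Each entry gives only where a node begins, which is why the end of the range has to come from somewhere else.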

- Aivar

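I also need a way to annotate nodes with end-offset information (like your solution), with support for Python 2. I'm considering creating a standalone module for this. Would there be interest? @Aivar, are you happy with your approach? - DS.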

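@DS I'm not happy with my solution, because it is still incomplete and occasionally buggy. But I haven't seen anything better. An alternative would be to write a new parser that collects more information, but I'm not ready to do that myself. A separate package would be great; several projects could benefit from it. See this person's idea: https://dev59.com/J1kR5IYBdhLWcg3w6RWo - Aivar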

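I'm experimenting with an approach that looks promising: it associates each node with tokens (from the tokenize module). Could you share some examples that cause trouble? - DS.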

1

It's here: https://github.com/gristlabs/asttokens, but I've also added a separate answer with an example. - DS.

Marking it as a duplicate because it has more recent answers: "Convert AST node to Python code" - ti7


4 Answers

11

The ast.get_source_segment function was added in Python 3.8:

import ast

code = """
if 1 == 1 and 2 == 2 and 3 == 3:
     test = 1
"""
node = ast.parse(code)
ast.get_source_segment(code, node.body[0])

Produces: if 1 == 1 and 2 == 2 and 3 == 3:\n     test = 1

Thanks to Blane's answer at https://dev59.com/qVkS5IYBdhLWcg3wUVMG#62624882.
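
Python 3.8 also added end_lineno and end_col_offset attributes to positioned nodes, so the text can be sliced by hand as well. Below is a rough sketch under that assumption. The helper name slice_node is hypothetical, the column offsets are treated as UTF-8 byte offsets (as the ast documentation describes them), and unusual line separators such as form feeds are not handled.

```python
import ast

def slice_node(source, node):
    """Return the text covered by node, using the Python 3.8+ end attributes."""
    # keepends=True so that joining the pieces reproduces the original text.
    lines = source.splitlines(keepends=True)
    # lineno/end_lineno are 1-based; column offsets count UTF-8 bytes.
    selected = [line.encode("utf-8")
                for line in lines[node.lineno - 1:node.end_lineno]]
    # Trim the end first, so the start trim stays valid for one-line nodes.
    selected[-1] = selected[-1][:node.end_col_offset]
    selected[0] = selected[0][node.col_offset:]
    return b"".join(selected).decode("utf-8")

code = """
if 1 == 1 and 2 == 2 and 3 == 3:
     test = 1
"""
tree = ast.parse(code)
if_node = tree.body[0]
print(slice_node(code, if_node))
```

For this input the hand-rolled slice should agree with ast.get_source_segment, which remains the safer choice in real code.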

- Peter K

10

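We had a similar need, and I created the asttokens library for this purpose. It maintains the source code in both text and tokenized form, and marks AST nodes with token information, which makes it easy to get the corresponding text as well.

It works with Python 2 and 3 (tested with 2.7 and 3.5). For example: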

import ast, asttokens
st='''
def greet(a):
  say("hello") if a else say("bye")
'''
atok = asttokens.ASTTokens(st, parse=True)
for node in ast.walk(atok.tree):
  if hasattr(node, 'lineno'):
    print(atok.get_text_range(node), node.__class__.__name__, atok.get_text(node))

This prints:

(1, 50) FunctionDef def greet(a):
  say("hello") if a else say("bye")
(17, 50) Expr say("hello") if a else say("bye")
(11, 12) Name a
(17, 50) IfExp say("hello") if a else say("bye")
(33, 34) Name a
(17, 29) Call say("hello")
(40, 50) Call say("bye")
(17, 20) Name say
(21, 28) Str "hello"
(40, 43) Name say
(44, 49) Str "bye"
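
The token positions that this kind of approach builds on can be inspected with the standard tokenize module alone. The sketch below is not how asttokens is implemented; it only illustrates that every token carries both a start and an end (row, column) pair.

```python
import io
import tokenize

source = 'say("hello") if a else say("bye")\n'

# generate_tokens takes a readline callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Skip purely structural tokens so that only visible text is listed.
skipped = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER)
visible = [
    (tok.string, tok.start, tok.end)
    for tok in tokens
    if tok.type not in skipped
]

for text, start, end in visible:
    print(repr(text), start, end)
```

Matching the first and last token of each node against positions like these is the basic idea behind attaching end offsets to an AST.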

- DS.

Quick and easy to install. This is a great library. - Glycerine

1

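Hi, I know this is very late, but I think this is what you are looking for. I only parse the function definitions in a module. With this approach we can get the first and last line of an AST node, and then obtain the source lines of a function definition by reading only the required lines from the source. This is a very simple example.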

import ast

st = 'def foo():\n    print("hello")\n\ndef bla():\n    a = 1\n    b = 2\n    c = a + b\n    print(c)'

tree = ast.parse(st)
source_lines = st.split("\n")
for function in tree.body:
    if isinstance(function, ast.FunctionDef):
        # If the definition ends with a loop or an if, descend into its body
        # to find the statement that is really last
        last_body = function.body[-1]
        while isinstance(last_body, (ast.For, ast.While, ast.If)):
            last_body = last_body.body[-1]
        last_line = last_body.lineno
        print("Name of the function is", function.name)
        print("First line of the function is", function.lineno)
        print("Last line of the function is", last_line)
        print("The source lines are")
        # Line numbers reported by ast are 1-based
        for i, line in enumerate(source_lines, 1):
            if function.lineno <= i <= last_line:
                print(line)
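
On Python 3.8 or newer, the last line no longer has to be inferred from the final statement, because statement nodes carry end_lineno. A minimal sketch under that assumption (function_line_ranges is a hypothetical helper name):

```python
import ast

def function_line_ranges(source):
    """Map each top-level function name to its (first_line, last_line), 1-based."""
    tree = ast.parse(source)
    return {
        node.name: (node.lineno, node.end_lineno)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }

source = ('def foo():\n    print("hello")\n\n'
          'def bla():\n    a = 1\n    b = 2\n    c = a + b\n    print(c)')
ranges = function_line_ranges(source)
lines = source.split("\n")

for name, (first, last) in ranges.items():
    print(name, first, last)
    # Slice is 0-based and end-exclusive, hence first - 1 and last.
    print("\n".join(lines[first - 1:last]))
```

This also covers functions whose last statement is a multi-line expression, which a lineno-only approach would cut short.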

- Sujay Narayanan

Thanks! Unfortunately, this doesn't help me. I need the positions of all nodes, and not just line numbers but column numbers too. - Aivar

Oh, I'm really sorry then. I'll look into it and keep you posted. - Sujay Narayanan

Exactly! When an arbitrary expression or statement node is selected in the AST, I want to highlight the corresponding code range. - Aivar

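I thought you only needed the first and last line of the node. I'm doing something similar myself. If it helps, please check the edit to my answer. - Sujay Narayanan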


- Aivar · Accepted Answer

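EDIT: The latest code (tested in Python 3.5-3.7) is here: https://bitbucket.org/plas/thonny/src/master/thonny/ast_utils.py

As I didn't find an easy way, here is a hard (and probably not optimal) way. It may crash and/or work incorrectly if the Python parser has more lineno/col_offset bugs than those mentioned (and worked around) in the code. Tested in Python 3.3.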

import _ast
import ast
import io
import keyword
import token
import tokenize

def mark_code_ranges(node, source):
    """
    Node is an AST, source is corresponding source as string.
    Function adds recursively attributes end_lineno and end_col_offset to each node
    which has attributes lineno and col_offset.
    """

    NON_VALUE_KEYWORDS = set(keyword.kwlist) - {'False', 'True', 'None'}


    def _get_ordered_child_nodes(node):
        if isinstance(node, ast.Dict):
            children = []
            for i in range(len(node.keys)):
                children.append(node.keys[i])
                children.append(node.values[i])
            return children
        elif isinstance(node, ast.Call):
            children = [node.func] + node.args

            for kw in node.keywords:
                children.append(kw.value)

            if node.starargs is not None:
                children.append(node.starargs)
            if node.kwargs is not None:
                children.append(node.kwargs)

            children.sort(key=lambda x: (x.lineno, x.col_offset))
            return children
        else:
            return ast.iter_child_nodes(node)    

    def _fix_triple_quote_positions(root, all_tokens):
        """
        http://bugs.python.org/issue18370
        """
        string_tokens = list(filter(lambda tok: tok.type == token.STRING, all_tokens))

        def _fix_str_nodes(node):
            if isinstance(node, ast.Str):
                tok = string_tokens.pop(0)
                node.lineno, node.col_offset = tok.start

            for child in _get_ordered_child_nodes(node):
                _fix_str_nodes(child)

        _fix_str_nodes(root)

        # fix their erroneous Expr parents   
        for node in ast.walk(root):
            if ((isinstance(node, ast.Expr) or isinstance(node, ast.Attribute))
                and isinstance(node.value, ast.Str)):
                node.lineno, node.col_offset = node.value.lineno, node.value.col_offset

    def _fix_binop_positions(node):
        """
        http://bugs.python.org/issue18374
        """
        for child in ast.iter_child_nodes(node):
            _fix_binop_positions(child)

        if isinstance(node, ast.BinOp):
            node.lineno = node.left.lineno
            node.col_offset = node.left.col_offset


    def _extract_tokens(tokens, lineno, col_offset, end_lineno, end_col_offset):
        return list(filter((lambda tok: tok.start[0] >= lineno
                                   and (tok.start[1] >= col_offset or tok.start[0] > lineno)
                                   and tok.end[0] <= end_lineno
                                   and (tok.end[1] <= end_col_offset or tok.end[0] < end_lineno)
                                   and tok.string != ''),
                           tokens))



    def _mark_code_ranges_rec(node, tokens, prelim_end_lineno, prelim_end_col_offset):
        """
        Returns the earliest starting position found in given tree, 
        this is convenient for internal handling of the siblings
        """

        # set end markers to this node
        if "lineno" in node._attributes and "col_offset" in node._attributes:
            tokens = _extract_tokens(tokens, node.lineno, node.col_offset, prelim_end_lineno, prelim_end_col_offset)
            #tokens = 
            _set_real_end(node, tokens, prelim_end_lineno, prelim_end_col_offset)

        # mark its children, starting from last one
        # NB! need to sort children because eg. in dict literal all keys come first and then all values
        children = list(_get_ordered_child_nodes(node))
        for child in reversed(children):
            (prelim_end_lineno, prelim_end_col_offset) = \
                _mark_code_ranges_rec(child, tokens, prelim_end_lineno, prelim_end_col_offset)

        if "lineno" in node._attributes and "col_offset" in node._attributes:
            # new "front" is beginning of this node
            prelim_end_lineno = node.lineno
            prelim_end_col_offset = node.col_offset

        return (prelim_end_lineno, prelim_end_col_offset)

    def _strip_trailing_junk_from_expressions(tokens):
        while (tokens[-1].type not in (token.RBRACE, token.RPAR, token.RSQB,
                                      token.NAME, token.NUMBER, token.STRING, 
                                      token.ELLIPSIS)
                    and tokens[-1].string not in ")}]"
                    or tokens[-1].string in NON_VALUE_KEYWORDS):
            del tokens[-1]

    def _strip_trailing_extra_closers(tokens, remove_naked_comma):
        level = 0
        for i in range(len(tokens)):
            if tokens[i].string in "({[":
                level += 1
            elif tokens[i].string in ")}]":
                level -= 1

            if level == 0 and tokens[i].string == "," and remove_naked_comma:
                tokens[:] = tokens[0:i]
                return

            if level < 0:
                tokens[:] = tokens[0:i]
                return   

    def _set_real_end(node, tokens, prelim_end_lineno, prelim_end_col_offset):
        # prelim_end_lineno and prelim_end_col_offset are the start of 
        # next positioned node or end of source, ie. the suffix of given
        # range may contain keywords, commas and other stuff not belonging to current node

        # Function returns the list of tokens which cover all its children


        if isinstance(node, _ast.stmt):
            # remove empty trailing lines
            while (tokens[-1].type in (tokenize.NL, tokenize.COMMENT, token.NEWLINE, token.INDENT)
                   or tokens[-1].string in (":", "else", "elif", "finally", "except")):
                del tokens[-1]

        else:
            _strip_trailing_extra_closers(tokens, not isinstance(node, ast.Tuple))
            _strip_trailing_junk_from_expressions(tokens)

        # set the end markers of this node
        node.end_lineno = tokens[-1].end[0]
        node.end_col_offset = tokens[-1].end[1]

        # Try to peel off more tokens to give better estimate for children
        # Empty parens would confuse the children of no argument Call
        if ((isinstance(node, ast.Call)) 
            and not (node.args or node.keywords or node.starargs or node.kwargs)):
            assert tokens[-1].string == ')'
            del tokens[-1]
            _strip_trailing_junk_from_expressions(tokens)
        # attribute name would confuse the "value" of Attribute
        elif isinstance(node, ast.Attribute):
            if tokens[-1].type == token.NAME:
                del tokens[-1]
                _strip_trailing_junk_from_expressions(tokens)
            else:
                raise AssertionError("Expected token.NAME, got " + str(tokens[-1]))
                #import sys
                #print("Expected token.NAME, got " + str(tokens[-1]), file=sys.stderr)

        return tokens

    all_tokens = list(tokenize.tokenize(io.BytesIO(source.encode('utf-8')).readline))
    _fix_triple_quote_positions(node, all_tokens)
    _fix_binop_positions(node)
    source_lines = source.split("\n") 
    prelim_end_lineno = len(source_lines)
    prelim_end_col_offset = len(source_lines[len(source_lines)-1])
    _mark_code_ranges_rec(node, all_tokens, prelim_end_lineno, prelim_end_col_offset)
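
The token-window step used above (_extract_tokens) can be tried in isolation. The sketch below is a simplified stand-in, not the answer's code: tokens_in_range is a hypothetical name, and it relies on tuple comparison of (row, col) pairs, which expresses the same condition as the explicit row and column checks.

```python
import io
import tokenize

def tokens_in_range(tokens, start, end):
    """Keep non-empty tokens lying entirely within [start, end].

    start and end are (row, col) pairs; rows are 1-based, columns 0-based.
    """
    return [
        tok for tok in tokens
        if tok.string != "" and start <= tok.start and tok.end <= end
    ]

# Hypothetical sample source, used only for illustration.
source = "x = foo(a, b) + 1\n"
all_tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Window covering the call foo(a, b): columns 4 to 13 on line 1.
window = tokens_in_range(all_tokens, (1, 4), (1, 13))
print([tok.string for tok in window])
```

Narrowing the window from the end, child by child, is how the recursive function above arrives at each node's end position.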