Getting the source behind Clang's AST

Question

19

Given an AST object in clang, how can I get the code behind it? I tried editing the code from the tutorial and added the following:

clang::SourceLocation _b = d->getLocStart(), _e = d->getLocEnd();
const char *b = sourceManager->getCharacterData(_b),
           *e = sourceManager->getCharacterData(_e);
llvm::errs() << std::string(b, e-b) << "\n";

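Unfortunately, it does not print the whole typedef declaration, only about half of it! The same thing happens when printing an Expr.

How can I print and see the entire original string that makes up the declaration?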

- mikebloch

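I think the end source location points at the last token in the range (rather than past the end), so you will miss the last token. - bames53

@bames53 It looks like you are right! So how do I get that last token? - mikebloch

Apart from the third line, which should use "_e" rather than "_w", isn't the difference in the last line the wrong way round? (That is, "e-b" rather than "b-e".) - Konrad Rudolph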

4 Answers

7

The following code works for me.

std::string decl2str(clang::Decl *d, clang::SourceManager &sm) {
    // (T, U) => "T,,"
    std::string text = clang::Lexer::getSourceText(
        clang::CharSourceRange::getTokenRange(d->getSourceRange()), sm, clang::LangOptions(), nullptr).str();
    if (text.size() > 0 && (text.at(text.size()-1) == ',')) // the text can be ""
        return clang::Lexer::getSourceText(
            clang::CharSourceRange::getCharRange(d->getSourceRange()), sm, clang::LangOptions(), nullptr).str();
    return text;
}

- LucasWang

6

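As the comments on the answers point out, all the other answers seem to have flaws, so I am posting my own code, which seems to cover every flaw mentioned in the comments.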

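I believe getSourceRange() treats a statement as a sequence of tokens rather than a sequence of characters. Suppose we have a clang::Stmt corresponding to FOO + BAR, with the token FOO at character 1, the token + at character 5, and the token BAR at character 7. getSourceRange() then returns a SourceRange that essentially means "this code starts at the token at 1 and ends at the token at 7". So we have to use clang::Lexer::getLocForEndOfToken(stmt.getSourceRange().getEnd()) to get the character position at the end of the BAR token, and pass that as the end location to clang::Lexer::getSourceText. Otherwise clang::Lexer::getSourceText returns "FOO + " rather than the "FOO + BAR" we want.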
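To make the token-versus-character distinction concrete, here is a minimal toy model in plain C++ that does not use Clang at all. The names end_of_token, text_for_char_range and text_for_token_range are hypothetical stand-ins for what clang::Lexer::getLocForEndOfToken and clang::Lexer::getSourceText do conceptually, and the toy "lexer" only understands identifiers, numbers and single-character punctuation. Offsets are 0-based here, so the tokens of FOO + BAR start at 0, 4 and 6.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Toy stand-in for the idea behind Lexer::getLocForEndOfToken: given the
// offset of a token's first character, return the offset one past its last
// character. Assumption: a token is either a run of [A-Za-z0-9_] or a single
// punctuation character.
std::size_t end_of_token(const std::string &src, std::size_t tok_begin) {
    assert(tok_begin < src.size());
    auto is_word = [](unsigned char c) { return std::isalnum(c) || c == '_'; };
    if (!is_word(src[tok_begin]))
        return tok_begin + 1;  // single-character punctuation
    std::size_t end = tok_begin;
    while (end < src.size() && is_word(src[end]))
        ++end;
    return end;
}

// A "character range": `end` is already past-the-end, so slice directly.
std::string text_for_char_range(const std::string &src, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= src.size());
    return src.substr(begin, end - begin);
}

// A "token range": `last_tok` is the START of the last token, so the range
// must first be extended to the end of that token.
std::string text_for_token_range(const std::string &src, std::size_t begin, std::size_t last_tok) {
    return text_for_char_range(src, begin, end_of_token(src, last_tok));
}
```

With the string "FOO + BAR", text_for_char_range(src, 0, 6) gives "FOO + ", which is the truncated output described in the question, while text_for_token_range(src, 0, 6) gives "FOO + BAR".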

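I don't think my implementation has the problem @Steven Lu mentioned in the comments, because this code uses clang::Lexer::getSourceText, which according to Clang's source documentation is designed specifically for getting the source text of a range.

This implementation also takes @Ramin Halavati's remark into account; I tested it on some code and it did return the macro-expanded string.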

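Here is my implementation:

// Forward declaration; the definition follows get_source_text() below.
std::string get_source_text_raw(clang::SourceRange range, const clang::SourceManager& sm);

/**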
 * Gets the portion of the code that corresponds to given SourceRange, including the
 * last token. Returns expanded macros.
 * 
 * @see get_source_text_raw()
 */
std::string get_source_text(clang::SourceRange range, const clang::SourceManager& sm) {
    clang::LangOptions lo;

    // NOTE: sm.getSpellingLoc() used in case the range corresponds to a macro/preprocessed source.
    auto start_loc = sm.getSpellingLoc(range.getBegin());
    auto last_token_loc = sm.getSpellingLoc(range.getEnd());
    auto end_loc = clang::Lexer::getLocForEndOfToken(last_token_loc, 0, sm, lo);
    auto printable_range = clang::SourceRange{start_loc, end_loc};
    return get_source_text_raw(printable_range, sm);
}

/**
 * Gets the portion of the code that corresponds to given SourceRange exactly as
 * the range is given.
 *
 * @warning The end location of the SourceRange returned by some Clang functions 
 * (such as clang::Expr::getSourceRange) might actually point to the first character
 * (the "location") of the last token of the expression, rather than the character
 * past-the-end of the expression like clang::Lexer::getSourceText expects.
 * get_source_text_raw() does not take this into account. Use get_source_text()
 * instead if you want to get the source text including the last token.
 *
 * @warning This function does not obtain the source of a macro/preprocessor expansion.
 * Use get_source_text() for that.
 */
std::string get_source_text_raw(clang::SourceRange range, const clang::SourceManager& sm) {
    // getSourceText() returns an llvm::StringRef; convert it explicitly.
    return clang::Lexer::getSourceText(clang::CharSourceRange::getCharRange(range), sm, clang::LangOptions()).str();
}

- AnthonyD973

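Hi. I tried matching a ForStmt and used get_source_text(fs->getBody()->getSourceRange(), sm) to get the source of the loop body. I then found that if the loop body is a single statement (rather than a block), the terminating semicolon is missing. Any ideas? - undefined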
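One possible explanation, offered as an assumption rather than something stated in this thread: a statement's source range typically ends at its last token, and for an expression statement the terminating semicolon is not part of that range. A common workaround is to look at the token that follows the range and extend the range when it is a semicolon; in Clang, clang::Lexer::findNextToken can be used for that lookup. The sketch below models only the idea, in plain C++ on a string. extend_over_semicolon is a hypothetical helper, not a Clang function, and it ignores comments between the statement and the semicolon.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: given the past-the-end offset of a statement's text,
// skip whitespace and, if the next character is ';', return the offset just
// past that semicolon. Otherwise return `end` unchanged.
std::size_t extend_over_semicolon(const std::string &src, std::size_t end) {
    assert(end <= src.size());
    std::size_t i = end;
    while (i < src.size() && std::isspace(static_cast<unsigned char>(src[i])))
        ++i;
    return (i < src.size() && src[i] == ';') ? i + 1 : end;
}
```

For the text "for (;;) x = 1 ;", the body x = 1 occupies offsets 9 to 14 (past-the-end); extend_over_semicolon moves the end to 16 so the slice becomes "x = 1 ;". For a block body such as "for (;;) { }" the end is left where it is.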

1

Elazar's approach worked for me, but not when macros are involved. The following correction fixes that:

std::string decl2str(clang::Decl *d) {
    clang::SourceLocation b(d->getLocStart()), _e(d->getLocEnd());
    if (b.isMacroID())
        b = sm->getSpellingLoc(b);
    if (_e.isMacroID())
        _e = sm->getSpellingLoc(_e);
    clang::SourceLocation e(clang::Lexer::getLocForEndOfToken(_e, 0, *sm, lopt));
    return std::string(sm->getCharacterData(b),
        sm->getCharacterData(e)-sm->getCharacterData(b));
}

- Ramin Halavati

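I tried something similar for the expression on the left-hand side of a binary operator. If it contains a macro, I get the macro identifier rather than the token it expands to. Why is that? For example, my program defines the macro #define abc ab, followed by int ab; int main() { abc = 0; }. When I give this code as input, clang works on the preprocessed code, so abc should have been replaced by ab, but when I print the string the left-hand side is still abc. Why does this happen? - Jon marsh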

- Elazar Leibovich · Accepted Answer

19

Use the Lexer module:

clang::SourceManager *sm;
clang::LangOptions lopt;

std::string decl2str(clang::Decl *d) {
    clang::SourceLocation b(d->getLocStart()), _e(d->getLocEnd());
    clang::SourceLocation e(clang::Lexer::getLocForEndOfToken(_e, 0, *sm, lopt));
    return std::string(sm->getCharacterData(b),
        sm->getCharacterData(e)-sm->getCharacterData(b));
}

- Elazar Leibovich

1

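This is a very good answer, but when I try to use this function to print a block comment /** ... */, the output does not include the opening and closing markers.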

std::string fullComment2str(comments::FullComment* comment, clang::SourceManager *sm, clang::LangOptions lopt) {
    if (!comment) {
        return std::string();
    }
    clang::SourceLocation b(comment->getLocStart()), _e(comment->getLocEnd());
    clang::SourceLocation e(clang::Lexer::getLocForEndOfToken(_e, 0, *sm, lopt));
    return std::string(sm->getCharacterData(b),
        sm->getCharacterData(e)-sm->getCharacterData(b));
}

- Konrad Kleine

2

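Note that I did some testing with this approach, and sometimes, for very large amounts of code, the results of getCharacterData() may not be char pointers into the same buffer... I have had cases where the "end" pointer landed on the stack while the "begin" pointer pointed somewhere else on the heap... This crashes the tool or fills your terminal with garbage memory. - Steven Lu

@StevenLu Do you understand what is wrong with these approaches? How can I fix it? - Elazar Leibovich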

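To be honest, I'm not sure there is an easy way to do this, but I do have a stable way of getting the FileID and the offsets/line+col of the start and end locations (SourceManager methods), and since the Rewriter works fine, that is enough for my needs. I would need to look more closely at how to reliably grab the full string for large declarations. - Steven Lu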
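The offset-based approach described in the comment above can be sketched roughly as follows. In Clang, SourceManager::getDecomposedLoc yields a (FileID, offset) pair and SourceManager::getBufferData yields the text of a file's buffer. The toy model below replaces both with plain C++ types (Loc, Buffers and slice_between are hypothetical names, not Clang API) to show the key point: check that both ends live in the same buffer before slicing, instead of subtracting raw char pointers that may belong to different buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy stand-in for a decomposed source location: which buffer, and where in it.
struct Loc {
    int file_id;
    std::size_t offset;
};

// Toy stand-in for the source manager's buffers, keyed by file id.
using Buffers = std::map<int, std::string>;

// Write the text in [begin, end) to `out` and return true. Return false,
// leaving `out` untouched, if the two locations are in different buffers,
// the buffer is unknown, or the offsets are reversed or out of bounds.
bool slice_between(const Buffers &bufs, Loc begin, Loc end, std::string &out) {
    if (begin.file_id != end.file_id)
        return false;
    Buffers::const_iterator it = bufs.find(begin.file_id);
    if (it == bufs.end())
        return false;
    const std::string &text = it->second;
    if (begin.offset > end.offset || end.offset > text.size())
        return false;
    out = text.substr(begin.offset, end.offset - begin.offset);
    return true;
}
```

The end offset here is assumed to be past-the-end already, that is, the location of the last token extended to the end of that token. A caller that gets false back can skip the node or report it rather than read unrelated memory.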

@StevenLu I ran into the same problem and found that the solution @LucasWang provides below solves it perfectly. The key seems to be using the "Lexer" to assemble the "StringRef". - Michael Koval