How to make 'git diff' ignore comments

Question

38

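I'm trying to produce a list of the files that were changed in a particular commit. The problem is that every file has the version number in a comment at the top of the file, and since this commit introduces a new version, that means every file has changed.
I don't care about the changed comments, so I would like git diff to ignore all lines that match ^\s*\*.*$, since these are all comments (part of /* */). I can't find any way to tell git diff to ignore specific lines. I have already tried setting a textconv attribute so that Git passes the files through sed before diffing them, letting sed strip out the offending lines. However, git diff --name-status doesn't actually diff the files; it only compares the hashes, and of course all the hashes have changed. Is there a way to do this?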

- Benubird

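Just a guess... have you tried git diff --name-status --textconv? Or git diff --name-only? - rodrigo

Yes, I'm using --name-only, but it returns every file (as I said), because every file has a changed comment. --textconv doesn't work because, as I also mentioned in the post, Git ignores it when it isn't generating a full diff. - Benubird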

1

Possible duplicate of Ignoring changes matching a string in git diff. - richvdh

1

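@richvdh I think the two questions are similar enough to count as duplicates, but they have different correct answers, and this one has attracted other suggestions that the other question doesn't mention, so I think it's worth keeping both. - Benubird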

1

Related: Git 2.30 (Q1 2021) will introduce git diff -I<regex>. - VonC

8 Answers

15

git diff -G <regex>

and specify a regular expression that does not match your version number line.

- riezebosch

11

I found it easiest to launch an external diff tool with git difftool:

git difftool -y -x "diff -I '<regex>'"
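
To make the effect of -I concrete, here is a minimal Python sketch of the same idea: a change is ignored only when every inserted and deleted line in it matches the regex. The helper name has_relevant_change and the sample inputs are hypothetical, and Python's re module stands in for whatever regex engine diff uses, so treat it as an illustration rather than a drop-in replacement.

```python
import difflib
import re


def has_relevant_change(old_text, new_text, ignore_regex):
    """Return True if any changed line does NOT match ignore_regex."""
    pattern = re.compile(ignore_regex)
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changed = old_lines[i1:i2] + new_lines[j1:j2]
        # A change block is ignored only if all of its lines match the regex.
        if not all(pattern.search(line) for line in changed):
            return True
    return False


# Hypothetical file contents: only the version comment differs in the first pair.
old = "/*\n * Version 1.0\n */\nint x = 1;\n"
comment_only = "/*\n * Version 1.1\n */\nint x = 1;\n"
code_change = "/*\n * Version 1.1\n */\nint x = 2;\n"

print(has_relevant_change(old, comment_only, r"^\s*\*"))  # False
print(has_relevant_change(old, code_change, r"^\s*\*"))   # True
```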

- richvdh

Given the limited regex support in Git, this is a good option, thanks! - undefined

4

I found a solution. I can use this command:

git diff --numstat --minimal <commit> <commit> | sed '/^[1-]\s\+[1-]\s\+.*/d'

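to show the files that have more than one changed line between the commits, which excludes the files whose only change was the version number in the comment.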
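
The same filter can be sketched in Python, assuming the usual tab-separated --numstat layout (added, deleted, path). The function name files_with_real_changes and the sample output are made up for illustration.

```python
def files_with_real_changes(numstat_output):
    """Yield paths from `git diff --numstat` output, skipping 1-added/1-deleted files."""
    for line in numstat_output.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        # numstat prints "-" for binary files; the sed pattern drops those as well.
        if added in ("1", "-") and deleted in ("1", "-"):
            continue
        yield path


# Hypothetical numstat output for three files.
sample = "1\t1\tsrc/a.c\n12\t3\tsrc/b.c\n-\t-\timg/logo.png\n"
print(list(files_with_real_changes(sample)))  # ['src/b.c']
```

Note that this heuristic, like the sed version, also hides a file whose only change is a single modified line of real code.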

- Benubird

2

Use 'grep' on the 'git diff' output to look for specific content,

git diff -w | grep -c -E "(^[+-]\s*(\/)?\*)|(^[+-]\s*\/\/)"

Only the changes to comment lines are counted. (A)

Using the 'git diff --stat' output,

git diff -w --stat

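all line changes can be counted. (B)

To get the number of non-comment source line (NCSL) changes, subtract (A) from (B).

Explanation:

In the 'git diff' output (with whitespace changes ignored),

Look for lines that start with '+' or '-', which mark modified lines.
These can be followed by optional whitespace, '\s*'.
Then look for the comment-line pattern '/*' (or) '*' (or) '//'.
Because the '-c' option is given, only the count is printed. Remove '-c' to see the comment differences themselves.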
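
The (B) minus (A) arithmetic can be sketched in Python as well. This is a rough illustration under the same C-style assumptions; count_changes and the sample diff are hypothetical, and the header check is deliberately simplistic (a removed line whose content starts with "--" would be mistaken for a file header).

```python
import re

# Same idea as the grep pattern: '+' or '-', optional whitespace, then '/*', '*' or '//'.
COMMENT_RE = re.compile(r"^[+-]\s*(/?\*|//)")


def count_changes(diff_text):
    """Return (all changed lines, changed comment lines, non-comment changed lines)."""
    total = comment = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not content changes
        if line.startswith(("+", "-")):
            total += 1
            if COMMENT_RE.match(line):
                comment += 1
    return total, comment, total - comment


# Hypothetical diff: one comment line and one code line were modified.
sample_diff = "\n".join([
    "--- a/foo.c",
    "+++ b/foo.c",
    "@@ -1,4 +1,4 @@",
    " /*",
    "- * Version 1.0",
    "+ * Version 1.1",
    "  */",
    "-int x = 1;",
    "+int x = 2;",
])
print(count_changes(sample_diff))  # (4, 2, 2)
```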

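Note: the comment-line count may contain minor errors because of the assumptions below, so the result should be treated as an approximation.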

1.) Source files are based on the C language. Makefile and shell script files have a different convention, '#', to denote the comment lines and if they are part of diffset, their comment lines won't be counted.
2.) The Git convention of line change: If a line is modified, Git sees it as that particular line is deleted and a new line is inserted there and it may look like two lines are changed whereas in reality one line is modified.
```
 In the below example, the new definition of 'FOO' looks like a two-line change.

 $  git diff --stat -w abc.h
 ...
 -#define FOO 7
 +#define FOO 105
 ...
 1 files changed, 1 insertions(+), 1 deletions(-)
 $
```
3.) Valid comment lines not matching the pattern (or) Valid source code lines matching the pattern can cause errors in the calculation.

In the example below, the "+  blah blah" line, which does not start with "*", will not be detected as a comment line.

           + /*
           +  blah blah
           + *
           + */

In the example below, the "+ *ptr" line starts with * and so is counted as a comment line, even though it is a valid source code line.

            + printf("\n %p",
            +         *ptr);

- Saravanan Palanisamy

1

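To do this properly for most programming languages, you have to parse the original source file/AST and exclude comments that way.

One reason is that the start of a multi-line comment may not be covered by the diff. Another is that language parsing is not trivial, and there are often things that trip up a naive parser.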
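
As a small illustration of the "parse the source" route, here is a sketch for Python files that uses the standard tokenize module to compare two versions of a file while ignoring comments and blank lines. It works on whole files rather than on a diff, the helper names code_tokens and same_code are hypothetical, and docstrings are ordinary string tokens here, so they are not ignored.

```python
import io
import tokenize


def code_tokens(source):
    """Return (type, text) pairs for Python source, without comments or blank-line tokens."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tok.type, tok.string) for tok in tokens
            if tok.type not in (tokenize.COMMENT, tokenize.NL)]


def same_code(old_source, new_source):
    """True if the two sources differ only in comments and blank lines."""
    return code_tokens(old_source) == code_tokens(new_source)


# Hypothetical file contents.
old = "# version 1.0\nx = 1  # set x\n"
new_comments = "# version 1.1\n\nx = 1  # initialise x\n"
new_code = "# version 1.1\nx = 2\n"

print(same_code(old, new_comments))  # True
print(same_code(old, new_code))      # False
```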

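I was going to do this for Python, but string hacking was good enough for my needs.

For Python, you can ignore comments, and attempt to ignore docstrings, with a custom filter, for example: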


#!/usr/bin/env python

import sys
import re
import configparser
from fnmatch import fnmatch
from unidiff import PatchSet

EXTS = ["py"]


class Opts:  # pylint: disable=too-few-public-methods
    debug = False
    exclude = []


def filtered_hunks(fil):
    path_re = ".*[.](%s)$" % "|".join(EXTS)
    for patch in PatchSet(fil):
        if not re.match(path_re, patch.path):
            continue
        excluded = False
        if Opts.exclude:
            if Opts.debug:
                print(">", patch.path, "=~", Opts.exclude)
            for ex in Opts.exclude:
                if fnmatch(patch.path, ex):
                    excluded = True
        if excluded:
            continue
        for hunk in patch:
            yield hunk


class Typ:  # pylint: disable=too-few-public-methods
    LINE = "."
    COMMENT = "#"
    DOCSTRING = "d"
    WHITE = "w"


def classify_lines(fil):
    for hunk in filtered_hunks(fil):
        yield from classify_hunk(hunk)


def classify_line(lval):
    """Classify a single python line, noting comments, best efforts at docstring start/stop and pure-whitespace."""
    lval = lval.rstrip("\n\r")
    remaining_lval = lval
    typ = Typ.LINE
    if re.match(r"^ *$", lval):
        return Typ.WHITE, None, ""

    if re.match(r"^ *#", lval):
        typ = Typ.COMMENT
        remaining_lval = ""
    else:
        slug = re.match(r"^ *(\"\"\"|''')(.*)", lval)
        if slug:
            remaining_lval = slug[2]
            slug = slug[1]
            return Typ.DOCSTRING, slug, remaining_lval
    return typ, None, remaining_lval


def classify_hunk(hunk):
    """Classify lines of a python diff-hunk, attempting to note comments and docstrings.

    Ignores context lines.
    Docstring detection is not guaranteed (changes in the middle of large docstrings won't have starts.)
    Using ast would fix, but seems like overkill, and cannot be done on a diff-only.
    """

    p = ""
    prev_typ = 0
    pslug = None
    for line in hunk:
        lval = line.value
        lval = lval.rstrip("\n\r")
        typ = Typ.LINE
        naive_typ, slug, remaining_lval = classify_line(lval)
        if p and p[-1] == "\\":
            typ = prev_typ
        else:
            if prev_typ != Typ.DOCSTRING and naive_typ == Typ.COMMENT:
                typ = naive_typ
            elif naive_typ == Typ.DOCSTRING:
                if prev_typ == Typ.DOCSTRING and pslug == slug:
                    # remainder of line could have stuff on it
                    typ, _, _ = classify_line(remaining_lval)
                else:
                    typ = Typ.DOCSTRING
                    pslug = slug
            elif prev_typ == Typ.DOCSTRING:
                # continue docstring found in this context/hunk
                typ = Typ.DOCSTRING

        p = lval
        prev_typ = typ

        if typ == Typ.DOCSTRING:
            if re.match(r"(%s) *$" % pslug, remaining_lval):
                prev_typ = Typ.LINE

        if line.is_context:
            continue

        yield typ, lval


def count_lines(fil):
    """Totals changed lines of python code, attempting to strip comments and docstrings.

    Deletes/adds are counted equally.
    Could miss some things, don't rely on exact counts.
    """

    count = 0

    for (typ, line) in classify_lines(fil):
        if Opts.debug:
            print(typ, line)
        if typ == Typ.LINE:
            count += 1

    return count


def main():
    Opts.debug = "--debug" in sys.argv
    Opts.exclude = []

    use_covrc = "--covrc" in sys.argv

    if use_covrc:
        config = configparser.ConfigParser()
        config.read(".coveragerc")
        cfg = {s: dict(config.items(s)) for s in config.sections()}
        # Default to "" (not []) so that .split() below works when "omit" is absent
        exclude = cfg.get("report", {}).get("omit", "")
        Opts.exclude = [f.strip() for f in exclude.split("\n") if f.strip()]

    for i in range(len(sys.argv)):
        if sys.argv[i] == "--exclude":
            Opts.exclude.append(sys.argv[i + 1])

    if Opts.debug and Opts.exclude:
        print("--exclude", Opts.exclude)

    print(count_lines(sys.stdin))


example = '''
diff --git a/cryptvfs.py b/cryptvfs.py
index c68429cf6..ee90ecea8 100755
--- a/cryptvfs.py
+++ b/cryptvfs.py
@@ -2,5 +2,17 @@

 from src.main import proc_entry

-if __name__ == "__main__":
-    proc_entry()
+
+
+class Foo:
+    """some docstring
+    """
+    # some comment
+    pass
+
+class Bar:
+    """some docstring
+    """
+    # some comment
+    def method():
+        line1 + 1
'''


def strio(s):
    import io

    return io.StringIO(s)


def test_basic():
    assert count_lines(strio(example)) == 10


def test_main(capsys):
    sys.argv = []
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert cap.out == "10\n"


def test_debug(capsys):
    sys.argv = ["--debug"]
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert Typ.DOCSTRING + '     """some docstring' in cap.out


def test_exclude(capsys):
    sys.argv = ["--exclude", "cryptvfs.py"]
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert cap.out == "0\n"


def test_covrc(capsys):
    sys.argv = ["--covrc"]
    sys.stdin = strio(example)
    main()
    cap = capsys.readouterr()
    print(cap.out)
    assert cap.out == "10\n"


if __name__ == "__main__":
    main()

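That code can easily be modified to produce file names instead of a count.

Of course, it can also mistake parts of a docstring for "code" (which, for things like coverage, it is not).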

- Erik Aronesty

0

Perhaps a Bash script like this:

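#!/bin/bash
# Pass revisions/options as arguments, e.g. <commit> <commit>
git diff --name-only "$@" | while IFS= read -r FPATH ; do
    # Revisions go first and the path goes after "--" so that Git parses them correctly
    LINES_COUNT=$(git diff --textconv "$@" -- "$FPATH" | sed '/^[1-]\s\+[1-]\s\+.*/d' | wc -l)
    if [ "$LINES_COUNT" -gt 0 ] ; then
        echo -e "$LINES_COUNT\t$FPATH"
    fi
done | sort -n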

- saeedgnu

0

I use meld as the tool: I set its option to ignore comments and then use meld as the difftool:

git difftool --tool=meld -y

- buffy


- phyatt · Accepted Answer

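Here is a solution I'm using that works well for now. I wrote up the solution along with some of the missing documentation for the git (log|diff) -G<regex> option.

It basically uses the same solution as the previous answers, but specifically for comments that start with * or #, sometimes with a space before the *... It still needs to let changes to #ifdef, #include, etc. through.

Lookahead and lookbehind do not seem to be supported by the -G option, nor does ? in general, and I had problems using * as well. + seems to work fine, though.

(Note: tested on Git v2.7.0)

Multi-line comment version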

git diff -w -G'(^[^\*# /])|(^#\w)|(^\s+[^\*#/])'

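-w ignore whitespace
-G only show file diffs with added/removed lines that match the following regex
(^[^\*# /]) any line that does not start with a star, a hash, a space, or a slash
(^#\w) any line that starts with # followed by a word character (such as a letter)
(^\s+[^\*#/]) any line that starts with some whitespace followed by a non-comment character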
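
A quick way to sanity-check the three alternatives is to run them against a few sample lines. The sketch below uses Python's re module as a stand-in for the regex engine Git uses, and assumes the pattern is applied to the line content without the leading +/- marker. The sample lines are made up.

```python
import re

PATTERN = re.compile(r"(^[^\*# /])|(^#\w)|(^\s+[^\*#/])")

# line -> whether the pattern is expected to match it
samples = {
    "int x = 2;": True,          # code at column 0
    "    return x;": True,       # indented code
    "#include <stdio.h>": True,  # preprocessor directive
    "/* Version 1.1": False,     # comment opener
    " * Version 1.1": False,     # comment body with one leading space
    "# a note": False,           # hash followed by a space
    "  * two spaces": True,      # \s+ gives back a space, which [^\*#/] then accepts
}

for line, expected in samples.items():
    matched = bool(PATTERN.search(line))
    print(f"{line!r:24} matched={matched} expected={expected}")
```

The last sample shows a limitation of the third alternative: with two or more leading whitespace characters it matches regardless of what follows, so deeply indented comment lines are not filtered out.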

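Basically, an SVN hook currently modifies every file on its way in and out, changing a multi-line comment block in each file. Now I can diff my changes against SVN without the FYI information that SVN drops into the comments.

Technically this lets Python and Bash comments such as #TODO show up in the diff, and a C++ division operator that starts a new line could be ignored: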

a = b
    / c;

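In addition, the Git documentation for -G seemed rather sparse, so the information here should help:

`git diff -G<regex>`

This option selects the file diffs that contain at least one added or removed line matching the regular expression.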

-G<regex>

Look for differences whose patch text contains added/removed lines that match <regex>.

To illustrate the difference between -S<regex> --pickaxe-regex and -G<regex>, consider a commit with the following diff in the same file:
+    return !regexec(regexp, two->ptr, 1, &regmatch, 0);
...
-    hit = !regexec(regexp, mf2.ptr, 1, &regmatch, 0);
While git log -G"regexec\(regexp" will show this commit, git log -S"regexec\(regexp" --pickaxe-regex will not (because the number of occurrences of that string did not change).

See the pickaxe entry in gitdiffcore(7) for more information.

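(Note: tested on Git v2.7.0)

-G appeared to behave like a basic regex in these tests, although the source excerpt at the end of this answer shows the pattern being compiled with REG_EXTENDED.
The ?, *, !, {, } regex syntax did not appear to work.
Grouping with () and OR-ing with | work.
Wildcard-style escapes such as \s, \W, etc. are supported.
Lookahead and lookbehind are not supported.
Beginning- and end-of-line anchors ^ and $ work.
The feature has been available since Git 1.7.4.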

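Excluding files vs. excluding diffs

Note that the -G option filters which files will be diffed.

However, if a file gets "diffed", the lines that were "excluded/included" before will all be shown in the diff.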
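
The distinction can be sketched in a few lines of Python. This models only the selection behaviour described above, not Git's actual implementation: pickaxe_g and the changes mapping are hypothetical, and Python's re stands in for the real regex engine.

```python
import re


def pickaxe_g(changes, regex):
    """Keep every file that has at least one changed line matching regex.

    `changes` maps a path to the list of added/removed line contents.
    Selected files keep ALL of their changed lines, matching or not.
    """
    pattern = re.compile(regex)
    return {path: lines for path, lines in changes.items()
            if any(pattern.search(line) for line in lines)}


# Hypothetical changes: a.c only bumps the version comment, b.c also changes code.
changes = {
    "a.c": [" * Version 1.0", " * Version 1.1"],
    "b.c": [" * Version 1.0", " * Version 1.1", "int x = 1;", "int x = 2;"],
}
selected = pickaxe_g(changes, r"^[^ ]")
print(sorted(selected))      # ['b.c']
print(len(selected["b.c"]))  # 4
```

The file a.c is dropped entirely, while b.c is shown with its comment changes included, because the filter works per file rather than per line.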

Examples

Show only the file diffs that have at least one changed line mentioning foo.

git diff -G'foo'

Show the diffs of files that have at least one changed line not starting with #.

git diff -G'^[^#]'

Show the files that have diffs mentioning FIXME or TODO.

git diff -G'(FIXME)|(TODO)'

See also git log -G, git grep, git log -S, --pickaxe-regex and --pickaxe-all.

Update: which regex tool does the -G option use? https://github.com/git/git/search?utf8=%E2%9C%93&q=regcomp&type= https://github.com/git/git/blob/master/diffcore-pickaxe.c

if (opts & (DIFF_PICKAXE_REGEX | DIFF_PICKAXE_KIND_G)) {
    int cflags = REG_EXTENDED | REG_NEWLINE;
    if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE))
        cflags |= REG_ICASE;
    regcomp_or_die(&regex, needle, cflags);
    regexp = &regex;

// and in the regcomp_or_die function
regcomp(regex, needle, cflags);

http://man7.org/linux/man-pages/man3/regexec.3.html

   REG_EXTENDED
          Use POSIX Extended Regular Expression syntax when interpreting
          regex.  If not set, POSIX Basic Regular Expression syntax is
          used.

// ...

   REG_NEWLINE
          Match-any-character operators don't match a newline.

          A nonmatching list ([^...])  not containing a newline does not
          match a newline.

          Match-beginning-of-line operator (^) matches the empty string
          immediately after a newline, regardless of whether eflags, the
          execution flags of regexec(), contains REG_NOTBOL.

          Match-end-of-line operator ($) matches the empty string
          immediately before a newline, regardless of whether eflags
          contains REG_NOTEOL.
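
To get a feel for what line-anchored matching means on a multi-line buffer, here is a small Python sketch. Python's re.MULTILINE is only a rough analogue of REG_NEWLINE (for example, a Python negated class such as [^x] still matches a newline), and the sample text is made up.

```python
import re

# Hypothetical multi-line buffer of changed lines.
patch_text = "+int x = 2;\n- * Version 1.0\n+ * Version 1.1"

# Without MULTILINE, ^ and $ only anchor at the start and end of the whole string.
print(re.findall(r"^\+.*$", patch_text))                # []

# With MULTILINE, ^ and $ also anchor at every line boundary.
print(re.findall(r"^\+.*$", patch_text, re.MULTILINE))  # ['+int x = 2;', '+ * Version 1.1']
```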