How do I count the differences between two files on Linux?

Question

56

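I need to work with large files and must find the differences between them. I don't need the differing bits themselves, only the number of differences.

To find the number of differing lines, I came up with this: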

diff --suppress-common-lines --speed-large-files -y File1 File2 | wc -l

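It works, but is there a better way?

And how can I count the exact number of differences using standard tools (such as bash, diff, awk, sed, or some old version of perl)?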

- Zsolt Botykai

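Where in the question does it say he wants to count differences between lines rather than between characters? I see "bits" and "the exact number of differences", but "lines" is just the way he tried to do it. - vstepaniuk

7 Answers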

49

diff -U 0 file1 file2 | grep -v ^@ | wc -l

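Subtract 2 from the result to account for the two file-name lines at the top of the diff listing. Unified format is probably slightly faster than side-by-side format.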

- John Kugelman

7

By my definition of "works", this doesn't work. http://pastie.org/pastes/3179433/text Each file contains only one character, so what does the number "4" relate to? - Stop Slandering Monica Cellio

5

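It depends on how you count differences. In this example, pastie.org/5553254, I would say that two lines differ, i.e. I agree with Sequoia McDowell. Having to subtract 2 from the result because the 2 diff file names are printed is also inconvenient. So I think Josh's answer is the correct one. It can be shortened by using grep's -c (count) option instead of piping to wc -l, like this: diff -U 0 file1 file2 | grep -c ^@. - Henrik Warne

diff -U 0 file1 file2 | grep -v ^@ | tail -n +3 | wc -l should give the correct count. It excludes the file names at the top of the diff output. - Matt Kneiser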
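The counting described in the comment above can be sketched in Python. This is only an illustration, assuming the standard-library difflib module is an acceptable stand-in for diff (its matching algorithm differs from GNU diff, so counts on real files may not be identical). The function name count_changed_lines is hypothetical. It drops the two header lines by position, as tail -n +3 does, and then skips the @@ hunk markers:

```python
import difflib
from itertools import islice

def count_changed_lines(old_lines, new_lines):
    """Count the '-' and '+' lines of a zero-context unified diff."""
    diff = difflib.unified_diff(old_lines, new_lines, n=0, lineterm="")
    # Drop the '---' and '+++' header lines by position, not by prefix,
    # so a removed line whose text starts with '--' is still counted.
    body = islice(diff, 2, None)
    # With zero context, every remaining non-'@@' line is an added or removed line.
    return sum(1 for line in body if not line.startswith("@@"))
```

Note that a modified line is counted twice with this approach, once as removed and once as added.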

6

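The correct solution is here: https://unix.stackexchange.com/questions/53719/get-correct-number-of-lines-in-diff-output (the accepted answer). - tsusanka

It looks like the original question wasn't after this way of counting, or Josh's way, given the example code in the question and the requirement to "find the number of differing lines". Though I guess they accepted this answer! - Neal Gokli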

5

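If you are on Linux/Unix, consider comm -23 file1 file2 to print the lines that are in file1 but not in file2, and comm -23 file1 file2 | wc -l to count them; similarly, comm -13 ... handles the lines that are only in file2. (comm -1 on its own only hides the lines unique to file1, so it would still print the common lines.)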

- dubiousjim

1

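As sureshw pointed out in another answer, comm expects its arguments to be sorted files. So this suggestion can only be relied on in special cases. (I think it would be easy to write your own version of comm in awk that also works on unsorted input, but I suspect that no longer meets the spirit of the original question.) - dubiousjim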
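For illustration, here is a minimal Python sketch (rather than awk) of a comm-like count that does not need sorted input. The function name count_unique_lines is hypothetical, and the approach assumes that line order does not matter: each file is treated as a multiset of lines, so two files with the same lines in a different order count as identical.

```python
from collections import Counter

def count_unique_lines(lines1, lines2):
    """Return (only_in_1, only_in_2), treating each input as a multiset of lines."""
    c1, c2 = Counter(lines1), Counter(lines2)
    # Counter subtraction keeps only positive counts, so repeated lines
    # are matched one-for-one between the two inputs.
    only_in_1 = sum((c1 - c2).values())
    only_in_2 = sum((c2 - c1).values())
    return only_in_1, only_in_2
```

Because it holds every distinct line in memory, this may not suit very large files with mostly unique lines.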

5

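Since every differing output line starts with a < or > character, I would suggest this:

diff file1 file2 | grep '^[<>]' | wc -l

By using only '^<' or only '^>' as the pattern, you can count the differences in just one of the files.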

- Michal Nemec

This double-counts lines, because "<" and ">" may both be printed for the same line. - Vladislavs Dovgalecs
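One way to avoid this double counting is to treat a replaced line as a single difference. The following is a minimal sketch, assuming Python's difflib as a stand-in for diff (so results may differ from GNU diff on some inputs); the function name count_differing_rows is hypothetical. For each changed block it counts the larger of the two sides, which is similar in spirit to counting the rows of a side-by-side diff:

```python
import difflib

def count_differing_rows(old_lines, new_lines):
    """Count differences so that a replaced line counts once, not once per file."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A block of m old lines replaced by n new lines counts as max(m, n).
            total += max(i2 - i1, j2 - j1)
    return total
```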

4

I believe the correct solution is in this answer, namely:

$ diff -y --suppress-common-lines a b | grep '^' | wc -l
1

- tsusanka

0

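If you are dealing with files of similar content that should match line for line in the same order (e.g. CSV files describing similar things), and you want to find, for example, the 2 differences in the following files: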

File a:    File b:
min,max    min,max
1,5        2,5
3,4        3,4
-2,10      -1,1

you could implement it in Python like this:

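from itertools import zip_longest

file1, file2 = "a", "b"  # placeholder paths of the two files to compare

different_lines = 0
with open(file1) as a, open(file2) as b:
    # zip_longest pads the shorter file with None, so extra lines
    # at the end of either file are counted as differences too.
    for line, other_line in zip_longest(a, b):
        if line != other_line:
            different_lines += 1
print(different_lines)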

- Daniel Lee

0

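Here is a way to count any kind of difference between two files, using a specified regex to define the units being compared - here . stands for any character except a newline: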

git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l

Part of man git-diff:

--patience
           Generate a diff using the "patience diff" algorithm.
--word-diff[=<mode>]
           Show a word diff, using the <mode> to delimit changed words. By default, words are delimited by whitespace; see --word-diff-regex below.
           porcelain
               Use a special line-based format intended for script consumption. Added/removed/unchanged runs are printed in the usual unified diff
               format, starting with a +/-/` ` character at the beginning of the line and extending to the end of the line. Newlines in the input
               are represented by a tilde ~ on a line of its own.
--word-diff-regex=<regex>
           Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it
           was already enabled.
           Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!)
           for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches
           all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.
           For example, --word-diff-regex=.  will treat each character as a word and, correspondingly, show differences character by character.

pcre2grep is part of the pcre2-utils package on Ubuntu 20.04.

- vstepaniuk

- Josh · Accepted Answer

52

If you want to count the number of differing lines, use this:

diff -U 0 file1 file2 | grep ^@ | wc -l

Doesn't John's answer double-count the differing lines?

- Josh

Yes, it double-counts. See my comment on the accepted answer. The command in this answer is correct. - Henrik Warne

2

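It seems to me that this can undercount lines, on both Mac OS X and Ubuntu. A contiguous batch of lines can be grouped into a single hunk; whether that should count as one difference or several depends on your task. - Michael H.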

Don't forget that colored output means the lines start with escape sequences! I had to use hexdump to figure that out. - James Morris
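To illustrate the point about colored output, here is a small Python sketch that strips ANSI color codes before looking for hunk markers. It assumes the coloring uses standard SGR escape sequences (ESC [ ... m); the names ANSI_SGR and count_hunk_headers are hypothetical. Where the diff tool has an option to turn color off, using that is simpler.

```python
import re

# Matches ANSI SGR (color/style) escape sequences such as "\x1b[36m" or "\x1b[0m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def count_hunk_headers(diff_output_lines):
    """Count lines that start with '@@' once color codes are removed."""
    return sum(
        1 for line in diff_output_lines
        if ANSI_SGR.sub("", line).startswith("@@")
    )
```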

11

As @khedron pointed out, contiguous lines can be grouped into a single hunk. As far as I can tell, this means the method is prone to undercounting. - user533832

6

You can replace grep ^@ | wc -l with grep -c ^@; it does the same thing. - Shiplu Mokaddim

8

"Prone to undercounting" is putting it mildly: running this command on two completely different files can give a result of just 1. - nemetroid
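The undercounting is easy to reproduce. The sketch below assumes Python's difflib as a stand-in for diff -U 0 and uses a hypothetical function name, count_hunks. It counts contiguous changed blocks, which is what counting @@ lines amounts to:

```python
import difflib

def count_hunks(old_lines, new_lines):
    """Count contiguous changed blocks, analogous to counting '@@' hunk headers."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")

# Two files with no line in common: three lines differ, but they form one block.
print(count_hunks(["a", "b", "c"], ["x", "y", "z"]))
```

Here all three lines differ, yet they form a single changed block, so the count is 1.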