How can I use a Bash script to find lines that exist in one file but not in another?

Question

15

Imagine file 1:

#include "first.h"
#include "second.h"
#include "third.h"

// more code here
...

Imagine file 2:

#include "fifth.h"
#include "second.h"
#include "eigth.h"

// more code here
...

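I want to get the headers that are included in file 2 but not in file 1, and only those lines. So, when run, the diff of file 1 and file 2 should produce: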

#include "fifth.h"
#include "eigth.h"

I know how to do this in Perl/Python/Ruby, but I would like to do it without using another programming language.

- Senthess

1

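For more ways to do the same thing, see this BashFAQ. Keep in mind that, because all of these solutions are line-based pattern matching, you have to make sure the include lines are formatted the same way everywhere. For example, #include will not match # include, "first.h" will not match "../first.h" in a subdirectory, and so on. - jw013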
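
One way to soften the formatting sensitivity described in this comment is to normalize the include lines before comparing them. The following is only a sketch under stated assumptions: normalize_includes is a hypothetical helper name, the two sample files are made up for illustration, and the sed expression only evens out whitespace around # and include. It does nothing about differing paths such as "../first.h".

```shell
#!/usr/bin/env bash
# normalize_includes is a hypothetical helper (not from the original thread).
# It prints only the #include lines of a file, rewritten as: #include <rest>
normalize_includes() {
  sed -n -E '/^[[:space:]]*#[[:space:]]*include/{
    s/^[[:space:]]*#[[:space:]]*include[[:space:]]*/#include /
    s/[[:space:]]+$//
    p
  }' "$1"
}

# Made-up sample files with inconsistent spacing.
workdir=$(mktemp -d)
printf '%s\n' '#include "first.h"' '# include "second.h"' > "$workdir/a.h"
printf '%s\n' '  #include   "second.h"' '#include "fifth.h"' > "$workdir/b.h"

# Normalize b.h into a scratch file, then drop every line that is
# identical (-x), as a fixed string (-F), to a normalized line of a.h.
normalize_includes "$workdir/b.h" > "$workdir/b.norm"
normalized_diff=$(normalize_includes "$workdir/a.h" |
  grep -F -v -x -f - "$workdir/b.norm")

printf '%s\n' "$normalized_diff"
rm -rf "$workdir"
```

With these sample files, the two differently spaced "second.h" lines are treated as equal, so only the "fifth.h" include is printed.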

Possible duplicate of the question about removing lines from one file that appear in another file. - Ciro Santilli OurBigBook.com

5 Answers

9

If using a temporary file is acceptable, you can try this:

grep include file1.h > /tmp/x && grep -f /tmp/x -v file2.h | grep include

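What this does:

Extract all the includes from file1.h and write them to the file /tmp/x
Use that file to get all lines from file2.h that are not matched by anything in that list
Extract all the includes from what remains of file2.h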

However, it probably will not handle differences in whitespace and the like correctly.

Edit: to prevent false positives, use a different pattern for the last grep (thanks to jw013 for the suggestion):

grep include file1.h > /tmp/x && grep -f /tmp/x -v file2.h | grep "^#include"
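
The fixed path /tmp/x can collide with other users or other runs of the same script. As a hedged sketch, the same pipeline can be run against a private file created with mktemp. The variable names and the sample files below (copied from the question) are assumptions made for illustration, not part of the original answer.

```shell
#!/usr/bin/env bash
# Sample inputs copied from the question.
workdir=$(mktemp -d)
printf '%s\n' '#include "first.h"' '#include "second.h"' '#include "third.h"' \
  '' '// more code here' > "$workdir/file1.h"
printf '%s\n' '#include "fifth.h"' '#include "second.h"' '#include "eigth.h"' \
  '' '// more code here' > "$workdir/file2.h"

# A private pattern file created by mktemp replaces the fixed /tmp/x.
patterns=$(mktemp)

# Same pipeline as in the answer: collect the includes of file1.h, drop
# matching lines from file2.h, then keep only the #include lines left over.
grep include "$workdir/file1.h" > "$patterns" &&
  missing=$(grep -v -f "$patterns" "$workdir/file2.h" | grep '^#include')

printf '%s\n' "$missing"
rm -rf "$workdir" "$patterns"
```

With the question's sample files this prints the "fifth.h" and "eigth.h" includes, in the order they appear in file2.h. The patterns are still treated as regular expressions here, so the whitespace and false-match caveats mentioned in the answer continue to apply.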

- Frank Schmitt

1

Perhaps change the last grep pattern to '^#include', unless you also want to see random lines of code where you happen to use the word "include". - jw013

1

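When using grep to find matching lines, you should use the options -F for "fixed string" (non-regex) mode and -x for "whole line" matching. Also, the temporary file is not necessary; you can use -f - to read the pattern file from standard input. The final command becomes: grep '^#include' file1.h | grep -f - -vFx file2.h | grep '^#include'. - Lee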

8

This variant requires an fgrep that supports the -f option. GNU grep (i.e., any Linux system, and some others) should work fine.

# Find occurrences of '#include' in file1.h
fgrep '#include' file1.h |
# Remove any identical lines from file2.h
fgrep -vxf - file2.h |
# Result is all lines not present in file1.h.  Out of those, extract #includes
fgrep '#include'

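This does not require any sorting, nor any explicit temporary files. In theory, fgrep -f could use a temporary file behind the scenes, but I believe GNU fgrep does not.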
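
The pipeline above can also be spelled with grep -F, which is the form POSIX specifies; GNU grep 3.8 and later print an obsolescence warning when invoked as fgrep. The following is a minimal sketch rather than the answerer's own code: the variable names are assumptions, and the sample files are copied from the question.

```shell
#!/usr/bin/env bash
# Sample inputs copied from the question.
workdir=$(mktemp -d)
printf '%s\n' '#include "first.h"' '#include "second.h"' '#include "third.h"' \
  '' '// more code here' > "$workdir/file1.h"
printf '%s\n' '#include "fifth.h"' '#include "second.h"' '#include "eigth.h"' \
  '' '// more code here' > "$workdir/file2.h"

# Step 1: take the #include lines of file1.h as fixed-string patterns.
# Step 2: drop lines of file2.h that equal one of them (-x whole line,
#         -f - reads the pattern list from standard input).
# Step 3: of what is left, keep only the #include lines.
only_in_file2=$(
  grep -F '#include' "$workdir/file1.h" |
    grep -F -v -x -f - "$workdir/file2.h" |
    grep -F '#include'
)

printf '%s\n' "$only_in_file2"
rm -rf "$workdir"
```

Because the patterns are fixed strings matched against whole lines, a dot in a header name is not treated as a regex wildcard, and a line is only dropped when it is identical to a line of file1.h.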

- tripleee

POSIX specifies -f, so any POSIX-compliant grep should have it. - jw013

6

If you are not restricted to Bash alone (i.e., using external programs is acceptable), you can use combine from moreutils:

combine file1 not file2 > lines_in_file1_not_in_file2

- pmocek

2

Concatenate the contents of $file1 and $file2, keep the lines containing '#include', sort them, and print only the lines that are not duplicated.

- plbogen

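This will list the #include lines that belong to only file1 or only file2. I think you want cat $file1 $file1 $file2 | grep '#include' | sort | uniq -u, where file1 is repeated so that its #include lines are doubled and subsequently filtered out by uniq -u. - esmit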

Since grep can read multiple input files, you can use grep -h and get rid of the (only marginally useless) cat. - tripleee
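
Putting the two comments above together gives something like the sketch below. It assumes a grep that supports -h (GNU and BSD grep do), and the variable names and sample files are illustrative only. One caveat that follows from how uniq -u works: an include that appears more than once in file2 itself is also dropped, even if file1 never mentions it.

```shell
#!/usr/bin/env bash
# Sample inputs copied from the question.
workdir=$(mktemp -d)
printf '%s\n' '#include "first.h"' '#include "second.h"' '#include "third.h"' \
  '' '// more code here' > "$workdir/file1.h"
printf '%s\n' '#include "fifth.h"' '#include "second.h"' '#include "eigth.h"' \
  '' '// more code here' > "$workdir/file2.h"

# file1.h is listed twice, so each of its #include lines occurs at least
# twice after sorting and can never survive uniq -u. -h suppresses the
# file name prefix that grep adds when given several files.
unique_to_file2=$(
  grep -h '#include' "$workdir/file1.h" "$workdir/file1.h" "$workdir/file2.h" |
    sort | uniq -u
)

printf '%s\n' "$unique_to_file2"
rm -rf "$workdir"
```

With the question's sample files this prints the "eigth.h" include followed by the "fifth.h" include, since the result comes out in sorted order rather than in file order.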

- glenn jackman · Accepted Answer

This is a one-liner, but it does not preserve order:

comm -13 <(grep '#include' file1 | sort) <(grep '#include' file2 | sort)

If you need to preserve order:

awk '
  !/#include/ {next} 
  FILENAME == ARGV[1] {include[$2]=1; next} 
  !($2 in include)
' file1 file2
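
To make the ordering difference between the two commands concrete, here is a small sketch that runs both against the question's sample files. The variable names and the scratch directory are assumptions for illustration. The comm step uses sorted temporary files instead of process substitution, purely so that the sketch also runs under a plain POSIX sh; the comparison itself is the same.

```shell
#!/usr/bin/env bash
export LC_ALL=C   # keep sort and comm on the same collation order

# Sample inputs copied from the question.
workdir=$(mktemp -d)
printf '%s\n' '#include "first.h"' '#include "second.h"' '#include "third.h"' \
  '' '// more code here' > "$workdir/file1"
printf '%s\n' '#include "fifth.h"' '#include "second.h"' '#include "eigth.h"' \
  '' '// more code here' > "$workdir/file2"

# comm variant: -13 hides lines unique to file1 and lines common to both.
grep '#include' "$workdir/file1" | sort > "$workdir/sorted1"
grep '#include' "$workdir/file2" | sort > "$workdir/sorted2"
sorted_result=$(comm -13 "$workdir/sorted1" "$workdir/sorted2")

# awk variant: remember the header names ($2) seen in the first file,
# then print the #include lines of the second file whose name is new.
ordered_result=$(awk '
  !/#include/ {next}
  FILENAME == ARGV[1] {include[$2]=1; next}
  !($2 in include)
' "$workdir/file1" "$workdir/file2")

printf 'comm (sorted order):\n%s\n' "$sorted_result"
printf 'awk (file order):\n%s\n' "$ordered_result"
rm -rf "$workdir"
```

Both variants report the same two includes. The comm variant lists "eigth.h" before "fifth.h" because its input was sorted, while the awk variant keeps the order in which the lines appear in file2. Note that the awk variant compares only the second field, so it keys on the header name rather than on the whole line.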