Find duplicates of a file by content

Question

command-line duplicate

9

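I'm currently working with a file (for example an image file such as test1.jpg) and need to list all duplicates of that file (by content). I've already tried fdupes, but it doesn't allow checking against a given input file.

In short: I need a way to list all duplicates of one specific file by content.

A command-line solution is preferred, but a full application would be fine too.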

- GamrCorps

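2 New user here; please update the tags as needed. Thanks. - GamrCorps

Possible duplicate of: How to find (and delete) duplicate files. - Gilles 'SO- stop being evil'

@Gilles The difference between the two questions is that this one compares everything against a reference file, whereas the other one finds all duplicates. - GamrCorps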

6 Answers

4

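You can use filecmp in Python.

For example:

import filecmp
# shallow=False compares the actual bytes instead of trusting size and mtime
print(filecmp.cmp('file1.png', 'file2.png', shallow=False))

This prints True if the two files have equal content, otherwise False.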
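To apply this to the question (one reference file against many candidates), the same comparison can be run over a directory tree. The following is only a minimal sketch, assuming Python 3; the function name find_duplicates is made up for illustration and is not part of filecmp.

```python
import filecmp
import os


def find_duplicates(reference, root):
    """Return sorted absolute paths under root whose content equals reference."""
    reference = os.path.abspath(reference)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            candidate = os.path.abspath(os.path.join(dirpath, name))
            # Skip the reference itself and anything that is not a regular file.
            if candidate == reference or not os.path.isfile(candidate):
                continue
            # shallow=False forces a byte-by-byte comparison of the contents.
            if filecmp.cmp(reference, candidate, shallow=False):
                matches.append(candidate)
    return sorted(matches)
```

A call such as find_duplicates('test1.jpg', '.') would then list every file below the current directory that has the same content as test1.jpg.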

- Benny

1 Nice, learned something new. - Sergiy Kolodyazhnyy

4

Use the diff command together with the boolean operators && and ||.

bash-4.3$ diff /etc/passwd passwd_duplicate.txt > /dev/null && echo "SAME CONTENT" || echo "CONTENT DIFFERS"
SAME CONTENT

bash-4.3$ diff /etc/passwd TESTFILE.txt > /dev/null && echo "SAME CONTENT" || echo "CONTENT DIFFERS"
CONTENT DIFFERS

If you want to go over multiple files in a specific directory, cd into that directory first and then use a for loop, like this:

bash-4.3$ for file in * ; do  diff /etc/passwd "$file" > /dev/null && echo "$file has same contents" || echo "$file has different contents"; done
also-waste.txt has different contents
directory_cleaner.py has different contents
dontdeletethisfile.txt has different contents
dont-delete.txt has different contents
important.txt has different contents
list.txt has different contents
neverdeletethis.txt has different contents
never-used-it.txt has different contents
passwd_dulicate.txt has same contents

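For the recursive case, use the find command to traverse the directory and all of its subdirectories (mind the quotes and all the appropriate slashes):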

bash-4.3$ find . -type f -exec sh -c 'diff /etc/passwd "$1" > /dev/null && echo "$1 same" || echo "$1 differs"' sh {} \;
./reallyimportantfile.txt differs
./dont-delete.txt differs
./directory_cleaner.py differs
./TESTFILE.txt differs
./dontdeletethisfile.txt differs
./neverdeletethis.txt differs
./important.txt differs
./passwd_dulicate.txt same
./this-can-be-deleted.txt differs
./also-waste.txt differs
./never-used-it.txt differs
./list.txt differs

- Sergiy Kolodyazhnyy

3

Get the md5sum of the file in question and save it in a variable, e.g. md5:

md5=$(md5sum file.txt | awk '{print $1}')

Use find to traverse the desired directory tree and check whether any file has the same hash; if so, print the file name:

find . -type f -exec sh -c '[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] \
                             && echo "$1"' _ {} "$md5" \;

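find . -type f finds all files under the current directory; change the directory as needed
the -exec predicate runs the command sh -c ... on every file found
in sh -c, _ is a placeholder for $0, $1 is the file found, and $2 is $md5
[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] && echo "$1" prints the file name if the file's hash equals the hash we are checking duplicates against

Example: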

% md5sum ../foo.txt bar.txt 
d41d8cd98f00b204e9800998ecf8427e  ../foo.txt
d41d8cd98f00b204e9800998ecf8427e  bar.txt

% md5=$(md5sum ../foo.txt | awk '{print $1}')

% find . -type f -exec sh -c '[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] && echo "$1"' _ {} "$md5" \;
bar.txt
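For comparison, the same hash-and-compare idea can be sketched in Python. This is an illustration under assumptions, not a drop-in replacement for the find command: the names md5_of and find_by_md5 are hypothetical, and, like the find command above, the sketch also reports the reference file itself when it lives under the searched directory.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_by_md5(reference, root="."):
    """Yield regular files under root whose MD5 digest equals that of reference."""
    wanted = md5_of(reference)  # hash the reference once, like md5=$(...)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and md5_of(path) == wanted:
                yield path
```

Iterating over find_by_md5('../foo.txt', '.') would correspond to the example session above.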

- heemayl

2

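You can use md5sum's -c option on the command line with a little manipulation of its input stream. The following command is not recursive and only looks at the current working directory. Replace original_file with the name of the file whose duplicates you want to find.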

(hash=$(md5sum original_file) ; for f in ./* ; do echo "${hash%% *} ${f}" | if md5sum -c --status 2>/dev/null ; then echo "$f is a duplicate" ; fi ; done)

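You can replace the for f in ./* part with for f in /directory/path/* to search a different directory.

If you want the search to be recursive, you can set the shell option 'globstar' and use two asterisks in the pattern given to the for loop: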

(shopt -s globstar; hash=$(md5sum original_file); for f in ./** ; do echo "${hash%% *} ${f}" | if md5sum -c --status 2>/dev/null; then echo "$f is a duplicate"; fi; done)

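Both versions of the command output only the names of duplicate files, in the form ./file is a duplicate. Both are wrapped in parentheses so that neither the hash variable nor the globstar shell option stays set outside the command itself. The command can use other hash algorithms, for example by replacing both occurrences of md5sum with sha256sum.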
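The point about swapping the hash algorithm can be illustrated with a small Python sketch. It is non-recursive like the first command, and it assumes the algorithm name is one that hashlib.new() accepts on your system; digest_of and duplicates_in_dir are hypothetical names chosen for this example. As with the shell command, the original file is reported too if it sits in the searched directory.

```python
import hashlib
import os


def digest_of(path, algorithm="md5"):
    """Hash a file with any algorithm name that hashlib.new() accepts."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicates_in_dir(original, directory=".", algorithm="md5"):
    """Return regular files directly inside directory that hash like original."""
    wanted = digest_of(original, algorithm)
    with os.scandir(directory) as entries:
        # Only direct children are inspected; subdirectories are not entered.
        return sorted(
            entry.path
            for entry in entries
            if entry.is_file() and digest_of(entry.path, algorithm) == wanted
        )
```

Switching from MD5 to SHA-256 is then a matter of passing algorithm="sha256".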

- Arronical

1

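@smurf and @heemayl are certainly right, but I found that in my case their approach was slower than I wanted; I simply had too many files to process. So I wrote a small command-line tool that I think might help you too (https://github.com/tijn/dupfinder; Ruby; no external dependencies).

Basically, my script defers the hash calculation: it is only performed when the file sizes match. Since I know I'm searching for a 4 KB jpg file, why stream the contents of several multi-GB MP4 or iso files through a hashing algorithm? The rest of the script is mostly output formatting.

Edit: (thanks @Serg) Here is the source of the whole script. You should save it as ~/bin/find-dups or even /usr/local/bin/find-dups and then make it executable with chmod +x. It requires Ruby to be installed, but has no other dependencies.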

#!/usr/bin/env ruby

require 'digest/md5'
require 'fileutils'
require 'optparse'

def glob_from_argument(arg)
  if File.directory?(arg)
    arg + '/**/*'
  elsif File.file?(arg)
    arg
  else # it's already a glob
    arg
  end
end

# Wrap text at 80 chars. (configurable)
def wrap_text(*args)
  width = args.last.is_a?(Integer) ? args.pop : 80
  words = args.flatten.join(' ').split(' ')
  if words.any? { |word| word.size > width }
    raise NotImplementedError, 'cannot deal with long words'
  end

  lines = []
  line = []
  until words.empty?
    word = words.first
    if line.size + line.map(&:size).inject(0, :+) + word.size > width
      lines << line.join(' ')
      line = []
    else
      line << words.shift
    end
  end
  lines << line.join(' ') unless line.empty?
  lines.join("\n")
end

ALLOWED_PRINT_OPTIONS = %w(hay needle separator)

def parse_options(args)
  options = {}
  options[:print] = %w(hay needle)

  opt_parser = OptionParser.new do |opts|
    opts.banner = "Usage: #{$0} [options] HAYSTACK NEEDLES"
    opts.separator ''
    opts.separator 'Search for duplicate files (needles) in a directory (the haystack).'
    opts.separator ''
    opts.separator 'HAYSTACK should be the directory (or one file) that you want to search in.'
    opts.separator ''
    opts.separator wrap_text(
      'NEEDLES are the files you want to search for.',
      'A NEEDLE can be a file or a directory,',
      'in which case it will be recursively scanned.',
      'Note that NEEDLES may overlap the HAYSTACK.')
    opts.separator ''

    opts.on("-p", "--print PROPERTIES", Array,
      "When a match is found, print needle, or",
      "hay, or both. PROPERTIES is a comma-",
      "separated list with one or more of the",
      "words 'needle', 'hay', or 'separator'.",
      "'separator' prints an empty line.",
      "Default: 'needle,hay'") do |list|
      options[:print] = list
    end

    opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
      options[:verbose] = v
    end

    opts.on_tail("-h", "--help", "Show this message") do
      puts opts
      exit
    end
  end
  opt_parser.parse!(args)

  options[:haystack] = ARGV.shift
  options[:needles] = ARGV.shift(ARGV.size)

  raise ArgumentError, "Missing HAYSTACK" if options[:haystack].nil?
  raise ArgumentError, "Missing NEEDLES" if options[:needles].empty?
  unless options[:print].all? { |option| ALLOWED_PRINT_OPTIONS.include? option }
    raise ArgumentError, "Allowed print options are  'needle', 'hay', 'separator'"
  end

  options
rescue OptionParser::InvalidOption, ArgumentError => error
  puts error, nil, opt_parser.banner
  exit 1
end

options = parse_options(ARGV)

VERBOSE = options[:verbose]
PRINT_HAY = options[:print].include? 'hay'
PRINT_NEEDLE = options[:print].include? 'needle'
PRINT_SEPARATOR = options[:print].include? 'separator'

HAYSTACK_GLOB = glob_from_argument options[:haystack]
NEEDLES_GLOB = options[:needles].map { |arg| glob_from_argument(arg) }

def info(*strings)
  return unless VERBOSE
  STDERR.puts strings
end

def info_with_ellips(string)
  return unless VERBOSE
  STDERR.print string + '... '
end

def all_files(*globs)
  globs
    .map { |glob| Dir.glob(glob) }
    .flatten
    .map { |filename| File.expand_path(filename) } # normalize filenames
    .uniq
    .sort
    .select { |filename| File.file?(filename) }
end

def index_haystack(glob)
  all_files(glob).group_by { |filename| File.size(filename) }
end

@checksums = {}
def checksum(filename)
  @checksums[filename] ||= calculate_checksum(filename)
end

def calculate_checksum(filename)
  Digest::MD5.hexdigest(File.read(filename))
end

def find_needle(needle, haystack)
  straws = haystack[File.size(needle)] || return

  checksum_needle = calculate_checksum(needle)
  straws.detect do |straw|
    straw != needle &&
      checksum(straw) == checksum_needle &&
      FileUtils.identical?(needle, straw)
  end
end

BOLD = "\033[1m"
NORMAL = "\033[22m"

def print_found(needle, found)
  if PRINT_NEEDLE
    print BOLD if $stdout.tty?
    puts needle
    print NORMAL if $stdout.tty?
  end
  puts found if PRINT_HAY
  puts if PRINT_SEPARATOR
end

info "Searching #{HAYSTACK_GLOB} for files equal to #{NEEDLES_GLOB}."

info_with_ellips 'Indexing haystack by file size'
haystack = index_haystack(HAYSTACK_GLOB)
haystack[0] = nil # ignore empty files
info "#{haystack.size} files"

info 'Comparing...'
all_files(*NEEDLES_GLOB).each do |needle|
  info "  examining #{needle}"
  found = find_needle(needle, haystack)
  print_found(needle, found) unless found.nil?
end
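The core idea of the script (index the haystack by file size, and hash only files whose size matches the needle) can be condensed into a short Python sketch. This is an illustration, not a port: the function names are hypothetical, it returns all matches rather than the first one, and unlike the Ruby script it does not skip empty files.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def index_by_size(root):
    """Map file size in bytes to the list of regular files of that size."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                index[os.path.getsize(path)].append(path)
    return index


def md5_of(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_matches(needle, index):
    """Return files from the size index whose content equals the needle."""
    candidates = index.get(os.path.getsize(needle), [])
    if not candidates:
        return []  # no file of the same size, so nothing gets hashed at all
    wanted = md5_of(needle)
    needle_abs = os.path.abspath(needle)
    return [
        path
        for path in candidates
        if os.path.abspath(path) != needle_abs
        and md5_of(path) == wanted
        and filecmp.cmp(needle, path, shallow=False)  # final byte-wise check
    ]
```

Getting a file's size only needs a stat call, so large files of a different size are never opened, which is where the speed-up described above comes from.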

- Tijn

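1 Although GitHub is a well-respected and trusted site, it's recommended to put the source code into the answer itself so that the answer is self-contained. See this discussion: http://meta.askubuntu.com/q/15743/295286 - Sergiy Kolodyazhnyy

Seems a bit complicated, but why stop there? You could even check the first X characters to rule out some of the files (especially when you're looking for a large file). - AxelH

@AxelH Yes, I actually had some code that did that, but it turned out to be slower than going without it. I do agree the feature is very useful, especially for large files, so maybe I should add it back and make it optional through a command-line flag. (If anyone else wants to write it, pull requests are welcome.) Now that I think about it, you could even let the user choose which strategies to use and in what order. :-) - Tijn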
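The prefix check discussed in these comments could look roughly like the sketch below. It is only an assumption of how such a pre-filter might be written (the names same_prefix and could_be_duplicate and the 4096-byte default are made up), and, as Tijn notes, it is not guaranteed to be faster in practice.

```python
import os


def same_prefix(path_a, path_b, length=4096):
    """Return True if the first `length` bytes of both files are equal."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return fa.read(length) == fb.read(length)


def could_be_duplicate(reference, candidate, length=4096):
    """Cheap pre-checks only; True means a full comparison is still required."""
    if os.path.getsize(reference) != os.path.getsize(candidate):
        return False
    return same_prefix(reference, candidate, length)
```

A False result rules the candidate out; a True result still has to be confirmed with a full hash or byte-wise comparison.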

- sмurf · Accepted Answer

First find the MD5 hash of your file:

$ md5sum path/to/file
e740926ec3fce151a68abfbdac3787aa  path/to/file

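(The first line is the command you need to run; the second line is the MD5 hash of that file.)

Then copy the hash (it will be different in your case) and paste it into the next command: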

$ find . -type f -print0 | xargs -0 md5sum | grep e740926ec3fce151a68abfbdac3787aa
e740926ec3fce151a68abfbdac3787aa  ./path/to/file
e740926ec3fce151a68abfbdac3787aa  ./path/to/other/file/with/same/content
....

If you want to get fancy, you can combine the two commands into one:

$ find . -type f -print0 | xargs -0 md5sum | grep `md5sum path/to/file | cut -d " " -f 1`
e740926ec3fce151a68abfbdac3787aa  ./path/to/file
e740926ec3fce151a68abfbdac3787aa  ./path/to/other/file/with/same/content
....

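You can use sha1 or any other fancier hash algorithm. If the use case is searching through "several multi-GB MP4 or iso files" to find a "4 KB jpg" (per @Tijn's answer), then specifying the file size will speed things up considerably. If the file you are looking for is exactly 3952 bytes in size (which you can check with ls -l path/to/file), this command will run much faster: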

$ find . -type f -size 3952c -print0 | xargs -0 md5sum | grep e740926ec3fce151a68abfbdac3787aa
e740926ec3fce151a68abfbdac3787aa  ./path/to/file
e740926ec3fce151a68abfbdac3787aa  ./path/to/other/file/with/same/content

Note the extra c after the size, which stands for characters/bytes.

If you wish, you can combine this into a single command:

FILE=./path/to/file && find . -type f -size "$(du -b "$FILE" | cut -f1)c" -print0 | xargs -0 md5sum | grep "$(md5sum "$FILE" | cut -f1 -d " ")"