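I'm currently working with a file (for example an image file such as test1.jpg), and I need to list all duplicates of that file (by content). I've already tried fdupes, but it doesn't allow checking against a given input file.
In short: I need a way to list all duplicates of a specific file by content.
Preferably a command-line solution, but a full application would be fine too.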
$ md5sum path/to/file
e740926ec3fce151a68abfbdac3787aa path/to/file
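(The first line is the command you need to run; the second line is the MD5 hash of that file.)
Then copy the hash (it will probably be different in your case) and paste it into the next command: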
$ find . -type f -print0 | xargs -0 md5sum | grep e740926ec3fce151a68abfbdac3787aa
e740926ec3fce151a68abfbdac3787aa ./path/to/file
e740926ec3fce151a68abfbdac3787aa ./path/to/other/file/with/same/content
....
The two steps can also be combined into a single command:
$ find . -type f -print0 | xargs -0 md5sum | grep `md5sum path/to/file | cut -d " " -f 1`
e740926ec3fce151a68abfbdac3787aa ./path/to/file
e740926ec3fce151a68abfbdac3787aa ./path/to/other/file/with/same/content
....
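The pipeline above can also be wrapped in a small shell function. This is only a sketch, assuming GNU md5sum, find and xargs are available; the function name list_md5_dups and its optional directory argument are made up for illustration. It compares the hash field exactly with awk instead of using grep, so a file whose name merely contains the hash string is not reported by mistake.

```shell
# list_md5_dups FILE [DIR]  (hypothetical helper, not part of any tool)
# Print every regular file under DIR (default: .) whose MD5 hash
# equals the MD5 hash of FILE. FILE itself is listed too if it is in DIR.
list_md5_dups() {
    local target="$1" dir="${2:-.}" hash
    # Hash the reference file via stdin so its name never appears in the output.
    hash=$(md5sum < "$target" | cut -d ' ' -f 1)
    [ -n "$hash" ] || return 1
    # md5sum prints 32 hex digits, two separator characters, then the name,
    # so the file name starts at column 35.
    # Names containing a backslash or newline are escaped by md5sum and
    # are skipped by this simple comparison.
    find "$dir" -type f -print0 |
        xargs -0 -r md5sum -- |
        awk -v h="$hash" '$1 == h { print substr($0, 35) }'
}
```

It could then be called as list_md5_dups path/to/file . to search the current directory tree.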
If you know the size of the file in bytes (you can check it with ls -l path/to/file), this command will run faster:
$ find . -type f -size 3952c -print0 | xargs -0 md5sum | grep e740926ec3fce151a68abfbdac3787aa
e740926ec3fce151a68abfbdac3787aa ./path/to/file
e740926ec3fce151a68abfbdac3787aa ./path/to/other/file/with/same/content
The size and the hash can also be determined automatically in a single command:
FILE=./path/to/file && find . -type f -size "$(du -b "$FILE" | cut -f1)c" -print0 | xargs -0 md5sum | grep "$(md5sum "$FILE" | cut -f1 -d " ")"
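The one-liner above relies on du -b, which is a GNU extension. As a hedged alternative, the sketch below takes the size from wc -c and lets find pass the same-size candidates to md5sum in batches. The function name list_dups_by_size_and_md5 is hypothetical, and GNU md5sum is still assumed.

```shell
# list_dups_by_size_and_md5 FILE [DIR]  (hypothetical helper)
# Only files with exactly the same byte size as FILE are hashed,
# because files of different sizes cannot have identical content.
list_dups_by_size_and_md5() {
    local target="$1" dir="${2:-.}" size hash
    size=$(wc -c < "$target") || return 1
    size=$((size))    # strip the padding some wc implementations add
    hash=$(md5sum < "$target" | cut -d ' ' -f 1)
    [ -n "$hash" ] || return 1
    # "-size Nc" matches files of exactly N bytes; "{} +" hands many
    # file names to each md5sum invocation.
    # Names containing a backslash or newline are escaped by md5sum and
    # are skipped by this simple comparison.
    find "$dir" -type f -size "${size}c" -exec md5sum -- {} + |
        awk -v h="$hash" '$1 == h { print substr($0, 35) }'
}
```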
diff can be combined with the && and || operators to report whether two files have the same content:
bash-4.3$ diff /etc/passwd passwd_duplicate.txt > /dev/null && echo "SAME CONTENT" || echo "CONTENT DIFFERS"
SAME CONTENT
bash-4.3$ diff /etc/passwd TESTFILE.txt > /dev/null && echo "SAME CONTENT" || echo "CONTENT DIFFERS"
CONTENT DIFFERS
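For binary files such as images, cmp -s may be a better fit than diff, since it only compares bytes and reports the result through its exit status. The following is a minimal sketch; the helper names same_content and report_content are made up for illustration. An explicit if/else is used because in A && B || C the last command also runs when B fails.

```shell
# same_content A B: succeed (exit status 0) only if both files are
# byte-for-byte identical. cmp -s prints nothing about differences.
same_content() {
    cmp -s -- "$1" "$2"
}

# report_content A B: print a verdict in the style of the diff example.
report_content() {
    if same_content "$1" "$2"; then
        echo "SAME CONTENT"
    else
        echo "CONTENT DIFFERS"
    fi
}
```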
Use the cd command to enter the directory, then use a for loop, like this:
bash-4.3$ for file in * ; do diff /etc/passwd "$file" > /dev/null && echo "$file has same contents" || echo "$file has different contents"; done
also-waste.txt has different contents
directory_cleaner.py has different contents
dontdeletethisfile.txt has different contents
dont-delete.txt has different contents
important.txt has different contents
list.txt has different contents
neverdeletethis.txt has different contents
never-used-it.txt has different contents
passwd_dulicate.txt has same contents
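Use the find command to traverse the directory and all its subdirectories (mind the quotes and all the appropriate slashes). The file name is passed to sh as the argument $1 rather than embedded in the command string, so names containing quotes or other special characters are handled safely:
bash-4.3$ find . -type f -exec sh -c 'diff /etc/passwd "$1" > /dev/null && echo "$1 same" || echo "$1 differs"' _ {} \;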
./reallyimportantfile.txt differs
./dont-delete.txt differs
./directory_cleaner.py differs
./TESTFILE.txt differs
./dontdeletethisfile.txt differs
./neverdeletethis.txt differs
./important.txt differs
./passwd_dulicate.txt same
./this-can-be-deleted.txt differs
./also-waste.txt differs
./never-used-it.txt differs
./list.txt differs
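If only the matching files are of interest, find can do the filtering itself: -exec ... \; acts as a test that is true when the command exits with status 0, so a following -print only fires for identical files. Below is a minimal sketch using cmp -s instead of diff; the function name list_identical is hypothetical.

```shell
# list_identical FILE [DIR]  (hypothetical helper)
# Print every regular file under DIR (default: .) that is byte-for-byte
# identical to FILE. FILE itself is listed too if it lives under DIR.
list_identical() {
    local target="$1" dir="${2:-.}"
    # cmp -s exits with 0 only for identical files, which makes -exec
    # evaluate to true and lets -print output the name.
    find "$dir" -type f -exec cmp -s -- "$target" {} \; -print
}
```

This starts one cmp process per file, but cmp can stop at the first differing byte, so no file has to be hashed in full.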
Compute the file's hash with md5sum and save it in a variable, e.g. md5:
md5=$(md5sum file.txt | awk '{print $1}')
Use find to traverse the desired directory tree and check whether any file has the same hash; if so, print the file name:
find . -type f -exec sh -c '[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] \
&& echo "$1"' _ {} "$md5" \;
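find . -type f finds all files in the current directory; change the directory as needed.
The -exec predicate runs the command sh -c ... on every file found.
In sh -c, _ is a placeholder for $0, $1 is the file found, and $2 is $md5.
[ $(md5sum "$1"|awk "{print \$1}") = "$2" ] && echo "$1" prints the file name if the file's hash matches the hash we are checking duplicates against.
Example: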
% md5sum ../foo.txt bar.txt
d41d8cd98f00b204e9800998ecf8427e ../foo.txt
d41d8cd98f00b204e9800998ecf8427e bar.txt
% md5=$(md5sum ../foo.txt | awk '{print $1}')
% find . -type f -exec sh -c '[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] && echo "$1"' _ {} "$md5" \;
./bar.txt
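Note that the hash in this example, d41d8cd98f00b204e9800998ecf8427e, is the MD5 of empty input, so both example files are empty. Every empty file is a duplicate of every other empty file, which can make the result list long and not very useful. As a small sketch (the names empty_md5 and is_empty_hash are made up for illustration), the reference hash could be checked before searching:

```shell
# MD5 of zero bytes of input; this is the hash shown in the example.
empty_md5=$(md5sum < /dev/null | cut -d ' ' -f 1)

# is_empty_hash HASH: succeed if HASH is the MD5 of an empty file.
is_empty_hash() {
    [ "$1" = "$empty_md5" ]
}
```

For instance, is_empty_hash "$md5" && echo "reference file is empty" >&2 could be run before the find command.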
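You can use md5sum's -c option; it only takes a little manipulation of its input stream. The following command is not recursive and only works in the current working directory. Replace original_file with the name of the file you want to check for duplicates.
(hash=$(md5sum original_file) ; for f in ./* ; do echo "${hash%% *} ${f}" | if md5sum -c --status 2>/dev/null ; then echo "$f is a duplicate" ; fi ; done)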
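You can replace the for f in ./* part with for f in /directory/path/* to search a different directory.
To search recursively, the globstar shell option can be enabled so that ./** also matches files in subdirectories:
(shopt -s globstar; hash=$(md5sum original_file); for f in ./** ; do echo "${hash%% *} ${f}" | if md5sum -c --status 2>/dev/null; then echo "$f is a duplicate"; fi; done)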
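Both versions of the command print the names of duplicate files in the form ./file is a duplicate. Both are wrapped in parentheses to avoid setting the hash variable or the globstar shell option outside of the command itself. The command can use other hash algorithms, for example by replacing both occurrences of md5sum with sha256sum.
Save the script below as ~/bin/find-dups or even /usr/local/bin/find-dups, then make it executable with chmod +x. It requires Ruby to be installed, but has no other dependencies.
#!/usr/bin/env ruby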
require 'digest/md5'
require 'fileutils'
require 'optparse'

def glob_from_argument(arg)
  if File.directory?(arg)
    arg + '/**/*'
  elsif File.file?(arg)
    arg
  else # it's already a glob
    arg
  end
end

# Wrap text at 80 chars. (configurable)
def wrap_text(*args)
  width = args.last.is_a?(Integer) ? args.pop : 80
  words = args.flatten.join(' ').split(' ')
  if words.any? { |word| word.size > width }
    raise NotImplementedError, 'cannot deal with long words'
  end

  lines = []
  line = []
  until words.empty?
    word = words.first
    if line.size + line.map(&:size).inject(0, :+) + word.size > width
      lines << line.join(' ')
      line = []
    else
      line << words.shift
    end
  end
  lines << line.join(' ') unless line.empty?
  lines.join("\n")
end

ALLOWED_PRINT_OPTIONS = %w(hay needle separator)

def parse_options(args)
  options = {}
  options[:print] = %w(hay needle)

  opt_parser = OptionParser.new do |opts|
    opts.banner = "Usage: #{$0} [options] HAYSTACK NEEDLES"
    opts.separator ''
    opts.separator 'Search for duplicate files (needles) in a directory (the haystack).'
    opts.separator ''
    opts.separator 'HAYSTACK should be the directory (or one file) that you want to search in.'
    opts.separator ''
    opts.separator wrap_text(
      'NEEDLES are the files you want to search for.',
      'A NEEDLE can be a file or a directory,',
      'in which case it will be recursively scanned.',
      'Note that NEEDLES may overlap the HAYSTACK.')
    opts.separator ''

    opts.on("-p", "--print PROPERTIES", Array,
            "When a match is found, print needle, or",
            "hay, or both. PROPERTIES is a comma-",
            "separated list with one or more of the",
            "words 'needle', 'hay', or 'separator'.",
            "'separator' prints an empty line.",
            "Default: 'needle,hay'") do |list|
      options[:print] = list
    end

    opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
      options[:verbose] = v
    end

    opts.on_tail("-h", "--help", "Show this message") do
      puts opts
      exit
    end
  end
  opt_parser.parse!(args)

  options[:haystack] = ARGV.shift
  options[:needles] = ARGV.shift(ARGV.size)

  raise ArgumentError, "Missing HAYSTACK" if options[:haystack].nil?
  raise ArgumentError, "Missing NEEDLES" if options[:needles].empty?
  unless options[:print].all? { |option| ALLOWED_PRINT_OPTIONS.include? option }
    raise ArgumentError, "Allowed print options are 'needle', 'hay', 'separator'"
  end

  options
rescue OptionParser::InvalidOption, ArgumentError => error
  puts error, nil, opt_parser.banner
  exit 1
end

options = parse_options(ARGV)

VERBOSE = options[:verbose]
PRINT_HAY = options[:print].include? 'hay'
PRINT_NEEDLE = options[:print].include? 'needle'
PRINT_SEPARATOR = options[:print].include? 'separator'
HAYSTACK_GLOB = glob_from_argument options[:haystack]
NEEDLES_GLOB = options[:needles].map { |arg| glob_from_argument(arg) }

def info(*strings)
  return unless VERBOSE
  STDERR.puts strings
end

def info_with_ellips(string)
  return unless VERBOSE
  STDERR.print string + '... '
end

def all_files(*globs)
  globs
    .map { |glob| Dir.glob(glob) }
    .flatten
    .map { |filename| File.expand_path(filename) } # normalize filenames
    .uniq
    .sort
    .select { |filename| File.file?(filename) }
end

def index_haystack(glob)
  all_files(glob).group_by { |filename| File.size(filename) }
end

@checksums = {}
def checksum(filename)
  @checksums[filename] ||= calculate_checksum(filename)
end

def calculate_checksum(filename)
  Digest::MD5.hexdigest(File.read(filename))
end

def find_needle(needle, haystack)
  straws = haystack[File.size(needle)] || return

  checksum_needle = calculate_checksum(needle)
  straws.detect do |straw|
    straw != needle &&
      checksum(straw) == checksum_needle &&
      FileUtils.identical?(needle, straw)
  end
end

BOLD = "\033[1m"
NORMAL = "\033[22m"

def print_found(needle, found)
  if PRINT_NEEDLE
    print BOLD if $stdout.tty?
    puts needle
    print NORMAL if $stdout.tty?
  end
  puts found if PRINT_HAY
  puts if PRINT_SEPARATOR
end

info "Searching #{HAYSTACK_GLOB} for files equal to #{NEEDLES_GLOB}."

info_with_ellips 'Indexing haystack by file size'
haystack = index_haystack(HAYSTACK_GLOB)
haystack[0] = nil # ignore empty files
info "#{haystack.size} files"

info 'Comparing...'
all_files(*NEEDLES_GLOB).each do |needle|
  info " examining #{needle}"
  found = find_needle(needle, haystack)
  print_found(needle, found) unless found.nil?
end