Finding duplicate files and removing them.

83

I am writing a Python program to find and remove duplicate files from a folder.

I have multiple copies of mp3 files, and some other files. I am using the sha1 algorithm.

How can I find the duplicate files and remove them?


1
For anyone interested, I wrote a program with a GUI that compares files by file size and by binary chunks: https://github.com/nav9/duplicateFileFinder - Nav
Just want to say the algorithm should rely on one of: file attributes, a hash of the file contents, or image similarity. Depending on which, the program will differ. - Rajesh Swarnkar
10 Answers

125

Fastest algorithm - a 100x performance increase compared to the accepted answer (really :))

The approaches in the other solutions are very cool, but they ignore an important property of duplicate files - they have the same file size. Computing the expensive hash only on files with the same size saves a tremendous amount of CPU; the performance comparison at the end shows why.

Iterating on the solid answer given by @nosklo, and borrowing @Raffi's idea of a fast hash of just the beginning of each file, computing the full hash only on collisions of the fast one, these are the steps:

  1. Build a hash table of the files, where the file size is the key.
  2. For files with the same size, create a hash table with the hash of their first 1024 bytes; non-colliding elements are unique.
  3. For files with the same hash on the first 1k bytes, compute the hash on the full contents - files with matching ones are NOT unique.

The code:

#!/usr/bin/env python3
from collections import defaultdict
import hashlib
import os
import sys


def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk


def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
    hashobj = hash()
    file_object = open(filename, 'rb')

    if first_chunk_only:
        hashobj.update(file_object.read(1024))
    else:
        for chunk in chunk_reader(file_object):
            hashobj.update(chunk)
    hashed = hashobj.digest()

    file_object.close()
    return hashed


def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
    hashes_on_1k = defaultdict(list)  # dict of (hash1k, size_in_bytes): [full_path_to_file1, full_path_to_file2, ]
    hashes_full = {}   # dict of full_file_hash: full_path_to_file_string

    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            # get all files that have the same size - they are the collision candidates
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will 
                    # dereference it - change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except (OSError,):
                    # not accessible (permissions, etc) - pass on
                    continue


    # For all files with the same file size, get their hash on the 1st 1024 bytes only
    for size_in_bytes, files in hashes_by_size.items():
        if len(files) < 2:
            continue    # this file size is unique, no need to spend CPU cycles on it

        for filename in files:
            try:
                small_hash = get_hash(filename, first_chunk_only=True)
                # the key is the hash on the first 1024 bytes plus the size - to
                # avoid collisions on equal hashes in the first part of the file
                # credits to @Futal for the optimization
                hashes_on_1k[(small_hash, size_in_bytes)].append(filename)
            except (OSError,):
                # the file might have become inaccessible by the time execution got here
                continue

    # For all files with the hash on the 1st 1024 bytes, get their hash on the full file - collisions will be duplicates
    for __, files_list in hashes_on_1k.items():
        if len(files_list) < 2:
            continue    # this hash of first 1k file bytes is unique, no need to spend cpu cycles on it

        for filename in files_list:
            try: 
                full_hash = get_hash(filename, first_chunk_only=False)
                duplicate = hashes_full.get(full_hash)
                if duplicate:
                    print("Duplicate found: {} and {}".format(filename, duplicate))
                else:
                    hashes_full[full_hash] = filename
            except (OSError,):
                # the file might have become inaccessible by the time execution got here
                continue


if __name__ == "__main__":
    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")

Now the fun part - the performance comparison.

The baseline -

  • a directory with 1047 files, 32 mp4 and 1015 jpg, 5445.998 MiB total - i.e. my phone camera's auto-upload directory :)
  • a small (but fully functional) processor - 1600 BogoMIPS, 1.2 GHz, 32 KB L1 + 256 KB L2 cache, /proc/cpuinfo:

Processor : Feroceon 88FR131 rev 1 (v5l) BogoMIPS : 1599.07

(i.e. my low-end NAS :), running Python 2.7.11.

So, the output of @nosklo's very handy solution:

root@NAS:InstantUpload# time ~/scripts/checkDuplicates.py 
Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg
Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg

real    5m44.198s
user    4m44.550s
sys     0m33.530s

And here is the version with the file-size filter, then the small hash, and finally the full hash only if collisions are found:

root@NAS:InstantUpload# time ~/scripts/checkDuplicatesSmallHash.py . "/i-data/51608399/photo/Todor phone"
Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg
Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg

real    0m1.398s
user    0m1.200s
sys     0m0.080s

Each version was run 3 times, to get the average of the time needed.

So v1 is (user+sys) 284 seconds, the other one - 2 seconds. Quite a difference, huh :)

With this gain, one can go to SHA-512, or even fancier - the performance penalty is offset by the calculations that are no longer needed.
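
Since check_for_duplicates already takes the hash constructor as a parameter, switching algorithms is a one-line change at the call site, e.g.:

check_for_duplicates(sys.argv[1:], hash=hashlib.sha512)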

Downsides:

  • More disk access than the other versions - every file is stat-ed once for its size (that's cheap, but still disk IO), and every duplicate is opened twice (for the small first-1k-bytes hash, and for the full-contents hash)
  • More memory consumed, since the hash tables are kept in memory at runtime

9
I've updated this to a Python 3 script with some small code optimizations: https://gist.github.com/tfeldmann/fc875e6630d11f2256e746f67a09c1ae - tfeldmann
3
Because the dictionaries hashes_on_1k and hashes_full compare hashes of files with different sizes, they increase the risk of false positives. The key should be the tuple (size, hash) instead of the hash alone. If there are millions of files, the risk is not negligible. - Futal
3
@Futal Good suggestion! I have updated https://gist.github.com/tfeldmann/fc875e6630d11f2256e746f67a09c1ae accordingly. - tfeldmann
3
Thanks for the updated code. The title of the post is "Finding duplicate files and removing them". - prismspecs
2
@YoYoYo, I see your point - indeed, with a large collection of relatively big files where the collision rate on byte size is relatively high, this would help. A version of that approach could probably be implemented with a recursive function, going to deeper levels (reading further chunks of the file) only as needed. Overall, for an approach like the one you suggest, the downside would be runtime memory usage - keeping file : chunk count : hash pointers for all objects. Still, in the "regular" use case the issue you describe can be mitigated with a larger 'chunk' value in the reader; and most file formats start with meta information in the header, so the chance of an early difference is high. - Todor Minakov
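
A rough sketch of that progressive idea (my interpretation, not code from the answer): starting from a group of same-size files, split by the hash of the next chunk, and read further only for groups that still collide:

import hashlib
from collections import defaultdict

def refine(same_size_group, offset=0, chunk_size=1024 * 1024):
    """Yield lists of files with identical contents, reading chunk by chunk.

    Only groups that still collide at the current depth are read further,
    so unique files are dismissed as early as possible.
    """
    buckets = defaultdict(list)
    end_of_file = True
    for path in same_size_group:
        with open(path, 'rb') as f:
            f.seek(offset)
            chunk = f.read(chunk_size)
        if chunk:
            end_of_file = False
        buckets[hashlib.sha1(chunk).digest()].append(path)
    for bucket in buckets.values():
        if len(bucket) < 2:
            continue                 # unique at this depth - dismissed
        if end_of_file:
            yield bucket             # whole file compared: real duplicates
        else:
            for dupes in refine(bucket, offset + chunk_size, chunk_size):
                yield dupes

Grouping by size first (as in the answer above) still applies; refine would be called once per size group.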

47

Recursive folders version:

This version uses the file size and a hash of the contents to find duplicates. You can pass it multiple paths; it will scan all paths recursively and report all duplicates found.

import sys
import os
import hashlib

def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes = {}
    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                hashobj = hash()
                for chunk in chunk_reader(open(full_path, 'rb')):
                    hashobj.update(chunk)
                file_id = (hashobj.digest(), os.path.getsize(full_path))
                duplicate = hashes.get(file_id, None)
                if duplicate:
                    print "Duplicate found: %s and %s" % (full_path, duplicate)
                else:
                    hashes[file_id] = full_path

if sys.argv[1:]:
    check_for_duplicates(sys.argv[1:])
else:
    print "Please pass the paths to check as parameters to the script"

Excuse me, I'm new to Python - how do I pass the path parameters? - X-Black...
1
@X-Black... Pass them as command line arguments. E.g.: open a cmd prompt, navigate to the folder and type: python myscript.py c:\path1 c:\path2 - nosklo
1
@Michael Use a try/except block; try: for chunk..... except OSError: print('Skipping file') (see the sketch below these comments) - nosklo
Could you explain the purpose of the chunk_reader function? Are you trying to avoid loading large files into memory? - Alex
@Alex Yes, I have huge files; loading any single one of them would easily exceed my available memory. - nosklo
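
A minimal, self-contained sketch of the try/except approach nosklo describes above (safe_hash is a hypothetical helper, not part of the answer):

import hashlib

def safe_hash(path, hash=hashlib.sha1, chunk_size=1024):
    """Hash a file, returning None instead of raising if it can't be read."""
    try:
        hashobj = hash()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                hashobj.update(chunk)
        return hashobj.digest()
    except (OSError, IOError):
        print('Skipping file: %s' % path)
        return None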

25
import md5
import os

def remove_duplicates(dir):
    unique = []
    for filename in os.listdir(dir):
        # note: paths are not joined with dir, so this assumes dir is
        # the current working directory
        if os.path.isfile(filename):
            filehash = md5.md5(file(filename).read()).hexdigest()
            if filehash not in unique: 
                unique.append(filehash)
            else: 
                os.remove(filename)

// Edit:

For MP3 files you may also be interested in this topic: How to detect duplicate MP3 files with different bitrates and/or different ID3 tags?


For performance you should change unique to a set (though it probably won't be a big factor unless there are lots of small files). Also, your code will fail if there is a directory inside dir - check os.path.isfile() before processing them. - Brian
Yes, this code is more of a starting point. I added the isfile check as you suggested. - zalew
2
Warning: I don't know why, but your MD5 code generated the same hash for many files that are NOT duplicates... When I replaced it with hashlib.md5(open(filename, 'rb').read()).hexdigest() it worked correctly. - Basj
I'm wondering why we use a hash for this task. Could you explain? - João Víctor Melo

6

Faster algorithm

In case many files of 'big size' need to be analyzed (images, mp3s, pdf documents, etc.), it may be more interesting/faster to use the following comparison algorithm:

  1. A first fast hash is performed on the first N bytes of the file (say 1KB). This hash can tell for sure whether files are different, but cannot tell whether two files are exactly the same (the accuracy of the hash is limited by the amount of data read from disk).

  2. A second, slower but more accurate hash, computed over the whole file contents, is performed if a collision occurs in the first stage.

Here is an implementation of this algorithm:

import hashlib
import os
import stat
import types

def Checksum(current_file_name, check_type = 'sha512', first_block = False):
  """Computes the hash for the given file. If first_block is True,
  only the first block of size size_block is hashed."""
  size_block = 1024 * 1024 # the first N bytes (here 1 MiB)

  d = {'sha1' : hashlib.sha1, 'md5': hashlib.md5, 'sha512': hashlib.sha512}

  if(not d.has_key(check_type)):
    raise Exception("Unknown checksum method")

  file_size = os.stat(current_file_name)[stat.ST_SIZE]
  with file(current_file_name, 'rb') as f:
    key = d[check_type].__call__()
    while True:
      s = f.read(size_block)
      key.update(s)
      file_size -= size_block
      if(len(s) < size_block or first_block):
        break
  return key.hexdigest().upper()

def find_duplicates(files):
  """Find duplicates among a set of files.
  The implementation uses two types of hashes:
  - A small and fast one on the first block of the file (the first size_block bytes), 
  - and in case of collision a complete hash on the file. The complete hash 
  is not computed twice.
  It prints the files that seem to have the same content 
  (according to the hash method) at the end.
  """

  print 'Analyzing', len(files), 'files'

  # this dictionary will receive small hashes
  d = {}
  # this dictionary will receive full hashes. It is filled
  # only in case of collision on the small hash (contains at least two 
  # elements)
  duplicates = {}

  for f in files:

    # small hash to be fast
    check = Checksum(f, first_block = True, check_type = 'sha1')

    if(not d.has_key(check)):
      # d[check] is a list of files that have the same small hash
      d[check] = [(f, None)]
    else:
      l = d[check]
      l.append((f, None))

      for index, (ff, checkfull) in enumerate(l):

        if(checkfull is None):
          # computes the full hash in case of collision
          checkfull = Checksum(ff, first_block = False)
          l[index] = (ff, checkfull)

          # for each new full hash computed, check if there is 
          # a collision in the duplicates dictionary. 
          if(not duplicates.has_key(checkfull)):
            duplicates[checkfull] = [ff]
          else:
            duplicates[checkfull].append(ff)

  # prints the detected duplicates
  if(len(duplicates) != 0):
    print
    print "The following files have the same sha512 hash"

    for h, lf in duplicates.items():
      if(len(lf)==1):
        continue
      print 'Hash value', h
      for f in lf:
        print '\t', f.encode('unicode_escape') if \
          type(f) is types.UnicodeType else f
  return duplicates

The find_duplicates function takes a list of files as a parameter, so it is also possible to compare two directories (for instance, to better synchronize their contents). Below is an example of a function that creates a list of files with the specified extensions, avoiding descending into certain directories:

def getFiles(_path, extensions = ['.png'], 
             subdirs = False, avoid_directories = None):
  """Returns the list of files in the path :'_path', 
     of extension in 'extensions'. 'subdir' indicates if 
     the search should also be performed in the subdirectories. 
     If extensions = [] or None, all files are returned.
     avoid_directories: if set, do not parse subdirectories that 
     match any element of avoid_directories."""

  l = []
  extensions = [p.lower() for p in extensions] if not extensions is None \
    else None
  for root, dirs, files in os.walk(_path, topdown=True):

    for name in files:
      if(extensions is None or len(extensions) == 0 or \
         os.path.splitext(name)[1].lower() in extensions):
        l.append(os.path.join(root, name))

    if(not subdirs):
      while(len(dirs) > 0):
        dirs.pop()
    elif(not avoid_directories is None):
      for d in avoid_directories:
        if(d in dirs): dirs.remove(d)

  return l    

This is handy, for instance, for not parsing .svn paths, which would otherwise trigger colliding files in find_duplicates.
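
For example, combining the two helpers above (the paths here are hypothetical):

# collect jpg/png files recursively, skipping .svn directories,
# then report files with identical content
files = getFiles('/home/user/photos', extensions = ['.jpg', '.png'],
                 subdirs = True, avoid_directories = ['.svn'])
find_duplicates(files)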

Feedback is welcome.


6
I wrote one in Python some time ago - you're welcome to use it.
import sys
import os
import hashlib

check_path = (lambda filepath, hashes, p = sys.stdout.write:
        (lambda hash = hashlib.sha1 (file (filepath).read ()).hexdigest ():
                ((hash in hashes) and (p ('DUPLICATE FILE\n'
                                          '   %s\n'
                                          'of %s\n' % (filepath, hashes[hash])))
                 or hashes.setdefault (hash, filepath)))())

scan = (lambda dirpath, hashes = {}: 
                map (lambda (root, dirs, files):
                        map (lambda filename: check_path (os.path.join (root, filename), hashes), files), os.walk (dirpath)))

((len (sys.argv) > 1) and scan (sys.argv[1]))

3
I couldn't figure out what's going on there. If you get a chance, could you explain what is happening? - tgray
I don't understand what's going on here. - quantum231

5

@IanLee1521 has a nice solution here. It is very efficient because it first checks for duplicates based on the file size.

#! /usr/bin/env python

# Originally taken from:
# http://www.pythoncentral.io/finding-duplicate-files-with-python/
# Original Author: Andres Torres

# Adapted to only compute the md5sum of files with the same size

import argparse
import os
import sys
import hashlib


def find_duplicates(folders):
    """
    Takes in an iterable of folders and prints & returns the duplicate files
    """
    dup_size = {}
    for i in folders:
        # Iterate the folders given
        if os.path.exists(i):
            # Find the duplicated files and append them to dup_size
            join_dicts(dup_size, find_duplicate_size(i))
        else:
            print('%s is not a valid path, please verify' % i)
            return {}

    print('Comparing files with the same size...')
    dups = {}
    for dup_list in dup_size.values():
        if len(dup_list) > 1:
            join_dicts(dups, find_duplicate_hash(dup_list))
    print_results(dups)
    return dups


def find_duplicate_size(parent_dir):
    # Dups in format {hash:[names]}
    dups = {}
    for dirName, subdirs, fileList in os.walk(parent_dir):
        print('Scanning %s...' % dirName)
        for filename in fileList:
            # Get the path to the file
            path = os.path.join(dirName, filename)
            # Check to make sure the path is valid.
            if not os.path.exists(path):
                continue
            # Calculate sizes
            file_size = os.path.getsize(path)
            # Add or append the file path
            if file_size in dups:
                dups[file_size].append(path)
            else:
                dups[file_size] = [path]
    return dups


def find_duplicate_hash(file_list):
    print('Comparing: ')
    for filename in file_list:
        print('    {}'.format(filename))
    dups = {}
    for path in file_list:
        file_hash = hashfile(path)
        if file_hash in dups:
            dups[file_hash].append(path)
        else:
            dups[file_hash] = [path]
    return dups


# Joins two dictionaries
def join_dicts(dict1, dict2):
    for key in dict2.keys():
        if key in dict1:
            dict1[key] = dict1[key] + dict2[key]
        else:
            dict1[key] = dict2[key]


def hashfile(path, blocksize=65536):
    afile = open(path, 'rb')
    hasher = hashlib.md5()
    buf = afile.read(blocksize)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(blocksize)
    afile.close()
    return hasher.hexdigest()


def print_results(dict1):
    results = list(filter(lambda x: len(x) > 1, dict1.values()))
    if len(results) > 0:
        print('Duplicates Found:')
        print(
            'The following files are identical. The name could differ, but the'
            ' content is identical'
            )
        print('___________________')
        for result in results:
            for subresult in result:
                print('\t\t%s' % subresult)
            print('___________________')

    else:
        print('No duplicate files found.')


def main():
    parser = argparse.ArgumentParser(description='Find duplicate files')
    parser.add_argument(
        'folders', metavar='dir', type=str, nargs='+',
        help='A directory to parse for duplicates',
        )
    args = parser.parse_args()

    find_duplicates(args.folders)


if __name__ == '__main__':
    sys.exit(main())

3
    import hashlib
    import os
    import sys
    from sets import Set

    def read_chunk(fobj, chunk_size = 2048):
        """ Files can be huge so read them in chunks of bytes. """
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk

    def remove_duplicates(dir, hashfun = hashlib.sha512):
        unique = Set()
        for filename in os.listdir(dir):
            filepath = os.path.join(dir, filename)
            if os.path.isfile(filepath):
                hashobj = hashfun()
                for chunk in read_chunk(open(filepath,'rb')):
                    hashobj.update(chunk)
                    # the size of the hashobj is constant
                    # print "hashfun: ", hashfun.__sizeof__()
                hashfile = hashobj.hexdigest()
                if hashfile not in unique:
                    unique.add(hashfile)
                else: 
                    os.remove(filepath)

    try:
        hashfun = hashlib.sha256
        remove_duplicates(sys.argv[1], hashfun)
    except IndexError:
        print """Please pass a path to a directory with 
        duplicate files as a parameter to the script."""

How do I set the directory to my own directory...? - X-Black...

2

Python has a standard library called filecmp for comparing files and directories.

It checks the file size. It checks the contents in 8k chunks. It works on binary files.

It does not do hashing.

The Python docs for filecmp
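
filecmp itself only compares two files (or directories) at a time, so a duplicate finder still has to drive it; a minimal sketch (my own, not from the answer), which only byte-compares files that already have the same size:

import filecmp
import os
from collections import defaultdict

def find_duplicates(paths):
    """Group files by size, then confirm equal contents with filecmp.cmp."""
    by_size = defaultdict(list)
    for path in paths:
        for dirpath, _, filenames in os.walk(path):
            for name in filenames:
                full = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(full)].append(full)
                except OSError:
                    continue  # unreadable or vanished file - skip it
    for candidates in by_size.values():
        for i, a in enumerate(candidates):
            for b in candidates[i + 1:]:
                # shallow=False forces an actual content comparison
                if filecmp.cmp(a, b, shallow=False):
                    print("Duplicate found: {} and {}".format(a, b))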


filecmp works great for checking a small number of files. But if you need to handle more of them it can get quite slow: checking n equally-sized files for duplicates with filecmp requires opening and (partially) reading each file up to n-1 times, whereas hashing needs only one pass per file. - Timus

0

I found code that works 100% for recursively removing duplicate files inside a folder. Just replace the folder name in the clean method with your folder name.

import time
import os
import shutil
from hashlib import sha256


class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self) -> None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')
        return None

    def generate_hash(self, Filename: str) -> str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock) > 0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
                Filehash = Filehash.hexdigest()
            return Filehash
        except:
            return False

    def clean(self) -> None:
        all_dirs = [path[0] for path in os.walk('E:\\songs')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if not filehash in self.File_hashes:
                    if filehash:
                        self.File_hashes.append(filehash)
                        # print(file)
                else:
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    filename = file.split('/')[-1]
                    print(filename, '.. cleaned ')
            os.chdir(self.home_dir)

    def cleaning_summary(self) -> None:
        mb_saved = self.Total_bytes_saved / 1048576
        mb_saved = round(mb_saved, 2)
        print('\n\n--------------FINISHED CLEANING ------------')
        print('File cleaned  : ', self.count_cleaned)
        print('Total Space saved : ', mb_saved, 'MB')
        print('-----------------------------------------------')

    def main(self) -> None:
        self.welcome()
        self.clean()
        self.cleaning_summary()


#
# if __name__ == '__main__':
#     App = Duplython()
#     App.main()


def dedupe_bing_images():
    App = Duplython()
    App.main()
    return True


dedupe_bing_images()

0
To be safe (automatically removing files can be dangerous if something goes wrong!), here is what I use, based on @zalew's answer.
Please also note that the md5 checksum code is slightly different from @zalew's, because his code generated too many wrong duplicate files (that's why I said automatically removing them is dangerous!).
import hashlib, os
unique = dict()
for filename in os.listdir('.'):
    if os.path.isfile(filename):
        filehash = hashlib.md5(open(filename, 'rb').read()).hexdigest()

        if filehash not in unique: 
            unique[filehash] = filename
        else:
            print(filename + ' is a duplicate of ' + unique[filehash])
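
If you do want to remove the duplicates rather than just list them, a cautious variant (my sketch, not part of the answer) could prompt before each deletion:

import hashlib, os

unique = dict()
for filename in os.listdir('.'):
    if os.path.isfile(filename):
        filehash = hashlib.md5(open(filename, 'rb').read()).hexdigest()
        if filehash not in unique:
            unique[filehash] = filename
        else:
            # ask before removing - safer than deleting automatically
            answer = input(filename + ' duplicates ' + unique[filehash] + ' - delete? [y/N] ')
            if answer.lower() == 'y':
                os.remove(filename)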
