How can I quickly create a hash to check whether two files are identical?
Security is not very important.
Edit: I'm sending a file over a network connection and want to make sure the file is the same on both sides.
PS: It relies on Apache Commons Codec to generate the hashes efficiently.
Example usage: new DuplicateFileFinder("MD5").findDuplicateFilesList(filesList);
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.codec.digest.DigestUtils;

/**
 * Finds duplicate files using MD5/SHA-1 hashing; hashing is only applied to
 * files that have the same size.
 *
 * @author HemantSingh
 */
public class DuplicateFileFinder {

    private final HashProvider hashProvider;
    // Kept only for logging purposes.
    private final String hashingAlgo;

    public DuplicateFileFinder(String hashingAlgo) {
        this.hashingAlgo = hashingAlgo;
        if ("SHA1".equalsIgnoreCase(hashingAlgo)) {
            hashProvider = new Sha1HashProvider();
        } else if ("MD5".equalsIgnoreCase(hashingAlgo)) {
            hashProvider = new Md5HashProvider();
        } else {
            throw new RuntimeException("Unsupported hashing algorithm: " + hashingAlgo
                    + ". Please use either SHA1 or MD5.");
        }
    }

    /**
     * Returns the list of duplicate file references.
     *
     * @param files
     *            list of all the files we need to check for duplicates.
     * @return a list of duplicate groups; e.g. if a file a.JPG has 3 copies,
     *         one element of the result is a list holding those three File
     *         references.
     */
    public List<List<File>> findDuplicateFilesList(List<File> files) {
        // First group the file references by file size; only files of equal size can be duplicates.
        Map<Long, List<File>> fileSizeMap = new HashMap<Long, List<File>>();
        Set<Long> potDuplicateFileSizes = new LinkedHashSet<Long>();
        for (File file : files) {
            Long fileLength = Long.valueOf(file.length());
            List<File> filesOfSameLength = fileSizeMap.get(fileLength);
            if (filesOfSameLength == null) {
                filesOfSameLength = new ArrayList<File>();
                fileSizeMap.put(fileLength, filesOfSameLength);
            } else {
                // A second file with this size makes the size group a potential duplicate set.
                potDuplicateFileSizes.add(fileLength);
            }
            filesOfSameLength.add(file);
        }

        // If we don't have any potential duplicates then skip further processing.
        if (potDuplicateFileSizes.isEmpty()) {
            return null;
        }
        System.out.println(potDuplicateFileSizes.size() + " file-size groups will go through the "
                + hashingAlgo + " hash check to verify whether they contain duplicates.");

        // Now scan the potential duplicates and eliminate false positives using the hash check.
        List<List<File>> finalListOfDuplicates = new ArrayList<List<File>>();
        for (Long fileSize : potDuplicateFileSizes) {
            List<File> potDupFiles = fileSizeMap.get(fileSize);
            Map<String, List<File>> trueDuplicateFiles = new HashMap<String, List<File>>();
            for (File file : potDupFiles) {
                try {
                    String hashHex = hashProvider.getHashHex(file);
                    List<File> listOfDuplicatesOfAFile = trueDuplicateFiles.get(hashHex);
                    if (listOfDuplicatesOfAFile == null) {
                        listOfDuplicatesOfAFile = new ArrayList<File>();
                        trueDuplicateFiles.put(hashHex, listOfDuplicatesOfAFile);
                    }
                    listOfDuplicatesOfAFile.add(file);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            Collection<List<File>> dupsOfSameSizeList = trueDuplicateFiles.values();
            for (List<File> list : dupsOfSameSizeList) {
                // It is a duplicate only if we have more than one copy of it.
                if (list.size() > 1) {
                    finalListOfDuplicates.add(list);
                    System.out.println("Duplicate sets found: " + finalListOfDuplicates.size());
                }
            }
        }
        return finalListOfDuplicates;
    }

    abstract class HashProvider {
        abstract String getHashHex(File file) throws IOException;
    }

    class Md5HashProvider extends HashProvider {
        String getHashHex(File file) throws IOException {
            // Close the stream to avoid leaking file handles.
            try (InputStream in = new FileInputStream(file)) {
                return DigestUtils.md5Hex(in);
            }
        }
    }

    class Sha1HashProvider extends HashProvider {
        String getHashHex(File file) throws IOException {
            try (InputStream in = new FileInputStream(file)) {
                return DigestUtils.sha1Hex(in);
            }
        }
    }
}
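A minimal usage sketch to go with the example call above; the folder path and the flat directory listing are illustrative, not part of the original answer:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class DuplicateFileFinderDemo {
    public static void main(String[] args) {
        // Collect the files to check; "/path/to/folder" is a placeholder.
        List<File> filesList = new ArrayList<File>();
        File[] entries = new File("/path/to/folder").listFiles();
        if (entries != null) {
            for (File f : entries) {
                if (f.isFile()) {
                    filesList.add(f);
                }
            }
        }
        List<List<File>> duplicates = new DuplicateFileFinder("MD5").findDuplicateFilesList(filesList);
        if (duplicates != null) {
            for (List<File> group : duplicates) {
                System.out.println("Duplicate group: " + group);
            }
        }
    }
}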
You could take a look at the algorithm the Samba/Rsync developers use. I haven't looked at it in depth, but I see it mentioned all the time. Apparently it's quite good.
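For reference, rsync pairs a cheap rolling (Adler-style) weak checksum with a strong hash (MD4/MD5) that is only computed when the weak one matches. The class below is only a sketch of the rolling part in Java; the names and structure are mine, not rsync's actual code:

/**
 * Toy version of the Adler-style rolling (weak) checksum idea used by rsync.
 * The window can be slid one byte at a time without re-reading the whole block.
 */
public class RollingChecksum {
    private static final int MOD = 1 << 16;
    private int a;            // plain sum of the bytes in the window
    private int b;            // position-weighted sum of the bytes in the window
    private final int window; // window length in bytes

    public RollingChecksum(byte[] data, int offset, int window) {
        this.window = window;
        for (int i = 0; i < window; i++) {
            int x = data[offset + i] & 0xFF;
            a = (a + x) % MOD;
            b = (b + (window - i) * x) % MOD;
        }
    }

    /** Slides the window one byte: drops outByte on the left, adds inByte on the right. */
    public void roll(byte outByte, byte inByte) {
        int out = outByte & 0xFF;
        int in = inByte & 0xFF;
        a = ((a - out + in) % MOD + MOD) % MOD;
        b = ((b - window * out + a) % MOD + MOD) % MOD;
    }

    /** 32-bit weak checksum combining both running sums. */
    public int value() {
        return (b << 16) | a;
    }
}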
You can do all of it in a single pass; why sort at all if nothing needs sorting until the list of duplicates has been found:
nice find . -type f -print0 \
\
| xargs -0 -P 8 xxh128sum --tag | pvZ -i 0.5 -l -cN in0 \
\
| mawk2 'BEGIN {
_=(FS="(^XXH(32|64|128)[ ][(]|[)][ ]["(OFS="=")"][ ])")<""
} __[$(NF=NF)]--<-_ ' | pvZ -i 0.5 -l -cN out9 \
\
| LC_ALL=C gsort -f -t= -k 3,3n -k 2,2 | gcat -n | lgp3 3
In awk, the same associative-array logic used to locate unique items will find the duplicates; it's just the other side of the same coin. Only sort once the list is already small.
On one of my own folders, which is full of duplicates, I got output like this:
24131 =./songChunk_93634773_0011_.8045v202108091628512738.ts=56075211016703871f208f88839e2acc
24132 =./songChunk_93634773_0011_.8045v202108091628512772.ts=56075211016703871f208f88839e2acc
24133 =./songChunk_93634773_0011_.8045v202108091628512806.ts=56075211016703871f208f88839e2acc
24134 =./songChunk_93634773_0011_.8045v202108091628512839.ts=56075211016703871f208f88839e2acc
24135 =./songChunk_93666774_0043_.7102v202108091628512485.ts=77643645774287386a02e83808a632ed
24136 =./songChunk_93666774_0043_.7102v202108091628536916.ts=77643645774287386a02e83808a632ed
24137 =./songChunk_92647129_0023_.8045v202108091628536907.ts=146289716096910587a15001b5d2e9d6
24138 =./songChunk_92647129_0023_.8045v202108091628536946.ts=146289716096910587a15001b5d2e9d6
That said, the main limitation of this lazy approach of mine is that the first file it sees with a given hash is the one it keeps, so if you care about timestamps, naming and all the rest, then yes, you will have to make a parallel call to stat
to get all the exact timestamps, inode numbers and every other piece of TMI it offers.
——4Chan Teller
I remember the old modem transfer protocols, like Zmodem, which did some kind of CRC comparison on each block as it was sent. CRC32, if I remember ancient history well enough. I'm not suggesting you write your own transfer protocol, unless that's exactly what you're doing, but you could have it spot-check blocks of the file periodically; perhaps hashing each 8k block would be simple enough for the processors to handle. I haven't tried it myself.
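The per-block idea is easy to try with the JDK's built-in java.util.zip.CRC32. The sketch below (the class/method names and the 8 KiB block size are just illustrative) computes one CRC32 per block, so both sides can compare the resulting lists:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

public class ChunkCrc {
    /** Returns one CRC32 value per 8 KiB block of the file, in file order. */
    static List<Long> crc32PerBlock(String path) throws IOException {
        List<Long> crcs = new ArrayList<Long>();
        byte[] buf = new byte[8 * 1024];
        try (InputStream in = new FileInputStream(path)) {
            int filled;
            while ((filled = readBlock(in, buf)) > 0) {
                CRC32 crc = new CRC32();
                crc.update(buf, 0, filled); // checksum only the bytes actually read
                crcs.add(crc.getValue());
            }
        }
        return crcs;
    }

    /** Fills buf as far as possible; returns the number of bytes read (0 at end of stream). */
    private static int readBlock(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return total;
    }
}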
The most human-friendly way to compare two identical files is to compare their hashes. Based on my test results, the "fnv1a64" algorithm is the best fit for file-hashing needs without putting load on the processor; it needs no extra application and is supported by almost every server software version.
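If your platform does not expose fnv1a64 directly, the algorithm is small enough to write by hand. Below is a sketch in Java using the published 64-bit FNV offset basis and prime, streaming the file in buffered chunks (the class and method names are mine):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Fnv1a64 {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long PRIME = 0x100000001b3L;

    /** Computes the FNV-1a 64-bit hash of a file, returned as a 16-digit hex string. */
    static String hashFile(String path) throws IOException {
        long hash = OFFSET_BASIS;
        byte[] buf = new byte[8192];
        try (InputStream in = new FileInputStream(path)) {
            int read;
            while ((read = in.read(buf)) != -1) {
                for (int i = 0; i < read; i++) {
                    // FNV-1a: XOR in the next byte, then multiply by the FNV prime.
                    hash ^= (buf[i] & 0xFF);
                    hash *= PRIME;
                }
            }
        }
        return String.format("%016x", hash);
    }
}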
Total number of files to scan: 877958
Total execution time: 22353.953384876
Execution time | Algorithm |
---|---|
287.33039116859 | xxh64 |
287.48988413811 | murmur3f |
288.99125409126 | xxh128 |
289.31353402138 | xxh3 |
289.95356488228 | murmur3a |
291.07013106346 | crc32 |
292.12370014191 | murmur3c |
294.61375999451 | xxh32 |
294.80482602119 | crc32c |
295.07662701607 | adler32 |
300.95663881302 | crc32b |
302.49028491974 | joaat |
303.52309799194 | fnv132 |
305.78797912598 | md4 |
306.00277495384 | fnv164 |
307.34593200684 | fnv1a32 |
308.21020579338 | tiger192,3 |
308.65456604958 | md5 |
311.2399160862 | tiger160,4 |
311.49055314064 | sha1 |
311.90242314339 | fnv1a64 |
312.08589076996 | tiger160,3 |
314.24259185791 | tiger128,3 |
315.16137599945 | tiger192,4 |
315.28597712517 | haval224,3 |
316.56637001038 | haval160,3 |
317.15466785431 | haval128,3 |
319.63687586784 | haval192,3 |
320.57202386856 | sha512 |
320.92411613464 | tiger128,4 |
321.84568786621 | haval256,3 |
322.48667311668 | sha384 |
327.83715701103 | haval192,4 |
328.23257398605 | haval160,4 |
329.07334899902 | haval224,4 |
330.29799699783 | haval256,4 |
331.03335189819 | sha3-256 |
332.09470105171 | ripemd256 |
332.61167287827 | haval128,4 |