What is the best way to recursively find files with the same name that are actually different, using Bash?

Question

linux bash unix

4

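I have about 15,000 images in a nested directory structure, and their file names are SKUs. I need to make sure that no two files share the same SKU but are actually different.

For example, if I have two or more files named MYSKU.jpg, I need to make sure that none of them differs from the others.

What is the best way to do this with a bash command?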

- bruchowski

I really don't understand how this post got so many upvotes without any evidence that the author tried to solve the problem. - Pankrates

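@Pankrates I was asking for the "best" or most accepted approach, and I was hoping for a one-liner. In any case, I only asked because I couldn't find another Stack Overflow question that answered it well. - bruchowski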

3 Answers

1

The idea is to scan all files in the directory and check which ones have the same name but different content, based on their MD5 checksums.

#!/bin/bash

# directory to scan
scan_dir=$1

[ ! -d "$1" ] && echo "Usage $0 <scan dir>" && exit 1

# Associative array to save hash table
declare -A HASH_TABLE
# Associative array of full path of items
declare -A FULL_PATH


for item in $( find $scan_dir -type f ) ; do
    file=$(basename $item)
    md5=$(md5sum $item | cut -f1 -d\ )
    if [ -z "${HASH_TABLE[$file]}" ] ; then
        HASH_TABLE[$file]=$md5
        FULL_PATH[$file]=$item
    else
        if [ "${HASH_TABLE[$file]}" != "$md5" ] ; then
            echo "differ $item from ${FULL_PATH[$file]}"
        fi
    fi
done

Usage (assuming you named the script file scan_dir.sh):

$ ./scan_dir.sh /path/to/your/directory

- Bechir

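+1, but (a) don't use `for` to parse command output (see @Pavel's answer or http://mywiki.wooledge.org/ParsingLs for discussion and alternatives), (b) double-quote all `$scan_dir` and `$item` references, and (c) send the usage message to _stderr_, since you are reporting an _error_. - mklement0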
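A minimal sketch of how the three suggestions in this comment might be applied to the script above. It is not the answerer's code: the function name `find_differing` is hypothetical, and it assumes bash 4+ and GNU `md5sum`.

```shell
#!/usr/bin/env bash
# Sketch: report files that share a basename with an earlier file but differ in MD5.
# find_differing is a hypothetical name; assumes bash 4+ and GNU md5sum.
find_differing() {
    local scan_dir=$1 item file md5
    if [ ! -d "$scan_dir" ]; then
        echo "Usage: find_differing <scan dir>" >&2   # (c) errors go to stderr
        return 1
    fi
    # basename -> MD5 of the first file seen, and basename -> its full path
    local -A hash_table full_path
    # (a) read NUL-delimited paths instead of word-splitting find's output
    while IFS= read -r -d '' item; do
        file=${item##*/}                               # strip the directory part
        md5=$(md5sum < "$item" | cut -d' ' -f1)        # (b) every expansion is quoted
        if [ -z "${hash_table[$file]:-}" ]; then
            hash_table[$file]=$md5
            full_path[$file]=$item
        elif [ "${hash_table[$file]}" != "$md5" ]; then
            echo "differ $item from ${full_path[$file]}"
        fi
    done < <(find "$scan_dir" -type f -print0)
}
```

Calling `find_differing /path/to/your/directory` prints one `differ ...` line for each file whose checksum does not match the first file seen with the same name.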

0

Here is how I would solve it with bash 4:

#!/usr/local/bin/bash -vx

shopt -s globstar # turn on recursive globbing
shopt -s nullglob # hide globs that don't match anything
shopt -s nocaseglob # match globs regardless of capitalization

images=( **/*.{gif,jpeg,jpg,png} ) # all the image files
declare -A homonyms # associative array of like named files

for i in "${!images[@]}"; do # iterate over indices
    base=${images[i]##*/} # file name without path
    homonyms["$base"]+="$i " # Space delimited list of indices for this basename
done

for base in "${!homonyms[@]}"; do # distinct basenames
    unset dupehashes; declare -A dupehashes # temporary var for hashes
    indices=( ${homonyms["$base"]} ) # omit quotes to allow expansion of space-delimited integers
    (( ${#indices[@]} > 1 )) || continue # ignore unique names
    for i in "${indices[@]}"; do
        # md5 is the BSD/macOS command; with GNU coreutils use md5sum and keep only the first field
        dupehashes[$(md5 < "${images[i]}")]+="$i "
    done

    (( ${#dupehashes[@]} > 1 )) || continue # ignore if same hash
    echo
    printf 'The following files have different hashes:\n'
    for h in "${!dupehashes[@]}"; do
        for i in ${dupehashes[$h]}; do # omit quotes to expand space-delimited integer list
            printf '%s %s\n' "$h" "${images[i]}"
        done
    done
done

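I know the above looks like a lot, but with 15k images I think you really want to avoid open()ing and checksumming files unnecessarily, so this approach narrows the data set down to files with duplicate names and hashes only those. As others have said, you could make it faster still by checking file sizes before hashing, but I'll leave that part undone.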
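The size pre-check is left undone in this answer. A minimal sketch of one way it could look, assuming a hypothetical helper named `sizes_differ` that is called on the files sharing a basename before any hashing: files of different sizes cannot be identical, so a mismatch settles the question without computing a checksum.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the answer above.
# Returns 0 if the given files do NOT all have the same byte size,
# 1 if all sizes match (hashing is still needed), 2 if a file cannot be read.
sizes_differ() {
    local first size f
    first=$(wc -c < "$1") || return 2      # wc -c is portable across GNU and BSD
    shift
    for f in "$@"; do
        size=$(wc -c < "$f") || return 2
        # Arithmetic comparison tolerates the padding some wc versions print.
        (( size == first )) || return 0
    done
    return 1
}
```

In a script like the one above, this could run on each group of same-named files, falling back to the checksum only when `sizes_differ` returns 1.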

- kojiro


- Pavel · Accepted Answer

I don't want to solve the whole task for you, but here are some useful ingredients you can try out and put together:

find /path -type f   # gives you a list of all files in /path

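You can iterate over the list like this (reading NUL-delimited names so that paths containing spaces are not split):

find /path -type f -name '*.jpg' -print0 | while IFS= read -r -d '' f; do
  ...
done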

Now you can think about what needs to be collected inside the loop. I suggest:

base=$(basename "$f")
full_path=$f
hash=$(md5sum < "$f" | awk '{print $1}')   # hash the file's contents, not its name

Now you can store this information in a file as three columns, so that each line contains everything needed to find duplicate files.
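As a rough sketch, the three-column list could be produced as follows. The function name `build_list` and the column order (hash, basename, full path, with the basename in the second column as the later sort commands assume) are assumptions, as is the use of GNU `md5sum`. It also assumes that file names contain no whitespace, which seems plausible for SKUs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one "hash<TAB>basename<TAB>path" line per .jpg file.
build_list() {
    local dir=$1 f
    find "$dir" -type f -name '*.jpg' -print0 |
    while IFS= read -r -d '' f; do
        # Column 1: MD5 of the contents, column 2: basename, column 3: full path
        printf '%s\t%s\t%s\n' "$(md5sum < "$f" | awk '{print $1}')" "${f##*/}" "$f"
    done
}
```

For example, `build_list /path > list.txt` would write the list to a file for the later steps.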

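Since you haven't explained what to do with the duplicates, this is only a suggestion for how to find them. What you do with them is then up to you.

Given the list obtained above, you can store two copies of it: one sorted by basename, and another sorted by basename with duplicates removed: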

sort -k2,2    list.txt | column -t > list.sorted.txt
sort -k2,2 -u list.txt | column -t > list.sorted.uniq.txt

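This assumes the basename is in the second column. The key is written as -k2,2 so that only that column is compared; a plain -k2 would extend the key to the end of the line.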

Now run

diff list.sorted.txt list.sorted.uniq.txt

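to see the files that share a name. From each line you can then extract the MD5 checksum to verify whether the files really differ, and the full path to act on them, e.g. with mv, rm, ln, and so on.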
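The final check, comparing checksums among files that share a name, could be sketched with awk as below. The function name `report_conflicts` is hypothetical, and the input is assumed to be whitespace-separated lines with the hash in the first column and the basename in the second.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "hash basename path" lines on stdin and print
# each basename that occurs with more than one distinct hash.
report_conflicts() {
    awk '
        { name = $2 }
        !(name in first)  { first[name] = $1; next }   # remember first hash per basename
        $1 != first[name] { conflict[name] = 1 }       # a later hash differs
        END { for (n in conflict) print n }
    ' | sort
}
```

For example, `report_conflicts < list.txt` would list only the names whose copies are not all identical; an empty result means every same-named file has the same checksum.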