How does Git compute file hashes?

Question

How does Git compute file hashes?

142

The SHA1 hashes stored in tree objects (as returned by git ls-tree) do not match the SHA1 hashes of the file content (as returned by sha1sum):

$ git cat-file blob 4716ca912495c805b94a88ef6dc3fb4aff46bf3c | sha1sum
de20247992af0f949ae8df4fa9a37e4a03d7063e  -

How does Git compute file hashes? Does it compress the content before computing the hash?

- netvope

13

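See Assigning Git SHA1's without Git. - sanmai

1

For more details, see http://progit.org/book/ch9-2.html. - netvope

5

netvope's link now appears to be dead. I believe this is the new location: http://git-scm.com/book/en/Git-Internals-Git-Objects, which is §9.2 of http://git-scm.com/book. - Rhubbarb

Related: What is the file format of a git commit object? - kenorb

6 Answers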

37

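I am only expanding on the answer by @Leif Gruenwoldt and spelling out what is in the reference that @Leif Gruenwoldt provided.

Do it yourself...

Step 1. Create an empty text document (the name does not matter) in your repository

Step 2. Stage and commit the document

Step 3. Identify the hash of the blob by executing git ls-tree HEAD

Step 4. Find that the blob's hash is e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Step 5. Snap out of your surprise and read on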

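How Git computes a blob's hash

    Blob Hash (SHA1) = SHA1("blob " + <size_of_file> + "\0" + <contents_of_file>)

The text "blob " (with a trailing space) is a constant prefix, and "\0" is also constant: it is the NUL character. <size_of_file> is the length of the content in bytes, written as a decimal integer; it and <contents_of_file> vary from file to file.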
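As a worked example of the formula above, here is a minimal Python sketch. The function name blob_hash is only illustrative. For an empty file the hashed input is just the header "blob 0\0", which reproduces the value from Step 4.

```python
import hashlib


def blob_hash(contents: bytes) -> str:
    """SHA-1 of the header "blob <size>\\0" followed by the raw contents."""
    header = b"blob " + str(len(contents)).encode("ascii") + b"\0"
    return hashlib.sha1(header + contents).hexdigest()


# An empty file is hashed as SHA1(b"blob 0\0").
print(blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```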

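See also: What is the file format of a git commit object?

And that's all there is to it!

But wait! Did you notice that <filename> is not a parameter of the hash computation? Two files with identical contents have the same hash, regardless of their names or of the date and time they were created. This is one of the reasons Git handles moves and renames better than other version control systems.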

Do it yourself (extended)

Step 6. Create another empty file with a different filename in the same directory

Step 7. Compare the hashes of your two files.
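Steps 6 and 7 can also be tried without a repository. The sketch below writes the same (empty) contents under two different names in a temporary directory and hashes both; the helper names and file names are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def blob_hash_of_file(path: Path) -> str:
    """Hash a file the way Git hashes a blob: only size and contents go in."""
    data = path.read_bytes()
    header = f"blob {len(data)}\0".encode("ascii")
    return hashlib.sha1(header + data).hexdigest()


def hashes_for_names(names, contents=b""):
    """Write the same contents under each name and return {name: hash}."""
    with tempfile.TemporaryDirectory() as tmp:
        result = {}
        for name in names:
            path = Path(tmp) / name
            path.write_bytes(contents)
            result[name] = blob_hash_of_file(path)
        return result


hashes = hashes_for_names(["first.txt", "second.txt"])
print(hashes["first.txt"] == hashes["second.txt"])  # True
```

The file name never reaches the hash function, so both entries carry the same value.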

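Note:

The link does not mention how a tree object is hashed. I am not sure of the algorithm and parameters, but from my observation it probably computes the hash from all the blobs and trees it contains (probably from their hashes).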
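For what it is worth, the tree format described in the Git Internals chapter linked in the question's comments is consistent with that observation. The sketch below is not taken from this answer; it assumes a tree body is the concatenation of "<mode> <name>\0" plus the 20-byte binary hash of each entry, sorted by name, with the header "tree <size>\0" in front. The function name and the tuple layout are illustrative.

```python
import hashlib


def tree_hash(entries):
    """entries: iterable of (mode, name, hex_sha), e.g. ("100644", "a.txt", "e69de2...")."""
    def sort_key(entry):
        mode, name, _ = entry
        # Assumption: Git orders subtrees (mode 40000) as if their name ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b"".join(
        f"{mode} {name}".encode() + b"\0" + bytes.fromhex(sha)
        for mode, name, sha in sorted(entries, key=sort_key)
    )
    header = f"tree {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


# With no entries the input is just b"tree 0\0", the well-known empty tree.
print(tree_hash([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Under this layout the tree hash depends on the names, modes and hashes of its entries, so renaming a file changes the tree but not the blob.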

- Lordbalmon

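SHA1("blob" + <size_of_file>) - is there an additional space character between "blob" and the size? Is the size decimal? Is it zero-prefixed? - osgx

1

@osgx There is. The reference and my own testing both confirm it. I have corrected the answer. The size appears to be a whole number of bytes, with no prefix. - Samuel Harmer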

18

git hash-object

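This is a quick way to verify your test method:

s='abc'
# '%s' keeps printf from interpreting % or \ sequences inside $s
printf '%s' "$s" | git hash-object --stdin
printf 'blob %d\0%s' "$(printf '%s' "$s" | wc -c)" "$s" | sha1sum

Output: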

f2ba8f84ab5c1bce84a7b441cb1959cfc7093b7f
f2ba8f84ab5c1bce84a7b441cb1959cfc7093b7f  -

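where sha1sum is part of GNU Coreutils.

The question then comes down to understanding the format of each object type. We have already covered the trivial blob; here are the others: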

- Ciro Santilli OurBigBook.com

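As mentioned in a previous answer, the length should rather be calculated as $(printf "\0$s" | wc -c). Note the added null character. That is, if the string is 'abc', with a null character prepended the length would be 4, not 3. The result from sha1sum then matches the result of git hash-object. - Michael Ekoka

You're right, they do match. It seems that using printf rather than echo -e has a slightly pernicious side effect. When you apply git hash-object to a file containing the string "abc", you get 8baef1b...f903, which is also what you get when using echo -e rather than printf. Since echo -e appends a newline to the string, it seems you can do the same to match that behavior with printf (i.e. s="$s\n"). - Michael Ekoka

1

Upvoted for using printf rather than echo -e. - go2null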
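To make the printf versus echo difference concrete, here is a small Python sketch (the helper name is illustrative). It hashes "abc" with and without a trailing newline; in both cases the size field counts only the content bytes, 3 and 4 respectively, and the NUL separator is not included in it.

```python
import hashlib


def blob_hash(data: bytes) -> str:
    """SHA-1 over b"blob <len>\\0" + data, where <len> is len(data) in decimal."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


no_newline = blob_hash(b"abc")      # what printf 'abc' feeds to git hash-object
with_newline = blob_hash(b"abc\n")  # what echo 'abc', or a file ending in a newline, feeds

print(no_newline)    # f2ba8f84ab5c1bce84a7b441cb1959cfc7093b7f
print(with_newline)  # the 8baef1b...f903 value mentioned in the comment above
```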

4

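I needed this for some unit tests in Python 3, so I thought I would leave it here.

import hashlib

def git_blob_hash(data):
    # Accept str for convenience; Git itself hashes raw bytes.
    if isinstance(data, str):
        data = data.encode()
    # Header is "blob <size in bytes>\0", followed by the content.
    data = b'blob ' + str(len(data)).encode() + b'\0' + data
    h = hashlib.sha1()
    h.update(data)
    return h.hexdigest()

I stick to \n line endings everywhere, but in some circumstances Git may also change your line endings before calculating this hash, so you may also need a .replace('\r\n', '\n') in there.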
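A minimal sketch of that normalization, assuming Git is configured to convert CRLF to LF on the way in (which depends on the eol/autocrlf settings and is not always the case). The function name is illustrative.

```python
import hashlib


def git_blob_hash_normalized(data: bytes) -> str:
    """Hash data as a blob after converting CRLF to LF; lone CRs are kept."""
    data = data.replace(b"\r\n", b"\n")
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


crlf = git_blob_hash_normalized(b"Hello, World!\r\n")
lf = git_blob_hash_normalized(b"Hello, World!\n")
print(crlf == lf)  # True
print(lf)          # 8ab686eafeb1f44702738c8b0f24f2567c36da6d
```

Without the replace call the two inputs differ by one byte, so both the size field and the digest would differ.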

- Samuel Harmer

3

Based on Leif Gruenwoldt's answer, here is a shell function that can stand in for git hash-object:

git-hash-object () { # substitute when the `git` command is not available
    local type=blob
    [ "$1" = "-t" ] && shift && type=$1 && shift
    # depending on eol/autocrlf settings, you may want to substitute CRLFs by LFs
    # by using `perl -pe 's/\r$//g'` instead of `cat` in the next 2 commands
    # tr strips the padding that some wc implementations put around the count
    local size=$(cat "$1" | wc -c | tr -d ' ')
    # printf emits the NUL byte portably, unlike `echo -en`
    ( printf '%s %s\0' "$type" "$size"; cat "$1" ) | sha1sum | sed 's/ .*$//'
}

Test:

$ echo 'Hello, World!' > test.txt
$ git hash-object test.txt
8ab686eafeb1f44702738c8b0f24f2567c36da6d
$ git-hash-object test.txt
8ab686eafeb1f44702738c8b0f24f2567c36da6d

- Lucas Cimon

0

Here is a Python 3 version that hashes binary content (the examples above are for text).

For readability the code is wrapped in a def; the function name is only illustrative. Also note that this is a snippet for inspiration, not a complete script: targetFile, sha and fullFile are supplied by the caller.

import hashlib
import os

import requests


def download_if_changed(targetFile: str, sha: str, fullFile: str) -> None:
    """Download fullFile to targetFile unless the local copy already has Git blob hash sha."""
    if os.path.exists(targetFile):
        with open(targetFile, 'rb') as existing:
            content = existing.read()
        # Git hashes the header "blob <size>\0" followed by the raw bytes.
        header = f"blob {len(content)}\0".encode('utf-8')
        if hashlib.sha1(header + content).hexdigest() == sha:
            return  # the local copy is already up to date
    # Missing or different: fetch the file and overwrite the local copy.
    with requests.get(fullFile) as response2:
        with open(targetFile, 'wb') as newfile:
            newfile.write(response2.content)
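For large binary files it may be preferable not to read everything into memory. The following is a sketch under that assumption, not part of the answer above: it takes the size from the filesystem, feeds the header to SHA-1 first, then streams the file in chunks. The function name and chunk size are arbitrary choices.

```python
import hashlib
import os
import tempfile


def git_blob_hash_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the Git blob hash of a file without loading it all at once."""
    h = hashlib.sha1()
    # The header needs the total size up front, so ask the filesystem for it.
    h.update(f"blob {os.path.getsize(path)}\0".encode("ascii"))
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Demo on a throwaway file holding the 14 bytes "Hello, World!\n".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Hello, World!\n")
demo_hash = git_blob_hash_file(tmp.name)
os.remove(tmp.name)
print(demo_hash)  # 8ab686eafeb1f44702738c8b0f24f2567c36da6d
```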

- user2410689


- Leif Gruenwoldt · Accepted Answer

141

Git prefixes the object with "blob ", followed by the length (as a human-readable integer), followed by a NUL character.

$ echo -e 'blob 14\0Hello, World!' | shasum
8ab686eafeb1f44702738c8b0f24f2567c36da6d

Source: http://alblue.bandlem.com/2011/08/git-tip-of-week-objects.html
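The same bytes can be spelled out in Python, which makes it easier to see where the 14 comes from: echo appends a newline to the 13 characters of the greeting. This is only an illustrative sketch; the variable names are not from the answer.

```python
import hashlib

content = b"Hello, World!\n"  # the trailing newline is what echo adds
header = b"blob " + str(len(content)).encode("ascii") + b"\0"

print(header)                                      # b'blob 14\x00'
print(hashlib.sha1(header + content).hexdigest())  # 8ab686eafeb1f44702738c8b0f24f2567c36da6d
```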

- Leif Gruenwoldt

3

It is worth mentioning that it replaces "\r\n" with "\n", but leaves isolated "\r" characters alone. - user420667

9

Depending on the eol and autocrlf settings, Git sometimes performs the replacement above, but not always. - user420667

11

You can also compare this with the output of echo 'Hello, World!' | git hash-object --stdin. Optionally, you can specify --no-filters to make sure no crlf conversion happens, or --path=somethi.ng to let Git use the filters specified via gitattributes (also @user420667). And -w actually writes the blob to .git/objects (if you are in a Git repository). - Tobias Kienzler

1

To express the equivalence in a way that makes sense: echo -e 'blob 16\0Hello, \r\nWorld!' | shasum == echo -e 'Hello, \r\nWorld!' | git hash-object --stdin --no-filters, and the same holds with \n and 15. - Peter Krauss

1

echo appends a newline to the output, which is also passed into git. That's why it's 14 characters. To use echo without a newline, write echo -n 'Hello, World!' - Bouke Versteegh
