I have a lot of source files to add to a git repository, how can I do it quickly?

4

I was looking here for inspiration on quickly importing a large number of files into a git repository, but I'm not sure it applies.

The basic situation is that I have over 100 million files to commit to a git repository. I have broken them up into directories roughly 5 levels deep. Doing a git add path/2/3 a few levels down takes about 5 minutes, and then there is still the commit and the publish. At this rate it could take months to commit all of these files.

Please don't get hung up on why I'm storing them in git, whether they're really source files, whether there's a better solution, and so on. I just want to know how much data git can handle and whether it can deal with this many files in a more optimal way.

By the way, these are all configuration files or CSV-like data files; some are very large, but most are small.

If I try to commit the whole folder, or even just one large chunk of it, it might take an hour to get everything committed. But the publish could take several hours, and I've already tried it: typically the internet connection drops and you have to start over. So I don't think that's a viable solution.

What I'd like to know is whether there is a way to load everything straight into git, bypassing all the work git does at commit time, then create one commit, and then publish it rsync-style, so it is robust and a dropped connection isn't a problem. Then it would behave like a normal upload.


1
Other than adding fewer files at a time, I don't know how to make git add run any faster. - Tim Biegeleisen
1
The article you linked mentions that 10k commits can take several hours, so you could either go with faster hardware (CPU and disk) or load the files onto a RAMDISK first. Assuming this is *nix, you could also try setting the permissions to 777. - doublesharp
2 Answers

2

A Git database can store any number of files (well, technically they are blob objects) without much in the way of hard limits.¹ There are, however, some soft limits.

I have two rather large repositories at hand, FreeBSD and Linux, which weigh in at 5.7 and 6.7 million objects respectively. That is well short of 100 million files: the Linux repository is only about 1/15th of that size, and many of those objects are commits and trees rather than files.
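(As an aside, and my addition rather than part of the answer: you can get object counts like these for a local clone with, e.g.:

git count-objects -v                      # loose and in-pack object counts
git rev-list --objects --all | wc -l      # every reachable object; slow on big repos
)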

Note that putting 100 million files into one commit is rather different from putting 100 million files into 100 million commits, each of which stores a single file. The former requires building an index listing 100 million files, which is several gigabytes of index file and likely to be slow, but then stores 100 million blob objects, one tree object per directory, and one commit object. The latter builds a tiny index (1 file), makes one commit using a tree holding a single blob, and then repeats that 100 million times: the index never gets large, but the repository ends up storing 300 million objects: 100 million commits, each with 1 tree and 1 blob.

It's not entirely obvious where all the time goes. git add <path> needs to:

  • compress the file's contents and create a new blob object, or re-use an existing blob object if the resulting hash ID is that of an existing object; then
  • update the index so that the staging-slot-zero entry for the appropriate file name appears in the correct position.

The index is kept sorted, so this update can be very fast (if the new file goes at the end of the index, appending a hundred or so bytes suffices) or very slow: inserting at the front is O(n²), where n is the number of entries already in the index, since they all have to be shifted down. In practice, Git reads the index into memory, does the work there, and then writes the index back out, so once the index exceeds some size threshold (which will vary with the OS and the type/speed of the underlying storage medium) it can become very slow.
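(As a quick sanity check, and my addition rather than something the answer prescribes, you can watch the index grow as you add batches of files:

ls -lh .git/index          # on-disk size of the index file
git ls-files | wc -l       # number of entries currently in the index
)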

You may also need quite a bit of disk space between repackings of the objects. Modern Git runs git gc --auto after each commit, but in earlier Git, up until 2.17.0 where this was fixed, git commit accidentally failed to do so. Given your situation you probably want to disable automatic git gc anyway and run it at controlled intervals, or, as in the document you linked, use git fast-import to build a pack file without going through the normal Git channels. That avoids the need for an index entirely (until you run git checkout to extract one of those commits, that is). A rough sketch of that route follows below.
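Here is a minimal sketch of the fast-import route (my illustration, not something the answer spells out). The branch name, committer identity, and file contents are made-up placeholders, and the delimited data <<EOT form is used so you do not have to count byte lengths:

# Sketch only: turn off automatic repacking, then stream files straight
# into a pack with git fast-import, bypassing the index entirely.
git config gc.auto 0

git fast-import --date-format=now <<'STREAM'
commit refs/heads/bulk-import
committer Bulk Importer <bulk@example.com> now
data <<MSG
Bulk import of configuration and CSV files
MSG
M 100644 inline conf/app1.conf
data <<EOT
key = value
EOT
M 100644 inline data/sample.csv
data <<EOT
col1,col2
1,2
EOT

STREAM

In a real run you would generate the M/data lines for all of your files with a small script; fast-import writes the objects directly into a new packfile, and you still push the resulting branch in the usual way.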


¹The only real hard limit is that there are only 2^160 possible hash IDs. However, by the time you reach about 1.7 quadrillion objects, you hit a noticeably high probability of a hash collision, on the order of 10^-18, which is also the uncorrected bit error rate quoted by many disk manufacturers.
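(For reference, this arithmetic is mine rather than the footnote's: by the birthday approximation, p ≈ n^2 / (2 · 2^160), so n ≈ sqrt(10^-18 · 2^161) ≈ 1.7 × 10^15, i.e. about 1.7 quadrillion objects for a collision probability around 10^-18.)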


0

git fast-import (used by tools such as git filter-repo) is indeed a good option, and with Git 2.27 (Q2 2020) it is even faster.

"git fast-import"使用的自定义哈希函数已被hashmap.c中的函数所取代,这给性能带来了很好的提升。

See commit d8410a8 (06 Apr 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 6ae3c79, 28 Apr 2020)

fast-import: replace custom hash with hashmap.c

Signed-off-by: Jeff King

We use a custom hash in fast-import to store the set of objects we've imported so far. It has a fixed set of 2^16 buckets and chains any collisions with a linked list.
As the number of objects grows larger than that, the load factor increases and we degrade to O(n) lookups and O(n^2) insertions.

We can scale better by using our hashmap.c implementation, which will resize the bucket count as we grow.
This does incur an extra memory cost of 8 bytes per object, as hashmap stores the integer hash value for each entry in its hashmap_entry struct (which we really don't care about here, because we're just reusing the embedded object hash).
But I think the numbers below justify this (and our per-object memory cost is already much higher).

I also looked at using khash (here, see article), but it seemed to perform slightly worse than hashmap at all sizes, and worse even than the existing code for small sizes.
It's also awkward to use here, because we want to look up a "struct object_entry" from a "struct object_id", and it doesn't handle mismatched keys as well.
Making a mapping of object_id to object_entry would be more natural, but that would require pulling the embedded oid out of the object_entry or incurring an extra 32 bytes per object.

In a synthetic test creating as many cheap, tiny objects as possible

perl -e '
    my $bits = shift;
    my $nr = 2**$bits;

    for (my $i = 0; $i < $nr; $i++) {
        print "blob\n";
        print "data 4\n";
        print pack("N", $i);
    }
' "$bits" | git fast-import

I got these results:

nr_objects   master       khash        hashmap

2^20         0m4.317s     0m5.109s     0m3.890s
2^21         0m10.204s    0m9.702s     0m7.933s
2^22         0m27.159s    0m17.911s    0m16.751s
2^23         1m19.038s    0m35.080s    0m31.963s
2^24         4m18.766s    1m10.233s    1m6.793s

which points to hashmap as the winner.

We didn't have any perf tests for fast-export or fast-import, so I added one as a more real-world case.
It uses an export without blobs since that's significantly cheaper than a full one, but still is an interesting case people might use (e.g., for rewriting history).
It will emphasize this change in some ways (as a percentage we spend more time making objects and less shuffling blob bytes around) and less in others (the total object count is lower).

Here are the results for linux.git:

Test                        HEAD^                 HEAD
----------------------------------------------------------------------------
9300.1: export (no-blobs)   67.64(66.96+0.67)     67.81(67.06+0.75) +0.3%
9300.2: import (no-blobs)   284.04(283.34+0.69)   198.09(196.01+0.92) -30.3%

It only has ~5.2M commits and trees, so this is a larger effect than I expected (the 2^23 case above only improved by 50s or so, but here we gained almost 90s).
This is probably due to actually performing more object lookups in a real import with trees and commits, as opposed to just dumping a bunch of blobs into a pack.


Git 2.37 (Q3 2022) illustrates how to check the size of object_entry.

See commit 14deb58 (28 Jun 2022) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit b59f04f, 13 Jul 2022)

pack-objects.h: remove outdated pahole results

Signed-off-by: Taylor Blau

The size and padding of struct object_entry is an important factor in determining the memory usage of pack-objects.
For this reason, 3b13a5f (pack-objects: reorder members to shrink struct object_entry, 2018-04-14, Git v2.18.0-rc0 -- merge listed in batch #6) added a comment containing some information from pahole indicating the size and padding of that struct.

Unfortunately, this comment hasn't been updated since 9ac3f0e ("pack-objects: fix performance issues on packing large deltas", 2018-07-22, Git v2.19.0-rc1 -- merge), despite the size of this struct changing many times since that commit.

To see just how often the size of object_entry changes, I skimmed the first-parent history with this script:

for sha in $(git rev-list --first-parent --reverse 9ac3f0e..)
do
  echo -n "$sha "
  git checkout -q $sha
  make -s pack-objects.o 2>/dev/null
  pahole -C object_entry pack-objects.o | sed -n \
    -e 's/\/\* size: \([0-9]*\).*/size \1/p' \
    -e 's/\/\*.*padding: \([0-9]*\).*/padding \1/p' | xargs
done | uniq -f1

In between each merge, the size of object_entry changes too often to record every instance here.
But the important merges (along with their corresponding sizes and bit paddings) in chronological order are:

ad635e82d6 (Merge branch 'nd/pack-objects-pack-struct', 2018-05-23) size 80 padding 4
29d9e3e2c4 (Merge branch 'nd/pack-deltify-regression-fix', 2018-08-22) size 80 padding 9
3ebdef2e1b (Merge branch 'jk/pack-delta-reuse-with-bitmap', 2018-09-17) size 80 padding 8
33e4ae9c50 (Merge branch 'bc/sha-256', 2019-01-29) size 96 padding 8

(indicating that the current size of the struct is 96 bytes, with 8 padding bits).

Even though this comment was written in a good spirit, it is updated infrequently enough that it serves to confuse rather than to encourage contributors to update the appropriate values when they modify the definition of object_entry.

For that reason, eliminate the confusion by removing the comment altogether.

