`git fast-import` (used by tools such as `git filter-repo`) is indeed a good option, and with Git 2.27 (Q2 2020) it is even faster.
The custom hash function used by "`git fast-import`" has been replaced with the one from `hashmap.c`, which brings a nice performance boost.
See commit d8410a8 (06 Apr 2020) by Jeff King (`peff`).
(Merged by Junio C Hamano -- `gitster` -- in commit 6ae3c79, 28 Apr 2020)
Signed-off-by: Jeff King
We use a custom hash in fast-import to store the set of objects we've imported so far. It has a fixed set of 2^16 buckets and chains any collisions with a linked list.
As the number of objects grows larger than that, the load factor increases and we degrade to O(n) lookups and O(n^2) insertions.
We can scale better by using our `hashmap.c` implementation, which will resize the bucket count as we grow.
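The degradation described here can be sketched outside of git: with a fixed bucket count, chain length grows as N/buckets, while a table that doubles its buckets past a load-factor threshold keeps chains short. Below is a minimal Python illustration (not git's actual `hashmap.c` code; the bucket counts are shrunk for demonstration):

```python
# Minimal sketch (not git's C code) of chained hashing with a fixed
# bucket count versus one that resizes at a load-factor threshold.

class FixedTable:
    def __init__(self, nr_buckets=16):  # fast-import used 2**16
        self.buckets = [[] for _ in range(nr_buckets)]

    def insert(self, key, value):
        self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def lookup(self, key):
        # With N >> nr_buckets, each chain holds ~N/nr_buckets entries,
        # so this linear scan is effectively O(N).
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None


class ResizingTable(FixedTable):
    def __init__(self, nr_buckets=16, max_load=0.8):
        super().__init__(nr_buckets)
        self.size = 0
        self.max_load = max_load

    def insert(self, key, value):
        super().insert(key, value)
        self.size += 1
        if self.size > self.max_load * len(self.buckets):
            old = self.buckets
            self.buckets = [[] for _ in range(2 * len(old))]
            for chain in old:             # rehash everything; amortized O(1)
                for k, v in chain:
                    super().insert(k, v)


fixed = FixedTable()
grow = ResizingTable()
for i in range(10_000):
    fixed.insert(i, i)
    grow.insert(i, i)

longest_fixed = max(len(b) for b in fixed.buckets)
longest_grow = max(len(b) for b in grow.buckets)
print(longest_fixed, longest_grow)  # 625 1 (with CPython's integer hashing)
```

The resizing table ends up with 16384 buckets for 10,000 keys, so every chain stays tiny, while the fixed table's chains each hold 625 entries.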
This does incur an extra memory cost of 8 bytes per object, as hashmap stores the integer hash value for each entry in its `hashmap_entry` struct (which we really don't care about here, because we're just reusing the embedded object hash).
But I think the numbers below justify this (and our per-object memory cost is already much higher).
I also looked at using khash, but it seemed to perform slightly worse than hashmap at all sizes, and worse even than the existing code for small sizes.
It's also awkward to use here, because we want to look up a `struct object_entry` from a `struct object_id`, and it doesn't handle mismatched keys as well.
Making a mapping of `object_id` to `object_entry` would be more natural, but that would require pulling the embedded `oid` out of the `object_entry` or incurring an extra 32 bytes per object.
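This memory trade-off can be made concrete with `ctypes`. The struct layouts below are hypothetical stand-ins (not git's real `object_entry`), but they show why an external `object_id`-to-`object_entry` map costs an extra key copy per object:

```python
import ctypes

# Hypothetical layouts (not git's actual structs) contrasting the two
# designs: the oid embedded in the entry itself, versus a separate
# map node that must carry its own 32-byte key copy per object.
OID_RAWSZ = 32  # SHA-256 oids; SHA-1 oids would be 20 bytes

class EntryWithEmbeddedOid(ctypes.Structure):
    _fields_ = [
        ("oid", ctypes.c_uint8 * OID_RAWSZ),  # the key lives in the entry
        ("pack_offset", ctypes.c_uint64),     # example payload field
    ]

class ExternalMapNode(ctypes.Structure):
    _fields_ = [
        ("key_oid", ctypes.c_uint8 * OID_RAWSZ),  # duplicated key copy
        ("entry", ctypes.c_void_p),               # pointer to the entry
    ]

# The external mapping pays for the duplicated key on every object:
print(ctypes.sizeof(ExternalMapNode))  # 40 on a 64-bit ABI: 32-byte oid + pointer
```

With millions of imported objects, that duplicated 32-byte key (plus pointer) per node adds up quickly, which is why the embedded-key design with a mismatched-key comparison wins here.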
In a synthetic test creating as many cheap, tiny objects as possible:

```
perl -e '
  my $bits = shift;
  my $nr = 2**$bits;

  for (my $i = 0; $i < $nr; $i++) {
    print "blob\n";
    print "data 4\n";
    print pack("N", $i);
  }
' $bits | git fast-import
```
I got these results:
```
nr_objects   master      khash      hashmap
2^20         0m4.317s    0m5.109s   0m3.890s
2^21         0m10.204s   0m9.702s   0m7.933s
2^22         0m27.159s   0m17.911s  0m16.751s
2^23         1m19.038s   0m35.080s  0m31.963s
2^24         4m18.766s   1m10.233s  1m6.793s
```
which points to hashmap as the winner.
We didn't have any perf tests for fast-export or fast-import, so I added one as a more real-world case.
It uses an export without blobs since that's significantly cheaper than a full one, but still is an interesting case people might use (e.g., for rewriting history).
It will emphasize this change in some ways (as a percentage we spend more time making objects and less shuffling blob bytes around) and less in others (the total object count is lower).
Here are the results for linux.git:
```
Test                        HEAD^                 HEAD
9300.1: export (no-blobs)   67.64(66.96+0.67)     67.81(67.06+0.75)    +0.3%
9300.2: import (no-blobs)   284.04(283.34+0.69)   198.09(196.01+0.92)  -30.3%
```
It only has ~5.2M commits and trees, so this is a larger effect than I expected (the 2^23 case above only improved by 50s or so, but here we gained almost 90s).
This is probably due to actually performing more object lookups in a real import with trees and commits, as opposed to just dumping a bunch of blobs into a pack.
Git 2.37 (Q3 2022) illustrates how to check the size of `object_entry`.
See commit 14deb58 (28 Jun 2022) by Taylor Blau (`ttaylorr`).
(Merged by Junio C Hamano -- `gitster` -- in commit b59f04f, 13 Jul 2022)
Signed-off-by: Taylor Blau
The size and padding of `struct object_entry` is an important factor in determining the memory usage of `pack-objects`.
For this reason, 3b13a5f (`pack-objects`: reorder members to shrink `struct object_entry`, 2018-04-14, Git v2.18.0-rc0 -- merge listed in batch #6) added a comment containing some information from `pahole` indicating the size and padding of that struct.
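The padding `pahole` reports comes from member alignment, and reordering members is what shrinks the struct. Here is a toy sketch with made-up fields (not the real `object_entry`), using Python's `ctypes` to stand in for a C compiler's layout rules:

```python
import ctypes

# Toy structs (made-up fields, not the real object_entry) showing how
# member order changes the padding a tool like pahole would report.
class BadOrder(ctypes.Structure):
    _fields_ = [
        ("type", ctypes.c_uint8),     # 1 byte, then 7 bytes of padding
        ("offset", ctypes.c_uint64),  # 8-byte alignment forces the hole
        ("flag", ctypes.c_uint8),     # 1 byte + 7 bytes of tail padding
    ]

class GoodOrder(ctypes.Structure):
    _fields_ = [
        ("offset", ctypes.c_uint64),  # largest member first
        ("type", ctypes.c_uint8),     # small members share one 8-byte slot
        ("flag", ctypes.c_uint8),
    ]

print(ctypes.sizeof(BadOrder), ctypes.sizeof(GoodOrder))  # 24 vs 16 on x86-64
```

Grouping the small members together is the same trick 3b13a5f applied to `object_entry`; `pahole -C object_entry pack-objects.o` shows the resulting holes and tail padding directly.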
Unfortunately, this comment hasn't been updated since 9ac3f0e ("`pack-objects`: fix performance issues on packing large deltas", 2018-07-22, Git v2.19.0-rc1 -- merge), despite the size of this struct changing many times since that commit.
To see just how often the size of `object_entry` changes, I skimmed the first-parent history with this script:
```
for sha in $(git rev-list --first-parent --reverse 9ac3f0e..)
do
  echo -n "$sha "
  git checkout -q $sha
  make -s pack-objects.o 2>/dev/null
  pahole -C object_entry pack-objects.o | sed -n \
    -e 's/\/\* size: \([0-9]*\).*/size \1/p' \
    -e 's/\/\*.*padding: \([0-9]*\).*/padding \1/p' | xargs
done | uniq -f1
```
In between each merge, the size of `object_entry` changes too often to record every instance here.
But the important merges (along with their corresponding sizes and bit paddings) in chronological order are:
```
ad635e82d6 (Merge branch 'nd/pack-objects-pack-struct', 2018-05-23)     size 80  padding 4
29d9e3e2c4 (Merge branch 'nd/pack-deltify-regression-fix', 2018-08-22)  size 80  padding 9
3ebdef2e1b (Merge branch 'jk/pack-delta-reuse-with-bitmap', 2018-09-17) size 80  padding 8
33e4ae9c50 (Merge branch 'bc/sha-256', 2019-01-29)                      size 96  padding 8
```
(indicating that the current size of the struct is 96 bytes, with 8 padding bits).
Even though this comment was written in a good spirit, it is updated infrequently enough that it serves to confuse rather than to encourage contributors to update the appropriate values when they modify the definition of `object_entry`.
For that reason, eliminate the confusion by removing the comment altogether.
"`git add` runs faster." - Tim Biegeleisen