Difference between git gc --aggressive and git repack

122
I'm looking for ways to reduce the size of a git repository. Searching mostly points me at the git gc --aggressive command. But I have also read that this is not the preferred approach. Why? What should I be aware of if I run gc --aggressive?
git repack -a -d --depth=250 --window=250 is recommended over gc --aggressive. Why? How does repack reduce a repository's size? Also, I'm not quite clear on the --depth and --window flags.
Which should I choose between gc and repack? When should gc be used, and when repack?
5 Answers

101
Nowadays there is no difference: git gc --aggressive operates according to the advice Linus gave in 2007; see below. As of version 2.11 (Q4 2016), git defaults to a depth of 50. A window of 250 is fine, because it scans a larger section of each object, but a depth of 250 is bad, because it makes every chain refer to very deep old objects, which slows down all future git operations for marginally lower disk usage.
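The depth and window used by --aggressive map to the documented configuration keys gc.aggressiveDepth and gc.aggressiveWindow, so you can override them per repository rather than relying on the built-in defaults. A minimal sketch in a throwaway repository (the values shown are illustrative, not recommendations):

```shell
# Demo in a throwaway repository, so this sketch is safe to run anywhere.
repo=$(mktemp -d)
git init -q "$repo"
# gc.aggressiveDepth / gc.aggressiveWindow are documented git config keys;
# "git gc --aggressive" reads them instead of its built-in defaults.
git -C "$repo" config gc.aggressiveDepth 50
git -C "$repo" config gc.aggressiveWindow 250
git -C "$repo" config --get gc.aggressiveDepth   # prints 50
```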

Historical background

Linus recommended (see the full mailing-list post below) using git gc --aggressive only when you have, in effect, "a really bad pack" or "really horribly bad deltas"; however, "almost always, in other cases, it's actually a really bad thing to do." The result may even leave your repository in worse condition than when you started!

For doing it properly after importing "a long and involved history," he recommends:

git repack -a -d -f --depth=250 --window=250

But this assumes you have already removed unwanted gunk from your repository history and have followed the checklist for shrinking a repository found in the git filter-branch documentation.

git-filter-branch can be used to get rid of a subset of files, usually with some combination of --index-filter and --subdirectory-filter. People expect the resulting repository to be smaller than the original, but you need a few more steps to actually make it smaller, because Git tries hard not to lose your objects until you tell it to. First make sure that:

  • You really removed all variants of a filename, if a blob was moved over its lifetime. git log --name-only --follow --all -- filename can help you find renames.

  • You really filtered all refs: use --tag-name-filter cat -- --all when calling git filter-branch.

Then there are two ways to get a smaller repository. A safer way is to clone, that keeps your original intact.

  • Clone it with git clone file:///path/to/repo. The clone will not have the removed objects. See git-clone. (Note that cloning with a plain path just hardlinks everything!)

If you really don’t want to clone it, for whatever reasons, check the following points instead (in this order). This is a very destructive approach, so make a backup or go back to cloning it. You have been warned.

  • Remove the original refs backed up by git-filter-branch: say

    git for-each-ref --format="%(refname)" refs/original/ |
      xargs -n 1 git update-ref -d
    
  • Expire all reflogs with git reflog expire --expire=now --all.

  • Garbage collect all unreferenced objects with git gc --prune=now (or if your git gc is not new enough to support arguments to --prune, use git repack -ad; git prune instead).
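Taken together, the destructive path above boils down to three commands. Here is a sketch that demonstrates them end-to-end in a throwaway repository (on a real, already-filtered and backed-up repository you would run only the three numbered commands):

```shell
# Demo of the destructive cleanup in a throwaway repository.
repo=$(mktemp -d); cd "$repo"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
  commit -q --allow-empty -m init
# Simulate the backup ref that git filter-branch leaves behind:
git update-ref refs/original/refs/heads/demo HEAD

# 1. Drop the refs/original/* backups left by git filter-branch.
git for-each-ref --format="%(refname)" refs/original/ |
  xargs -n 1 git update-ref -d
# 2. Expire every reflog entry, so old objects become unreachable.
git reflog expire --expire=now --all
# 3. Delete the now-unreachable objects.
git gc --prune=now --quiet
```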


Date: Wed, 5 Dec 2007 22:09:12 -0800 (PST)
From: Linus Torvalds <torvalds at linux-foundation dot org>
To: Daniel Berlin <dberlin at dberlin dot org>
cc: David Miller <davem at davemloft dot net>,
    ismail at pardus dot org dot tr,
    gcc at gcc dot gnu dot org,
    git at vger dot kernel dot org
Subject: Re: Git and GCC
In-Reply-To: <4aca3dc20712052111o730f6fb6h7a329ee811a70f28@mail.gmail.com>
Message-ID: <alpine.LFD.0.9999.0712052132450.13796@woody.linux-foundation.org>
References: <4aca3dc20712051947t5fbbb383ua1727c652eb25d7e@mail.gmail.com>
            <20071205.202047.58135920.davem@davemloft.net>
            <4aca3dc20712052032n521c344cla07a5df1f2c26cb8@mail.gmail.com>
            <20071205.204848.227521641.davem@davemloft.net>
            <4aca3dc20712052111o730f6fb6h7a329ee811a70f28@mail.gmail.com>

On Thu, 6 Dec 2007, Daniel Berlin wrote:

Actually, it turns out that git-gc --aggressive does this dumb thing to pack files sometimes regardless of whether you converted from an SVN repo or not.

Absolutely. git --aggressive is mostly dumb. It’s really only useful for the case of “I know I have a really bad pack, and I want to throw away all the bad packing decisions I have done.”

To explain this, it’s worth explaining (you are probably aware of it, but let me go through the basics anyway) how git delta-chains work, and how they are so different from most other systems.

In other SCMs, a delta-chain is generally fixed. It might be “forwards” or “backwards,” and it might evolve a bit as you work with the repository, but generally it’s a chain of changes to a single file represented as some kind of single SCM entity. In CVS, it’s obviously the *,v file, and a lot of other systems do rather similar things.

Git also does delta-chains, but it does them a lot more “loosely.” There is no fixed entity. Deltas are generated against any random other version that git deems to be a good delta candidate (with various fairly successful heuristics), and there are absolutely no hard grouping rules.

This is generally a very good thing. It’s good for various conceptual reasons (i.e., git internally never really even needs to care about the whole revision chain — it doesn’t really think in terms of deltas at all), but it’s also great because getting rid of the inflexible delta rules means that git doesn’t have any problems at all with merging two files together, for example — there simply are no arbitrary *,v “revision files” that have some hidden meaning.

It also means that the choice of deltas is a much more open-ended question. If you limit the delta chain to just one file, you really don’t have a lot of choices on what to do about deltas, but in git, it really can be a totally different issue.

And this is where the really badly named --aggressive comes in. While git generally tries to re-use delta information (because it’s a good idea, and it doesn’t waste CPU time re-finding all the good deltas we found earlier), sometimes you want to say “let’s start all over, with a blank slate, and ignore all the previous delta information, and try to generate a new set of deltas.”

So --aggressive is not really about being aggressive, but about wasting CPU time re-doing a decision we already did earlier!

Sometimes that is a good thing. Some import tools in particular could generate really horribly bad deltas. Anything that uses git fast-import, for example, likely doesn’t have much of a great delta layout, so it might be worth saying “I want to start from a clean slate.”

But almost always, in other cases, it’s actually a really bad thing to do. It’s going to waste CPU time, and especially if you had actually done a good job at deltaing earlier, the end result isn’t going to re-use all those good deltas you already found, so you’ll actually end up with a much worse end result too!

I’ll send a patch to Junio to just remove the git gc --aggressive documentation. It can be useful, but it generally is useful only when you really understand at a very deep level what it’s doing, and that documentation doesn’t help you do that.

Generally, doing incremental git gc is the right approach, and better than doing git gc --aggressive. It’s going to re-use old deltas, and when those old deltas can’t be found (the reason for doing incremental GC in the first place!) it’s going to create new ones.

On the other hand, it’s definitely true that an “initial import of a long and involved history” is a point where it can be worth spending a lot of time finding the really good deltas. Then, every user ever after (as long as they don’t use git gc --aggressive to undo it!) will get the advantage of that one-time event. So especially for big projects with a long history, it’s probably worth doing some extra work, telling the delta finding code to go wild.

So the equivalent of git gc --aggressive — but done properly — is to do (overnight) something like

git repack -a -d --depth=250 --window=250

where that depth thing is just about how deep the delta chains can be (make them longer for old history — it’s worth the space overhead), and the window thing is about how big an object window we want each delta candidate to scan.

And here, you might well want to add the -f flag (which is the “drop all old deltas” one), since you now are actually trying to make sure that this one actually finds good candidates.

And then it’s going to take forever and a day (i.e., a “do it overnight” thing). But the end result is that everybody downstream from that repository will get much better packs, without having to spend any effort on it themselves.

          Linus
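As a side note on the depth parameter Linus discusses: you can inspect how deep the delta chains in a pack actually are with git verify-pack, whose verbose output ends with a per-depth histogram. A sketch, run in a throwaway repository built to contain a few deltas:

```shell
# Build a small repo whose pack should contain at least one delta chain.
repo=$(mktemp -d); cd "$repo"
git init -q
for i in 1 2 3 4 5; do
  seq 1 $((i * 100)) > data.txt
  git add data.txt
  git -c user.email=demo@example.com -c user.name=demo commit -qm "rev $i"
done
git repack -adq   # put everything into a single pack

# verify-pack -v lists each object, then prints a summary like:
#   non delta: N objects
#   chain length = 1: M objects
git verify-pack -v .git/objects/pack/pack-*.idx |
  grep -E "^(non delta|chain length)"
```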

4
Your comment about depth is a bit confusing. At first I was going to complain that you were completely wrong, because an aggressive GC can significantly speed up a git repository: a huge repository where git status took five minutes to complete dropped to a few seconds after one aggressive garbage collection. But then I realized you didn't mean that aggressive GC slows down a repository, but that extremely deep depth settings have that effect. - user6856

71

When should you use gc and repack?

As I mentioned in "Git garbage collection doesn't seem to fully work", git gc --aggressive is not enough, and not even sufficient on its own.
And, as I explain below, it is often not even needed.

The most effective combination would be to add git repack, but also git prune:

git gc
git repack -Ad      # kills in-pack garbage
git prune           # kills loose garbage
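To see what a cleanup pass like this actually reclaimed, compare git count-objects output before and after; with -v it reports loose objects ("count", "size") and packed objects ("in-pack", "size-pack", sizes in KiB). A minimal sketch in a throwaway repository:

```shell
repo=$(mktemp -d); cd "$repo"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
  commit -q --allow-empty -m init

git count-objects -v   # before: objects are loose ("count", "size")
git gc -q
git count-objects -v   # after gc: objects have moved into a pack ("in-pack", "size-pack")
```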

Note: Git 2.11 (Q4 2016) sets the default gc aggressive depth to 50.
See commit 07e7dbf (11 Aug 2016) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 0952ca8, 21 Sep 2016)

gc: default aggressive depth to 50

"git gc --aggressive" used to limit the delta-chain length to 250, which is way too deep for gaining extra space savings and is detrimental to runtime performance.
The limit has been reduced to 50.

In short, the current default of 250 doesn't save much space, and costs CPU. It's not a good tradeoff.

The "--aggressive" flag to git-gc does three things:

  1. use "-f" to throw out existing deltas and recompute from scratch
  2. use "--window=250" to look harder for deltas
  3. use "--depth=250" to make longer delta chains

Items (1) and (2) are a good match for an "aggressive" repack.
They ask the repack to do more computation work in the hopes of getting a better pack. You pay the costs during the repack, and other operations see only the benefit.

Item (3) is not so clear.
Allowing longer chains means fewer restrictions on the deltas, which means potentially finding better ones and saving some space.
But it also means that operations which access the deltas have to follow longer chains, which affects their performance.
So it's a tradeoff, and it's not clear that the tradeoff is even a good one.

(See the commit for the full study.)

You can see that the CPU savings for regular operations improve as we decrease the depth.
But we can also see that the space savings are not that great as the depth gets larger. Saving 5-10% between 10 and 50 may be worth the CPU tradeoff. Saving 1% to go from 50 to 100, or another 0.5% to go from 100 to 250, is probably not.
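One way to reproduce this kind of measurement on your own repository is to repack at different depths and compare the resulting pack size. A rough sketch (the repository here is a generated throwaway, and du numbers will differ per machine):

```shell
# Build a throwaway repo with enough history for deltas to matter.
repo=$(mktemp -d); cd "$repo"
git init -q
for i in $(seq 1 20); do
  seq 1 $((i * 50)) > data.txt
  git add data.txt
  git -c user.email=demo@example.com -c user.name=demo commit -qm "rev $i"
done

# Repack from scratch (-f) at several depths and report pack-dir size.
for depth in 10 50 250; do
  git repack -adfq --depth=$depth --window=250
  printf 'depth=%-4s ' "$depth"
  du -sk .git/objects/pack | cut -f1   # size in KiB
done
```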


Speaking of saving CPU, "git repack" learned to accept the --threads=<n> option and pass it to pack-objects.
See commit 40bcf31 (26 Apr 2017) by Junio C Hamano (gitster). (Merged by Junio C Hamano -- gitster -- in commit 31fb6f4, 29 May 2017)

repack: accept --threads=<n> and pass it down to pack-objects

We already do so for --window=<n> and --depth=<n>; this will help when the user wants to force --threads=1 for reproducible testing without being affected by racing between multiple threads.
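With a recent enough Git (the commit above was merged in mid-2017), the flag combines with the full from-scratch repack described earlier. A sketch in a throwaway repository:

```shell
# Single-threaded, from-scratch repack for reproducible delta results.
# Assumes a Git new enough that repack passes --threads to pack-objects.
repo=$(mktemp -d); cd "$repo"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
  commit -q --allow-empty -m init
git repack -adf --threads=1 --window=250 --depth=50
```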

3
I mentioned Linus's discussion in the "Git garbage collection doesn't seem to fully work" link. - VonC
2
Thank you for this modernized update! All the other answers here are outdated. We can now see that git gc --aggressive has been fixed twice: first, to pack the "better way" Linus suggested in 2007; then, in Git 2.11, to avoid the excessive object depth Linus had suggested, which turned out to be harmful (slowing down all future git operations without any space savings worth mentioning). - user964843
Why would git gc followed by git repack -Ad and git prune increase the size of my repository? - devops
@devops Not sure: which version of Git are you using? You could ask a new question (with more details, such as the OS, the general size of the repo, etc.). - VonC
man git-repack says of -d: "Also run git prune-packed to remove redundant loose object files." What about git prune, then -- does it do that too? man git-prune says: "In most cases, users should run git gc, which calls git prune." So what use is it after git gc? Would git repack -Ad && git gc be enough? - Jakob

15

The problem with git gc --aggressive is that the option name, and the documentation, are misleading.

As Linus himself explains in this mail, what git gc --aggressive basically does is this:

While git generally tries to re-use delta information (because it's a good idea, and it doesn't waste CPU time re-finding all the good deltas we found earlier), sometimes you want to say "let's start all over, with a blank slate, and ignore all the previous delta information, and try to generate a new set of deltas."

So usually there is no need to recompute deltas in git, since git is very flexible in determining them. It only makes sense if you know that you have really, really bad deltas. As Linus explains, mainly tools that use git fast-import fall into this category.

Most of the time git does a pretty good job at determining useful deltas, and using git gc --aggressive will leave you with deltas that are potentially even worse, while wasting a lot of CPU time.


Linus's conclusion is that in most cases git repack with a large --depth and --window is the better choice, especially after you have imported a big project and want to make sure that git finds good deltas.

So the equivalent of git gc --aggressive -- but done properly -- is to do (overnight) something like:

git repack -a -d --depth=250 --window=250

where that depth thing is just about how deep the delta chains can be (make them longer for old history -- it's worth the space overhead), and the window thing is about how big an object window we want each delta candidate to scan.

And here, you might well want to add the -f flag (which is the "drop all old deltas" one), since you now are actually trying to make sure that this one actually finds good candidates.


14

Caution: do not run git gc --aggressive on a repository that is not synced with a remote, unless you have a backup.

This operation recreates deltas from scratch, and it can lead to data loss if it is aborted ungracefully.

On my 8 GB machine, an aggressive GC ran out of memory on a 1 GB repository with 10k small commits. When the OOM killer terminated the git process, it left behind only a few deltas and the work tree; the repository was almost empty.

Of course, it was not the only copy of the repository, so I just recreated it and pulled from the remote (fetching on the broken repository did not work, and deadlocked on the "resolving deltas" step several times when I tried), but if your repository is a single-developer local repository with no remote at all, back it up first.
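The cheapest insurance before any aggressive repack is a full mirror clone, which copies every ref into a bare repository. A sketch with throwaway paths (--no-hardlinks forces a physical copy of the objects, since a plain local clone would hardlink them, as noted in the filter-branch checklist above):

```shell
# Make a demo repo, then take a full backup of it with --mirror.
src=$(mktemp -d); backup=$(mktemp -d)/backup.git
git init -q "$src"
git -C "$src" -c user.email=demo@example.com -c user.name=demo \
  commit -q --allow-empty -m init
# --mirror copies all refs (branches, tags, remote refs) into a bare repo;
# --no-hardlinks makes it an independent physical copy.
git clone --quiet --mirror --no-hardlinks "$src" "$backup"
```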


11
Note: be careful with git gc --aggressive, as its documentation was clarified in Git 2.22 (Q2 2019).
See commit 0044f77, commit daecbf2, commit 7384504, commit 22d4e3b, commit 080a448, commit 54d56f5, commit d257e0f, commit b6a8d09 (7 Apr 2019), and commit fc559fb, commit cf9cd77, commit b11e856 (22 Mar 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit ac70c53, 25 Apr 2019.)
The existing "gc --aggressive" documentation almost suggests that users should run it regularly, but in practice this is usually just a waste of time. So let's clarify what it really does and let users draw their own conclusions. Let's also clarify what "The effects [...] are persistent" means, by quoting a short version of Jeff King's explanation. This means the git-gc documentation now includes:

AGGRESSIVE

When the --aggressive option is supplied, git-repack will be invoked with the -f flag, which in turn will pass --no-reuse-delta to git-pack-objects.
This will throw away any existing deltas and re-compute them, at the expense of spending much more time on the repacking.

The effects of this are mostly persistent, e.g. when packs and loose objects are coalesced into one another pack the existing deltas in that pack might get re-used, but there are also various cases where we might pick a sub-optimal delta from a newer pack instead.

Furthermore, supplying --aggressive tweaks the --depth and --window options passed to git-repack.
See the gc.aggressiveDepth and gc.aggressiveWindow settings below.
By using a larger window size we're more likely to find more optimal deltas.

It's probably not worth using this option on a given repository without running tailored performance benchmarks on it.
It takes a lot more time, and the resulting space/delta optimization may or may not be worth it. Not using this at all is the right trade-off for most users and their repositories.

And (commit 080a448):

gc docs: note how --aggressive impacts --window and --depth

Since 07e7dbf (gc: default aggressive depth to 50, 2016-08-11), we have used the same depth under --aggressive as we do by default, which is a bit confusing.

As noted in that commit, it was wrong to make a deeper depth the default for "aggressive": it saves disk space at the cost of runtime performance, which is the opposite of what someone running an "aggressive gc" usually wants.


Page content provided by Stack Overflow; see the original English post via the source link.