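Wikipedia explains automatic rename detection:
Briefly, given a file in revision N, a file of the same name in revision N−1 is its default ancestor. However, when there is no like-named file in revision N−1, Git searches for a file that existed only in revision N−1 and is very similar to the new file.
Rename detection clearly boils down to similar-file detection. Is that algorithm documented? It would be good to know which kinds of transformations are detected automatically.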
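Git tracks file contents, not filenames. So if a file is simply renamed without changing its content, Git detects that easily. (Git does not track renames, it detects them; using `git mv` is equivalent to using `git rm` and `git add`.)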
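When a file is added to the repository, its name is recorded in a tree object, while the actual content is stored as a blob. If another file has the same content, Git does not add a second blob for it. In fact it cannot: objects are addressed by the hash of their content (for loose objects, the first two characters of the hash are the directory name and the rest is the file name inside it), so identical content always maps to the same object. Detecting an exact rename, where the content is unchanged, therefore only requires comparing hashes.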
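To make the content-addressing point concrete, here is a minimal sketch of how a blob ID is derived. It assumes a POSIX shell with `sha1sum` available and Git's default SHA-1 object format; `blob_id` is a hypothetical helper name, not a Git command. Two files with different names but identical bytes get the same ID, which is why an exact rename can be spotted by comparing hashes alone.

```shell
#!/bin/sh
# blob_id is a hypothetical helper (not part of Git): it hashes the header
# "blob <size>\0" followed by the file's raw bytes, which is how Git names a
# blob under its default SHA-1 object format.
blob_id() {
  size=$(wc -c < "$1" | tr -d ' ')
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/old-name.txt"
cp "$workdir/old-name.txt" "$workdir/new-name.txt"

id_old=$(blob_id "$workdir/old-name.txt")
id_new=$(blob_id "$workdir/new-name.txt")

# Same bytes, different names: the IDs match, so no second blob is needed.
echo "$id_old"
echo "$id_new"
```

Where Git is installed, `git hash-object <file>` should print the same value for the same file.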
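To detect renames that come with small changes, Git uses certain algorithms and a similarity threshold to decide whether a change counts as a rename. See, for example, the `-M` flag of `git diff`. There are also configuration values such as `merge.renameLimit` (the number of files to consider when performing rename detection during a merge).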
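As a rough illustration of the `-M` threshold, the sketch below builds a throwaway repository, renames a file while appending one line, and compares the default threshold with `-M100%`. It assumes `git`, `seq` and `mktemp` are installed; the file names and the `g` wrapper are made up for this example.

```shell
#!/bin/sh
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Hypothetical wrapper: passes an identity per command so the sketch does
# not depend on the machine's global Git configuration.
g() { git -c user.name=demo -c user.email=demo@example.invalid "$@"; }

# Commit a 100-line file under its original name.
seq 1 100 > old.txt
g add old.txt
g commit -q -m "add old.txt"

# Rename it and append one line: similar content, but not identical.
git mv old.txt new.txt
echo "extra line" >> new.txt
g add new.txt
g commit -q -m "rename with a small edit"

# Default similarity threshold: the change is reported with status R<score>.
default_status=$(git diff -M --name-status HEAD^ HEAD)
# -M100% accepts only exact renames, so the same change shows as add + delete.
strict_status=$(git diff -M100% --name-status HEAD^ HEAD)

printf '%s\n---\n%s\n' "$default_status" "$strict_status"
```

The score after `R` in the first output is the similarity Git computed for the pair; raising or lowering the value given to `-M` moves the cutoff between "rename" and "delete plus add".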
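If you want to know how Git treats similar files (that is, which file transformations are considered renames), explore the configuration options and flags mentioned above. You do not need to know how it works internally for that. To understand how Git actually accomplishes these tasks, look at the algorithms for finding differences in text, and read the Git source code.
The algorithms are applied only for diff, merge, and log purposes -- they do not affect how Git stores data. Any small change in a file's content means a new object is added. No deltas or diffs are involved at that level. Of course, the objects may later be packed into a packfile, but that is unrelated to rename detection.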
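`git cherry-pick` updated the wrong file because it mistook an add for a rename. Unfortunately, I had already pushed the change, so I had to re-add the correct file by hand. In my view, Git's rename detection is a stupid concept - it should stick to explicit, user-specified renames (as hg does). - Frank Schmitt
Is that algorithm documented?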
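At least with Git 2.33 (Q3 2021), the documentation for `git diff -l<n>` (man) and `diff.renameLimit` has been updated, and the defaults for these limits have been raised.
See commit 94b82d5, commit 9dd29db, commit 6623a52, commit 05d2c61 (15 Jul 2021) by Elijah Newren (`newren`).
(Merged by Junio C Hamano -- `gitster` -- in commit 268055b, 28 Jul 2021)
`rename`: bump limit defaults yet again
Signed-off-by: Elijah Newren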
These were last bumped in commit 92c57e5 ("bump rename limit defaults (again)", 2011-02-19, Git v1.7.5-rc0 -- merge), and were bumped both because processors had gotten faster, and because people were getting ugly merges that caused problems and reporting it to the mailing list (suggesting that folks were willing to spend more time waiting).
Since that time:
- Linus has continued recommending kernel folks to set diff.renameLimit=0 (maps to 32767, currently)
- Folks with repositories with lots of renames were happy to set `merge.renameLimit` above 32767, once the code supported that, to get correct cherry-picks
- Processors have gotten faster
- It has been discovered that the timing methodology used last time probably used too large example files.
The last point is probably worth explaining a bit more:
- The "average" file size used appears to have been average blob size in the linux kernel history at the time (probably v2.6.25 or something close to it).
- Since bigger files are modified more frequently, such a computation weights towards larger files.
- Larger files may be more likely to be modified over time, but are not more likely to be renamed -- the mean and median blob size within a tree are a bit higher than the mean and median of blob sizes in the history leading up to that version for the linux kernel.
- The mean blob size in v2.6.25 was half the average blob size in history leading to that point
- The median blob size in v2.6.25 was about 40% of the mean blob size in v2.6.25.
- Since the mean blob size is more than double the median blob size, any file as big as the mean will not be compared to any files of median size or less (because they'd be more than 50% dissimilar).
- Since it is the number of files compared that provides the `O(n^2)` behavior, median-sized files should matter more than mean-sized ones.
The combined effect of the above is that the file size used in past calculations was likely about 5x too large.
Combine that with a CPU performance improvement of ~30%, and we can increase the limits by a factor of `sqrt(5/(1-.3)) = 2.67`, while keeping the original stated time limits.
Keeping the same approximate time limit probably makes sense for `diff.renameLimit` (there is no progress feedback in e.g. `git log -p` (man)), but the experience above suggests `merge.renameLimit` could be extended significantly.
In fact, it probably would make sense to have an unlimited default setting for `merge.renameLimit`, but that would likely need to be coupled with changes to how progress is displayed.
(See https://lore.kernel.org/git/YOx+Ok%2FEYvLqRMzJ@coredump.intra.peff.net/ for details in that area.)
For now, let's just bump the approximate time limit from 10s to 1m.
(Note: We do not want to use actual time limits, because getting results that depend on how loaded your system is that day feels bad, and because we don't discover that we won't get all the renames until after we've put in a lot of work rather than just upfront telling the user there are too many files involved.)
Using the original time limit of 2s for `diff.renameLimit`, and bumping `merge.renameLimit` from 10s to 60s, I found the following timings using the simple script at the end of this commit message (on an AWS c5.xlarge which reports as "Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz"):
    N   Timing
    0   1.995s
    0   59.973s
So let's round down to nice even numbers and bump the limits from 400->1000, and from 1000->7000.
Here is the `measure_rename_perf` script (adapted from https://lore.kernel.org/git/20080211113516.GB6344@coredump.intra.peff.net/ in particular to avoid triggering the linear handling from basename-guided rename detection):
    #!/bin/bash
    n=$1; shift
    rm -rf repo
    mkdir repo && cd repo
    git init -q -b main

    mkdata() {
      mkdir $1
      for i in `seq 1 $2`; do
        (sed "s/^/$i /" <../sample
         echo tag: $1
        ) >$1/$i
      done
    }

    mkdata initial $n
    git add .
    git commit -q -m initial

    mkdata new $n
    git add .
    cd new
    for i in *; do git mv $i $i.renamed; done
    cd ..
    git rm -q -rf initial
    git commit -q -m new

    time git diff-tree -M -l0 --summary HEAD^ HEAD
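The scaling factor quoted in the commit message can be checked with a few lines of shell. This is only a sketch of the arithmetic, assuming `awk` is available and that cost grows with the square of the number of files, so a file-count limit scales with the square root of the time budget. The variable names are illustrative, and the second factor (for the 10s to 60s change) is derived here from that assumption rather than stated in the commit.

```shell
#!/bin/sh
# Per the commit message: sample files were ~5x too large and CPUs improved
# by ~30%. With O(n^2) cost, the file-count limit scales with the square root.
same_time_factor=$(LC_ALL=C awk 'BEGIN { printf "%.2f", sqrt(5 / (1 - 0.3)) }')

# Assumption of this sketch: raising the time budget from 10s to 60s allows
# a further sqrt(60/10) increase under the same quadratic model.
longer_budget_factor=$(LC_ALL=C awk 'BEGIN { printf "%.2f", sqrt(60 / 10) }')

echo "same-time factor:     $same_time_factor"
echo "longer-budget factor: $longer_budget_factor"
```

Applied to the old `diff.renameLimit` default of 400, the first factor gives roughly 1070, which is in line with the rounded-down value of 1000 chosen in the commit.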
`git config` now includes in its man page:
... `-l`. If not set, the default value is currently 1000.
`git config` now includes in its man page:
... currently defaults to 7000.
Git 2.33 (Q3 2021) also comes with the following update:
See commit 94b82d5, commit 9dd29db, commit 6623a52, commit 05d2c61 (15 Jul 2021) by Elijah Newren (`newren`).
(Merged by Junio C Hamano -- `gitster` -- in commit 268055b, 28 Jul 2021)
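`doc`: clarify documentation for rename/copy limits
Signed-off-by: Elijah Newren
A few places in the docs implied that rename/copy detection is always quadratic, or that all (unpaired) files are involved in the quadratic portion of rename/copy detection.
Among the changes the commit message references is "`diffcore-rename`: guide inexact rename detection based on basenames" (2021-02-14, Git v2.31.0-rc1 -- merge).
`git config` now includes in its man page:
The number of files to consider in the exhaustive portion of copy/rename detection; equivalent to the 'git diff' option `-l`.
If not set, the default value is currently 400.
This setting has no effect if rename detection is turned off.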
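`git config` now includes in its man page:
The number of files to consider in the exhaustive portion of rename detection during a merge.
If not specified, defaults to the value of `diff.renameLimit`.
If neither `merge.renameLimit` nor `diff.renameLimit` are specified, currently defaults to 1000.
This setting has no effect if rename detection is turned off.
(The 400 and 1000 quoted in these two entries are the defaults before the bump described in the commit message above, which raises them to 1000 and 7000.)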
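`diff-options` now includes in its man page:
The `-M` and `-C` options involve some preliminary steps that can detect subsets of renames/copies cheaply, followed by an exhaustive fallback portion that compares all remaining unpaired destinations to all relevant sources.
(For renames, only remaining unpaired sources are relevant; for copies, all original sources are relevant.) For N sources and destinations, this exhaustive check is `O(N^2)`. This option prevents the exhaustive portion of rename/copy detection from running if the number of source/destination files involved exceeds the specified number.
Defaults to `diff.renameLimit`.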
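For completeness, here is a small sketch of pinning these limits explicitly, so that behavior does not depend on which Git version supplies the defaults. It assumes `git` and `mktemp` are installed and uses a scratch repository; the values simply mirror the Git 2.33 defaults quoted above.

```shell
#!/bin/sh
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Store the limits in this repository's own config file.
git config diff.renameLimit 1000
git config merge.renameLimit 7000

# Read them back to confirm what diff and merge will use here.
diff_limit=$(git config --get diff.renameLimit)
merge_limit=$(git config --get merge.renameLimit)
echo "diff.renameLimit=$diff_limit merge.renameLimit=$merge_limit"
```

For a single command, the `-l<n>` option of `git diff` takes the place of the configured value, as with the `-l0` used by the timing script in the commit message.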