Does `git clean` delete ignored files by default?

13

According to the help text, git clean without the -x option should leave ignored files alone, but in practice it does not.

[il@reallin test]$ cat .gitignore
*.sar
[il@reallin test]$ mkdir -p conf/sar && touch conf/sar/aaa.sar
[il@reallin test]$ git status
# On branch master
nothing to commit, working directory clean
[il@reallin test]$ git clean -df
Removing conf/

conf/sar/aaa.sar was deleted. Is this a bug?

5 Answers

8
According to man git clean:
-d
    Remove untracked directories in addition to untracked files.

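In your case, the directory conf/sar is not tracked: it does not contain any files tracked by git. If you had no gitignore rules and ran git clean -fd, the contents of this untracked directory would be deleted, just as the documentation says.
Now, if you add a .gitignore with a rule that ignores *.sar files, that does not change the fact that your directory conf/sar/ is still untracked, and having an untracked file aaa.sar that matches the gitignore rule should not suddenly make the directory unremovable by git clean -fd.
However, if you add any tracked file next to the ignored aaa.sar, the directory will not be deleted and your file will be left alone.
In other words, although it looks confusing, this is not a bug: git does exactly what the documentation says.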
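The tracked-sibling case described above can be tried in a throwaway repository. This is only a sketch: the helper name `demo_tracked_sibling` and the file name `keep.txt` are invented for illustration, and it assumes `git` and `mktemp` are on the PATH.

```shell
# Hypothetical helper: build a scratch repo where conf/sar holds one tracked
# file (keep.txt, an invented name) next to the ignored aaa.sar, then clean.
demo_tracked_sibling() {
  repo=$(mktemp -d) || return 1
  (
    cd "$repo" || exit 1
    git init -q .
    echo '*.sar' > .gitignore
    mkdir -p conf/sar
    touch conf/sar/aaa.sar conf/sar/keep.txt
    # Staged files count as tracked, so no commit is needed here.
    git add .gitignore conf/sar/keep.txt
    # conf/ now contains a tracked file, so it is not an untracked directory.
    git clean -df
    ls conf/sar
  )
  rm -rf "$repo"
}

demo_tracked_sibling
```

Both files should still be listed afterwards, because the directory is no longer untracked and the ignored file is not removed without -x.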

2
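By the way: it would be better to put a .gitignore inside the sar directory itself to ignore *.sar files. That keeps the top-level .gitignore cleaner while keeping the ignore information where it is needed, and it has the added benefit of keeping the directory alive, as @mvp mentioned. - Shahbaz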

5
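Warning: this git clean behavior changes slightly with Git 2.14 (Q3 2017).
"git clean -d" used to clean directories that contain ignored files, even though the command should not lose ignored files without "-x".
"git status --ignored" did not list ignored and untracked files without "-uall".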

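See commit 6b1db43 (23 May 2017), and commit bbf504a, commit fb89888, commit df5bcdf, commit 0a81d4a, commit b3487cc (18 May 2017) by Samuel Lijin (sxlijin).
(Merged by Junio C Hamano -- gitster -- in commit f4fd99b, 02 Jun 2017)

clean: teach clean -d to preserve ignored paths

There is an implicit assumption that a directory containing only untracked and ignored paths should itself be considered untracked. This makes sense in use cases where we are asking whether a directory should be added to the git database, but not when we are asking whether a directory can be safely removed from the working tree; as a result, clean -d would assume that an "untracked" directory containing ignored paths could be deleted, even though doing so would also remove the ignored paths.

To get around this, we teach clean -d to collect ignored paths and to skip an untracked directory if it contains an ignored path, removing only the untracked contents thereof.
To achieve this, cmd_clean() has to collect all untracked contents of untracked directories, in addition to all ignored paths, to determine which untracked directories must be skipped (because they contain ignored paths) and which ones should not be skipped.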
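Because the outcome depends on the Git version, a small check script can report what the installed `git clean -df` actually does with the layout from the question. This is a sketch under assumptions: the helper name `check_clean_d` is invented, and `git` and `mktemp` are assumed to be available.

```shell
# Hypothetical helper: recreate the question's layout in a scratch repo and
# report whether "git clean -df" keeps the ignored file.
check_clean_d() {
  repo=$(mktemp -d) || return 1
  (
    cd "$repo" || exit 1
    git init -q .
    echo '*.sar' > .gitignore
    git add .gitignore            # keep .gitignore itself out of clean's reach
    mkdir -p conf/sar
    touch conf/sar/aaa.sar        # ignored file in an otherwise untracked dir
    git clean -df >/dev/null
    if [ -e conf/sar/aaa.sar ]; then
      echo "ignored file preserved"
    else
      echo "ignored file removed"
    fi
  )
  rm -rf "$repo"
}

git --version
check_clean_d
```

Per the commit message above, a Git that includes this change is expected to report the file as preserved, while older versions would report it as removed.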

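However... since 2017, this change has meant that git status --ignored can hang, effectively indefinitely, on deeply nested untracked directories,
as reported by Martin Melka in this thread and analyzed by SZEDER Gábor:

With a depth of 120 directories, it would take over 6*10^23 years to complete.

The slowdown was caused by commit df5bcdf, one of a series of patches fixing 'git clean -d' deleting untracked directories even if they contain ignored files.

So... a fix is in progress and is due to be released later in 2020.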


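Git 2.24 (Q4 2019) demonstrates that this change in git clean behavior introduced a regression.

See commit 502c386 (25 Aug 2019) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit 026428c, 30 Sep 2019)

t7300-clean: demonstrate deleting nested repo with an ignored file breakage

'git clean -fd' must not delete an untracked directory if it belongs to a different Git repository or worktree.

Unfortunately, if a '.gitignore' rule in the outer repository happens to match a file in a nested repository or worktree, then something goes awry and 'git clean -fd' deletes everything in the nested repository's worktree except the ignored file, potentially leading to data loss.

Add a test to 't7300-clean.sh' to demonstrate this breakage.

This issue was introduced in 6b1db43 (clean: teach clean -d to preserve ignored paths, 2017-05-23, Git v2.13.2).


Git 2.24 also further clarifies the git clean -d command:

See commit 69f272b (01 Oct 2019), and commit 902b90c, commit ca8b539, commit 09487f2, commit e86bbcf, commit 3aca580, commit 29b577b, commit 89a1f4a, commit a3d89d8, commit 404ebce, commit a5e916c, commit bbbb6b0, commit 7541cc5 (17 Sep 2019) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit aafb754, 11 Oct 2019)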

t7300: add testcases showing failure to clean specified pathspecs

Someone brought me a testcase where multiple git-clean invocations were required to clean out unwanted files:

mkdir d{1,2}
touch d{1,2}/ut
touch d1/t && git add d1/t

With this setup, the user would need to run

git clean -ffd */ut

twice to delete both ut files.

A little testing showed some interesting variants:

  • If only one of those two ut files existed (either one), then only one clean command would be necessary.
  • If both directories had tracked files, then only one git clean would be necessary to clean both files.
  • If both directories had no tracked files then the clean command above would never clean either of the untracked files despite the pathspec explicitly calling both of them out.

A bisect showed that the failure to clean out the files started with commit cf424f5 ("clean: respect pathspecs with "-d"", 2014-03-10, Git v1.9.1).
However, that pointed to a separate issue: while the "-d" flag was used by the original user who showed me this problem, that flag should have been irrelevant to this problem.
Testing again without the "-d" flag showed that the same buggy behavior exists without using that flag, and has in fact existed since before cf424f5.

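So:

clean: respect pathspecs with "-d"

git-clean uses read_directory to fill in a struct dir with potential hits. However, read_directory does not actually check against our pathspec. It uses a simplified version that may turn up false positives. As a result, we need to check that any hits match our pathspec.

We do so reliably for non-directories.

For directories, if "-d" is not given, we check that the pathspec matched exactly (i.e., we are even stricter, and require an explicit "git clean foo" to clean "foo/"). But if "-d" is given, rather than relaxing the exact match to allow a recursive match, we do not check the pathspec at all.

This regression was introduced in 113f10f (Make git-clean a builtin, 2007-11-11, Git v1.5.4-rc0).

dir: if our pathspec might match files under a dir, recurse into it

For git clean, if a directory is entirely untracked and the user did not specify -d (corresponding to DIR_SHOW_IGNORED_TOO), then we usually do not want to remove that directory and thus do not recurse into it.

However, if the user manually specified specific (or even globbed) paths somewhere under that directory to remove, then we need to recurse into the directory to make sure we remove the relevant paths under it, as the user requested.

Note that this does not mean that the recursed-into directory will be added to dir->entries for later removal; as of a few commits earlier in this series, there is another, stricter match check that runs after returning from a recursed-into directory, before deciding whether to add it to the list of entries.
Therefore, this will only result in files underneath the given directory which match one of the pathspecs being added to the entries list.

And: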

dir: also check directories for matching pathspecs

Even if a directory doesn't match a pathspec, it is possible, depending on the precise pathspecs, that some file underneath it might.
So we special case and recurse into the directory for such situations.
However, we previously always added any untracked directory that we recursed into to the list of untracked paths, regardless of whether the directory itself matched the pathspec.

For the case of git-clean and a set of pathspecs of "dir/file" and "more", this caused a problem because we'd end up with dir entries for both of:

"dir"
"dir/file"

Then correct_untracked_entries() would try to helpfully prune duplicates for us by removing "dir/file" since it's under "dir", leaving us with

"dir"

Since the original pathspec only had "dir/file", the only entry left doesn't match and leaves nothing to be removed.
(Note that if only one pathspec was specified, e.g. only "dir/file", then the common_prefix_len optimizations in fill_directory would cause us to bypass this problem, making it appear in simple tests that we could correctly remove manually specified pathspecs.)

Fix this by actually checking whether the directory we are about to add to the list of dir entries actually matches the pathspec; only do this matching check after we have already returned from recursing into the directory.
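The de-duplication problem described above can be mimicked outside of Git. The following is a toy sketch, not Git's implementation: `prune_nested` is an invented helper that drops any path lying under an earlier kept path, roughly what the text attributes to correct_untracked_entries(), and `grep -Fx` stands in for an exact pathspec match.

```shell
# prune_nested (hypothetical): read one path per line and drop every path
# that lies underneath a path already kept. Sorting puts parents first.
prune_nested() {
  LC_ALL=C sort | awk '{
    for (i = 1; i <= n; i++)
      if (index($0, kept[i] "/") == 1) next   # nested under a kept directory
    kept[++n] = $0
    print
  }'
}

# Entries collected for the pathspecs "dir/file" and "more": both "dir" and
# "dir/file" were added, so pruning leaves only "dir" ...
printf '%s\n' dir dir/file | prune_nested

# ... and "dir" matches neither pathspec exactly, so nothing is left to remove.
printf '%s\n' dir dir/file | prune_nested | grep -Fx -e dir/file -e more \
  || echo "nothing left to remove"
```

The fix in the commit message corresponds to never adding "dir" in the first place, so that "dir/file" survives the pruning step.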

That leads to:

clean: disambiguate the definition of -d

The -d flag pre-dated git-clean's ability to have paths specified.
As such, the default for git-clean was to only remove untracked files in the current directory, and -d existed to allow it to recurse into subdirectories.

The interaction of paths and the -d option appears to not have been carefully considered, as evidenced by numerous bugs and a dearth of tests covering such pairings in the testsuite.
The definition turns out to be important, so let's look at some of the various ways one could interpret the -d option:

A) Without -d, only look in subdirectories which contain tracked files under them; with -d, also look in subdirectories which are untracked for files to clean.

B) Without specified paths from the user for us to delete, we need to have some kind of default, so...without -d, only look in subdirectories which contain tracked files under them; with -d, also look in subdirectories which are untracked for files to clean.

The important distinction here is that choice B says that the presence or absence of '-d' is irrelevant if paths are specified.
The logic behind option B is that if a user explicitly asked us to clean a specified pathspec, then we should clean anything that matches that pathspec.

Some examples may clarify.

Should:

git clean -f untracked_dir/file

remove untracked_dir/file or not?
It seems crazy not to, but a strict reading of option A says it shouldn't be removed.
How about:

git clean -f untracked_dir/file1 tracked_dir/file2

or

git clean -f untracked_dir_1/file1 untracked_dir_2/file2

?
Should it remove either or both of these files?
Should it require multiple runs to remove both the files listed? (If this sounds like a crazy question to even ask, see the commit message of "t7300: Add some testcases showing failure to clean specified pathspecs" added earlier in this patch series.)
What if -ffd were used instead of -f -- should that allow these to be removed? Should it take multiple invocations with -ffd?
What if a glob (such as '*tracked*') were used instead of spelling out the directory names?
What if the filenames involved globs, such as

git clean -f '*.o'

or

git clean -f '*/*.o'

?

The current documentation actually suggests a definition that is slightly different than choice A, and the implementation prior to this series provided something radically different than either choices A or B.
(The implementation, though, was clearly just buggy).

There may be other choices as well.
However, for almost any given choice of definition for -d that I can think of, some of the examples above will appear buggy to the user.
The only case that doesn't have negative surprises is choice B: treat a user-specified path as a request to clean all untracked files which match that path specification, including recursing into any untracked directories.

Change the documentation and basic implementation to use this definition.

There were two regression tests that indirectly depended on the current implementation, but neither was about subdirectory handling.
These two tests were introduced in commit 5b7570c ("git-clean: add tests for relative path", 2008-03-07, Git v1.5.5-rc0) which was solely created to add coverage for the changes in commit fb328947c8e ("git-clean: correct printing relative path", 2008-03-07).
Both tests specified a directory that happened to have an untracked subdirectory, but both were only checking that the resulting printout of a file that was removed was shown with a relative path.
Update these tests appropriately.

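Finally, see "Git clean exclude nested sub directory".
Warning: the directory traversal code had redundant recursive calls, which made its performance characteristics exponential in the depth of the tree. This was corrected in Git 2.27 (Q2 2020).
It also affected `git clean`.

See commit c0af173, commit 95c11ec, commit 7f45ab2, commit 1684644, commit 8d92fb2, commit 2df179d, commit 0126d14, commit cd129ee, commit 446f46d, commit 7260c7b, commit ce5c61a (01 Apr 2020) by Elijah Newren (newren).
See commit 0bbd0e8 (01 Apr 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 6eacc39, 29 Apr 2020)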

dir: replace exponential algorithm with a linear one

Signed-off-by: Elijah Newren

dir's read_directory_recursive() naturally operates recursively in order to walk the directory tree.

Treating of directories is sometimes weird because there are so many different permutations about how to handle directories.

Some examples:

  • 'git ls-files -o --directory' only needs to know that a directory itself is untracked; it doesn't need to recurse into it to see what is underneath.
  • 'git status' needs to recurse into an untracked directory, but only to determine whether or not it is empty.
    If there are no files underneath, the directory itself will be omitted from the output.
    If it is not empty, only the directory will be listed.
  • 'git status --ignored' needs to recurse into untracked directories and report all the ignored entries and then report the directory as untracked -- UNLESS all the entries under the directory are ignored, in which case we don't print any of the entries under the directory and just report the directory itself as ignored.
    (Note that although this forces us to walk all untracked files underneath the directory as well, we strip them from the output, except for users like 'git clean' who also set DIR_KEEP_UNTRACKED_CONTENTS.)
  • For 'git clean', we may need to recurse into a directory that doesn't match any specified pathspecs, if it's possible that there is an entry underneath the directory that can match one of the pathspecs.
    In such a case, we need to be careful to omit the directory itself from the list of paths (see commit 404ebceda01c ("dir: also check directories for matching pathspecs", 2019-09-17, Git v2.24.0-rc0))

Part of the tension noted above is that the treatment of a directory can change based on the files within it, and based on the various settings in dir->flags.

Trying to keep this in mind while reading over the code, it is easy to think in terms of "treat_directory() tells us what to do with a directory, and read_directory_recursive() is the thing that recurses".

Since we need to look into a directory to know how to treat it, though, it is quite easy to decide to (also) recurse into the directory from treat_directory() by adding a read_directory_recursive() call.

Adding such a call is actually fine, IF we make sure that read_directory_recursive() does not also recurse into that same directory.

Unfortunately, commit df5bcdf83aeb ("dir: recurse into untracked dirs for ignored files", 2017-05-18, Git v2.14.0-rc0 -- merge listed in batch #5), added exactly such a case to the code, meaning we'd have two calls to read_directory_recursive() for an untracked directory.

So, if we had a file named

one/two/three/four/five/somefile.txt

and nothing in one/ was tracked, then 'git status --ignored' would call read_directory_recursive() twice on the directory 'one/', and each of those would call read_directory_recursive() twice on the directory 'one/two/', and so on until read_directory_recursive() was called 2^5 times for 'one/two/three/four/five/'.

Avoid calling read_directory_recursive() twice per level by moving a lot of the special logic into treat_directory().

Since dir.c is somewhat complex, extra cruft built up around this over time.

While trying to unravel it, I noticed several instances where the first call to read_directory_recursive() would return e.g. path_untracked for some directory and a later one would return e.g. path_none, despite the fact that the directory clearly should have been considered untracked.

The code happened to work due to the side-effect from the first invocation of adding untracked entries to dir->entries; this allowed it to get the correct output despite the supposed override in return value by the later call.

I am somewhat concerned that there are still bugs and maybe even testcases with the wrong expectation.

I have tried to carefully document treat_directory() since it becomes more complex after this change (though much of this complexity came from elsewhere that probably deserved better comments to begin with).

However, much of my work felt more like a game of whackamole while attempting to make the code match the existing regression tests than an attempt to create an implementation that matched some clear design.

That seems wrong to me, but the rules of existing behavior had so many special cases that I had a hard time coming up with some overarching rules about what correct behavior is for all cases, forcing me to hope that the regression tests are correct and sufficient.

Such a hope seems likely to be ill-founded, given my experience with dir.c-related testcases in the last few months:

Examples where the documentation was hard to parse or even just wrong:

  • 3aca58045f4f (git-clean.txt: do not claim we will delete files with -n/--dry-run, 2019-09-17, Git v2.24.0-rc0)
  • 09487f2cbad3 (clean: avoid removing untracked files in a nested git repository, 2019-09-17, v2.24.0-rc0)
  • e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)

Examples where testcases were declared wrong and changed:

  • 09487f2cbad3 (clean: avoid removing untracked files in a nested git repository, 2019-09-17, Git v2.24.0-rc0)
  • e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17, Git v2.24.0-rc0)
  • a2b13367fe55 (Revert "dir.c: make 'git-status --ignored' work within leading directories", 2019-12-10, Git v2.25.0-rc0)

Examples where testcases were clearly inadequate:

  • 502c386ff944 (t7300-clean: demonstrate deleting nested repo with an ignored file breakage, 2019-08-25, Git v2.24.0-rc0)
  • 7541cc530239 (t7300: add testcases showing failure to clean specified pathspecs, 2019-09-17, Git v2.24.0-rc0)
  • a5e916c7453b (dir: fix off-by-one error in match_pathspec_item, 2019-09-17, Git v2.24.0-rc0)
  • 404ebceda01c (dir: also check directories for matching pathspecs, 2019-09-17, Git v2.24.0-rc0)
  • 09487f2cbad3 (clean: avoid removing untracked files in a nested git repository, 2019-09-17, Git v2.24.0-rc0)
  • e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17, Git v2.24.0-rc0)
  • 452efd11fbf6 (t3011: demonstrate directory traversal failures, 2019-12-10, Git v2.25.0-rc0)
  • b9670c1f5e6b (dir: fix checks on common prefix directory, 2019-12-19, Git v2.25.0-rc0)

Examples where "correct behavior" was unclear to everyone:

Other commits of note:

  • 902b90cf42bc (clean: fix theoretical path corruption, 2019-09-17, Git v2.24.0-rc0)

However, on the positive side, it does make the code much faster.

For the following simple shell loop in an empty repository:

for depth in $(seq 10 25)
do
  dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
  rm -rf dir
  mkdir -p $dirs
  >$dirs/untracked-file
  /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
done

I saw the following timings, in seconds (note that the numbers are a little noisy from run-to-run, but the trend is very clear with every run):

10: 0.03
11: 0.05
12: 0.08
13: 0.19
14: 0.29
15: 0.50
16: 1.05
17: 2.11
18: 4.11
19: 8.60
20: 17.55
21: 33.87
22: 68.71
23: 140.05
24: 274.45
25: 551.15

For the above run, using strace I can look for the number of untracked directories opened and can verify that it matches the expected 2^($depth+1)-2 (the sum of 2^1 + 2^2 + 2^3 + ... + 2^$depth).

After this fix, with strace I can verify that the number of untracked directories that are opened drops to just $depth, and the timings all drop to 0.00.
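The directory-open counts quoted above are easy to reproduce with shell arithmetic. This is a minimal sketch; the function names `opens_before` and `closed_form` are invented for illustration and only model the counting, not Git itself.

```shell
# opens_before (hypothetical): directories opened by the old code for a chain
# of $1 nested untracked directories, i.e. 2^1 + 2^2 + ... + 2^depth.
opens_before() {
  depth=$1 total=0 level=1
  while [ "$level" -le "$depth" ]; do
    total=$((total + (1 << level)))
    level=$((level + 1))
  done
  echo "$total"
}

# closed_form (hypothetical): the same count via 2^(depth+1) - 2.
closed_form() {
  echo $(( (1 << ($1 + 1)) - 2 ))
}

# After the fix, the text above says only $depth directories are opened.
for d in 10 20 25; do
  echo "depth $d: before=$(opens_before "$d") closed-form=$(closed_form "$d") after=$d"
done
```

The summed and closed-form values agree for every depth, which is the check the commit message describes doing with strace.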

In fact, it isn't until a depth of 190 nested directories that it sometimes starts reporting a time of 0.01 seconds and doesn't consistently report 0.01 seconds until there are 240 nested directories. The previous code would have taken

17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS

to have completed the 240 nested directories case.

It's not often that you get to speed something up by a factor of 3*10^69.
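The extrapolation above can be rechecked with awk. The sketch below assumes, as the commit message does, that the 17.55 s measured at depth 20 doubles with each extra level; `years_for_depth` is an invented name.

```shell
# years_for_depth (hypothetical): extrapolate the measured 17.55 s at depth 20,
# doubling per additional level, and convert the result from seconds to years.
years_for_depth() {
  awk -v depth="$1" 'BEGIN {
    seconds = 17.55 * 2 ^ (depth - 20)
    printf "%.1e\n", seconds / (60 * 60 * 24 * 365)
  }'
}

years_for_depth 240   # prints 9.4e+59
```

Dividing the same number of seconds by the 0.01 s measured after the fix gives a factor of roughly 3*10^69, which matches the speedup quoted above.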


0
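To get the behavior you want, that is, protecting untracked directories from git clean -d while selectively deleting content from inside them, you have to explicitly ignore the entire topmost untracked directory, in your case:
echo /conf/ >>.gitignore   # or .git/info/exclude if it's just you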

Now git clean will not recurse into the untracked directory, but fortunately that is easy to do by hand:
# recursive x-ray git clean with various options:

git ls-files --exclude-standard '-x!*/' -oz  | xargs -0 rm -f   #
git ls-files                            -oz  | xargs -0 rm -f   # -x
git ls-files --exclude-standard '-x!*/' -oiz | xargs -0 rm -f   # -X

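(Or use git ls-files --exclude-standard '-x!/conf/' to target one specific directory only.) The single quotes are there because ! is interactive-shell syntax for pulling in parts of a previous command line.

To clean up empty directories, you can get close to the desired behavior with: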

find -depth -type d -empty -delete
# -delete is -exec rmdir '{}' ';' on non-GNU userlands

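But this really belongs in a makefile recipe, followed immediately by a batch of mkdir -p to recreate any structure you want to keep even when empty, because make is built to manage transients such as build/test/install products.


"git completely ignores conf/sar/" - wrong. conf is still deleted. - basin
"git clean -f" - it does nothing. - basin
You're right, I overlooked that /conf/ is also untracked, and I had evidently never used git clean -f without -d in a directory I expected it to recurse into. Better now? - jthill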

0

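Yes, contrary to the documentation, git clean does delete ignored files even when neither -x nor -X is specified.

It seems that the -d option overrides the absence of -x/-X. That is, git clean -df removes untracked directories even if they contain untracked but ignored files.

I don't know whether this is an oversight or deliberate, but the manual is clearly incomplete in this respect. You might consider sending a manual patch to the git mailing list.

By the way, the same problem was discussed in the question "How to keep all ignored files in git clean -fd?". It was pointed out there that git clean -df does not delete directories that are listed in .gitignore. So, to keep your conf/, add it to .gitignore.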


-1
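In addition to the git clean fixes I mentioned previously, the code cleanup in Git 2.28 (Q3 2020) also fixes a recent performance regression: "git clean" was optimized in that release.

See commit 7233f17, commit f7f5c6c, commit 351ea1c, commit e6c0be9 (11 Jun 2020) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 5367469, 25 Jun 2020)

clean: optimize and document cases where we recurse into subdirectories

Reported-by: Brian Malehorn
Signed-off-by: Elijah Newren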

Commit 6b1db43109 ("clean: teach clean -d to preserve ignored paths", 2017-05-23, Git v2.14.0-rc0 -- merge listed in batch #5) added the following code block (among others) to git-clean:

if (remove_directories)
    dir.flags |= DIR_SHOW_IGNORED_TOO | DIR_KEEP_UNTRACKED_CONTENTS;

The reason for these flags is well documented in the commit message, but isn't obvious just from looking at the code.

Add some explanations to the code to make it clearer.

Further, it appears git-2.26 did not correctly handle this combination of flags from git clean.

With both these flags and without DIR_SHOW_IGNORED_TOO_MODE_MATCHING set, git is supposed to recurse into all untracked AND ignored directories.

git-2.26.0 clearly was not doing that.

I don't know the full reasons for that or whether git < 2.27.0 had additional unknown bugs because of that misbehavior, because I don't feel it's worth digging into.

As per the huge changes and craziness documented in commit 8d92fb2927 ("dir: replace exponential algorithm with a linear one", 2020-04-01, Git v2.27.0-rc0 -- merge listed in batch #5), the old algorithm was a mess and was thrown out.

What I can say is that git-2.27.0 correctly recurses into untracked AND ignored directories with that combination.

However, in clean's case we don't need to recurse into ignored directories; that is just a waste of time.

Thus, when git-2.27.0 started correctly handling those flags, we got a performance regression report.

Rather than relying on other bugs in fill_directory()'s former logic to provide the behavior of skipping ignored directories, make use of the DIR_SHOW_IGNORED_TOO_MODE_MATCHING value specifically added in commit eec0f7f2b7 ("status: add option to show ignored files differently", 2017-10-30, Git v2.16.0-rc0 -- merge listed in batch #4) for this purpose.

