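Given the hash of a blob, is there a way to get a list of commits that have this blob in their tree?
Both scripts below take the blob's hash as the first argument, followed by any optional arguments that git log understands. For example, --all searches all branches instead of just the current one, and -g searches the reflog; any other option you like works too.
Here it is as a shell script, short and sweet, but slow: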
#!/bin/sh
obj_name="$1"
shift
git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    # List every object in the commit's tree and look for the blob hash.
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done
And here is an optimised version in Perl, still quite short but faster:
#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;
my $obj_name;
sub check_tree {
    my ( $tree ) = @_;
    my @subtree;
    {
        open my $ls_tree, '-|', git => 'ls-tree' => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";
        while ( <$ls_tree> ) {
            /\A[0-7]{6} (\S+) (\S+)/
                or die "unexpected git-ls-tree output";
            return 1 if $2 eq $obj_name;
            push @subtree, $2 if $1 eq 'tree';
        }
    }
    check_tree( $_ ) && return 1 for @subtree;
    return;
}
memoize 'check_tree';
die "usage: git-find-blob <blob> [<git-log arguments ...>]\n"
    if not @ARGV;
my $obj_short = shift @ARGV;
$obj_name = do {
    local $ENV{'OBJ_NAME'} = $obj_short;
    `git rev-parse --verify \$OBJ_NAME`;
} or die "Couldn't parse $obj_short: $!\n";
chomp $obj_name;
open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
    or die "Couldn't open pipe to git-log: $!\n";
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $subject ) = split " ", $_, 3;
    print "$commit $subject\n" if check_tree( $tree );
}
To expand a hash prefix to the full hash, use the git rev-parse --verify $theprefix command. - John Douthat
my $blob_arg = shift;
open my $rev_parse, '-|', git => 'rev-parse' => '--verify', $blob_arg or die "Couldn't open pipe to git-rev-parse: $!\n";
my $obj_name = <$rev_parse>;
chomp $obj_name;
close $rev_parse or die "Couldn't expand passed blob.\n";
$obj_name eq $blob_arg or print "(full blob is $obj_name)\n";
- Ingo Karkat
obj_name="$1"
shift
# %cd with --date=short is used here so that cdate is actually populated.
git log --all --date=short --pretty=format:'%T %h %cd %s %n' -- "$@" | while read -r tree commit cdate subject ; do
    if [ -z "$tree" ] ; then
        continue
    fi
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$cdate $commit $* $subject"
    fi
done
- Mixologic
git log --raw --all --find-object=<blob hash>
$ git log --raw --all --find-object=b3bb59f06644
commit 8ef93124645f89c45c9ec3edd3b268b38154061a
⋮
diff: do not show submodule with untracked files as "-dirty"
⋮
:100644 100644 b3bb59f06644 8f6227c993a5 M submodule.c
commit 7091499bc0a9bccd81a1c864de7b5f87a366480e
⋮
Revert "submodules: fix of regression on fetching of non-init subsub-repo"
⋮
:100644 100644 eef5204e641e b3bb59f06644 M submodule.c
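The --raw option tells git to include the pre- and post-change blob hashes in the output lines.
git whatchanged is semi-deprecated; it is basically equivalent to git log --raw --no-merges, which is not semi-deprecated. - torek
Unfortunately, the scripts were a bit slow for me, so I had to optimize a bit. Luckily, I had not only the hash but also the path of the file.
git log --all --pretty=format:%H -- <path> | xargs -I% sh -c "git ls-tree % -- <path> | grep -q <hash> && echo %"
If you want the newest commit that contains <hash> at <path>, remove the <path> argument from git log. The first result returned is the commit you want. - Unapiedra
git describe, git log and git diff now also benefit from the "--find-object=<object-id>" option, which limits the findings to changes that involve the named object.
See the patch by Stefan Beller (stefanbeller), merged by Junio C Hamano -- gitster -- in commit c0d75f0, 23 Jan 2018.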
diffcore: add a pickaxe option to find a specific blob
Sometimes users are given a hash of an object and they want to identify it further (ex.: Use verify-pack to find the largest blobs, but what are these? Or this Stack Overflow question "Which commit has this blob?")
One might be tempted to extend git-describe to also work with blobs, such that git describe <blob-id> gives a description as '<commit-ish>:<path>'.
This was implemented here; as seen by the sheer number of responses (>110), it turns out this is tricky to get right.
The hard part to get right is picking the correct 'commit-ish' as that could be the commit that (re-)introduced the blob or the blob that removed the blob; the blob could exist in different branches.
Junio hinted at a different approach of solving this problem, which this patch implements.
Teach the diff machinery another flag for restricting the information to what is shown.
For example:
$ ./git log --oneline --find-object=v2.0.0:Makefile
b2feb64 Revert the whole "ask curl-config" topic for now
47fbfde i18n: only extract comments marked with "TRANSLATORS:"
we observe that the Makefile as shipped with 2.0 was appeared in v1.9.2-471-g47fbfded53 and in v2.0.0-rc1-5-gb2feb6430b.
The reason why these commits both occur prior to v2.0.0 are evil merges that are not found using this new mechanism.
As marcono1234 points out in the comments, you can combine this with the git log --all option, which is very useful when you don't know which branch contains the object.
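A minimal sketch of that combination, assuming a Git version that supports --find-object. The function name find_blob_commits is hypothetical and only wraps the git log call; it prints one line per commit whose diff adds or removes the blob.

```shell
#!/bin/sh
# find_blob_commits is a hypothetical helper name, not a git command.
# Usage: find_blob_commits <blob-hash> [<extra git-log arguments>...]
find_blob_commits() {
    blob="$1"
    shift
    # --all walks every ref; --find-object keeps only commits whose diff
    # involves the blob; --format prints "<short hash> <subject>".
    git log --all --find-object="$blob" --format='%h %s' "$@"
}
```

For example, find_blob_commits b3bb59f06644 would list the commits from the sample output above in one-line form. Note that a commit where the blob is merely present but unchanged is not listed, because the filter works on diffs.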
git describe would also be a good solution, since it has been taught to dig deeper into trees to find a <commit-ish>:<path> that refers to a given blob object.
See commit 644eb60, commit 4dbc59a, commit cdaed0c, commit c87b653, commit ce5b6f9 (16 Nov 2017), and commit 91904f5, commit 2deda00 (02 Nov 2017) by Stefan Beller (stefanbeller).
(Merged by Junio C Hamano -- gitster -- in commit 556de1a, 28 Dec 2017)
builtin/describe.c: describe a blob
Sometimes users are given a hash of an object and they want to identify it further (ex.: Use verify-pack to find the largest blobs, but what are these? or this very SO question "Which commit has this blob?")
When describing commits, we try to anchor them to tags or refs, as these are conceptually on a higher level than the commit. And if there is no ref or tag that matches exactly, we're out of luck.
So we employ a heuristic to make up a name for the commit. These names are ambiguous, there might be different tags or refs to anchor to, and there might be different path in the DAG to travel to arrive at the commit precisely.
When describing a blob, we want to describe the blob from a higher layer as well, which is a tuple of (commit, deep/path) as the tree objects involved are rather uninteresting.
The same blob can be referenced by multiple commits, so how we decide which commit to use?
This patch implements a rather naive approach on this: As there are no back pointers from blobs to commits in which the blob occurs, we'll start walking from any tips available, listing the blobs in-order of the commit and once we found the blob, we'll take the first commit that listed the blob.
For example:
git describe --tags v0.99:Makefile
conversion-901-g7672db20c2:Makefile
tells us the Makefile as it was in v0.99 was introduced in commit 7672db2.
The walking is performed in reverse order to show the introduction of a blob rather than its last occurrence.
That means the git describe man page adds to the purposes of this command:
When used as git describe <blob>, git describe gives an object a human-readable name based on an available ref. A blob is described as <commit-ish>:<path>, such that the blob can be found at <path> in the <commit-ish>, which itself describes the first commit in which this blob occurs in a reverse revision walk from HEAD.
Tree objects, as well as tag objects not pointing at commits, cannot be described.
When describing blobs, the lightweight tags pointing at blobs are ignored, but the blob is still described as <commit-ish>:<path> despite the lightweight tag being favorable.
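As a rough sketch of how this might be called from a script, assuming a Git version in which git describe accepts blobs: the helper below (describe_blob is a hypothetical name) asks for a <commit-ish>:<path> name and reports a failure when the blob cannot be described. The --always flag is an addition here so that a repository without tags still yields an abbreviated commit hash.

```shell
#!/bin/sh
# describe_blob is a hypothetical helper name, not a git command.
# Usage: describe_blob <blob-hash>
describe_blob() {
    # --always falls back to an abbreviated commit hash when no tag
    # can be used to name the commit.
    git describe --always "$1" 2>/dev/null || {
        echo "could not describe blob $1 (unknown or not reachable from HEAD)" >&2
        return 1
    }
}
```

The output can be fed straight to git show, since <commit-ish>:<path> is itself a valid object name.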
git describe works well in combination with
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | awk '/^blob/ {print substr($0,6)}' | sort --numeric-sort --key=2 -r | head -n 20
which returns the 20 largest blobs. You can then pass a blob ID from that output to the git describe command. Very effective, thanks! - Alexander Pogrebnyak
#!/usr/bin/perl -w
use strict;
my @commits;
my %trees;
my $blob;
sub blob_in_tree {
    my $tree = $_[0];
    # Cache results per tree so that each tree is inspected only once.
    if (defined $trees{$tree}) {
        return $trees{$tree};
    }
    my $r = 0;
    open(my $f, "git cat-file -p $tree|") or die $!;
    while (<$f>) {
        if (/^\d+ blob (\w+)/ && $1 eq $blob) {
            $r = 1;
        } elsif (/^\d+ tree (\w+)/) {
            $r = blob_in_tree($1);
        }
        last if $r;
    }
    close($f);
    $trees{$tree} = $r;
    return $r;
}
sub handle_commit {
    my $commit = $_[0];
    open(my $f, "git cat-file commit $commit|") or die $!;
    my $tree = <$f>;
    die unless $tree =~ /^tree (\w+)$/;
    if (blob_in_tree($1)) {
        print "$commit\n";
    }
    # Queue the parents so the whole history gets walked.
    while (1) {
        my $parent = <$f>;
        last unless $parent =~ /^parent (\w+)$/;
        push @commits, $1;
    }
    close($f);
}
if (!@ARGV) {
    print STDERR "Usage: git-find-blob blob [head ...]\n";
    exit 1;
}
# The first argument is the blob; any remaining arguments are heads.
$blob = shift @ARGV;
if (@ARGV) {
    foreach (@ARGV) {
        handle_commit($_);
    }
} else {
    handle_commit("HEAD");
}
while (@commits) {
    handle_commit(pop @commits);
}
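I'll put this on github when I get home this evening.
Update: It looks like somebody has already done this. That one uses the same general idea, but the details are different and the implementation is much shorter. I don't know which would be faster, but performance is probably not a concern here!
Update 2: For what it's worth, my implementation is orders of magnitude faster than the other one (linked in the first update), especially for a large repository. That git ls-tree -r really hurts.
Update 3: I should note that my performance comments above apply to the implementation I linked in the first update. Aristotle's implementation performs comparably to mine. More details are in the comments for those who are curious.
git rev-parse $commit^{}. - jthill
While the original question does not ask for it, I think it is also useful to check whether the staging area references a blob. I modified the original bash script to do this and found what was referencing a corrupt blob in my repository: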
#!/bin/sh
obj_name="$1"
shift
git ls-files --stage \
| if grep -q "$obj_name"; then
    echo Found in staging area. Run git ls-files --stage to see.
fi
# tformat: terminates every line, so read does not skip the last commit.
git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done
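So... I needed to find all files over a given size limit in a repo over 8GB in size, with over 108,000 revisions. I combined Aristotle's perl script with a ruby script I wrote to arrive at this complete solution.
First, run git gc. This ensures all objects are in packfiles; objects that are not in pack files are not scanned.
Next, run this script to locate all blobs over CUTOFF_SIZE bytes. Capture the output to a file such as "large-blobs.log".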
#!/usr/bin/env ruby
require 'log4r'
# The output of git verify-pack -v is:
# SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
GIT_PACKS_RELATIVE_PATH=File.join('.git', 'objects', 'pack', '*.pack')
# 10MB cutoff
CUTOFF_SIZE=1024*1024*10
#CUTOFF_SIZE=1024
begin
  include Log4r
  log = Logger.new 'git-find-large-objects'
  log.level = INFO
  log.outputters = Outputter.stdout
  git_dir = %x[ git rev-parse --show-toplevel ].chomp
  if git_dir.empty?
    log.fatal "ERROR: must be run in a git repository"
    exit 1
  end
  log.debug "Git Dir: '#{git_dir}'"
  pack_files = Dir[File.join(git_dir, GIT_PACKS_RELATIVE_PATH)]
  log.debug "Git Packs: #{pack_files.to_s}"
  # For details on this IO, see https://dev59.com/SXM_5IYBdhLWcg3w9oMA
  #
  # Short version is, git verify-pack flushes buffers only on line endings, so
  # this works, if it didn't, then we could get partial lines and be sad.
  types = {
    :blob => 1,
    :tree => 1,
    :commit => 1,
  }
  total_count = 0
  counted_objects = 0
  large_objects = []
  IO.popen("git verify-pack -v -- #{pack_files.join(" ")}") do |pipe|
    pipe.each do |line|
      # The output of git verify-pack -v is:
      # SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
      data = line.chomp.split(' ')
      # types are blob, tree, or commit
      # we ignore other lines (including ones without a type field)
      next unless data[1] && types[data[1].to_sym] == 1
      log.info "INPUT_THREAD: Processing object #{data[0]} type #{data[1]} size #{data[2]}"
      hash = {
        :sha1 => data[0],
        :type => data[1],
        :size => data[2].to_i,
      }
      total_count += hash[:size]
      counted_objects += 1
      if hash[:size] > CUTOFF_SIZE
        large_objects.push hash
      end
    end
  end
  log.info "Input complete"
  log.info "Counted #{counted_objects} totalling #{total_count} bytes."
  log.info "Sorting"
  large_objects.sort! { |a,b| b[:size] <=> a[:size] }
  log.info "Sorting complete"
  large_objects.each do |obj|
    log.info "#{obj[:sha1]} #{obj[:type]} #{obj[:size]}"
  end
  exit 0
end
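Next, edit the file to remove any blobs you don't want, along with the INPUT_THREAD bits at the top. Once only the lines for the SHA1s you want to find remain, run the following script like this:
cat edited-large-files.log | cut -d' ' -f4 | xargs git-find-blob | tee large-file-paths.log
The git-find-blob script is below.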
#!/usr/bin/perl
# taken from: https://dev59.com/JXVC5IYBdhLWcg3wpi98
# and modified by Carl Myers <cmyers@cmyers.org> to scan multiple blobs at once
# Also, modified to keep the discovered filenames
# vi: ft=perl
use 5.008;
use strict;
use Memoize;
use Data::Dumper;
my $BLOBS = {};
MAIN: {
    memoize 'check_tree';
    die "usage: git-find-blob <blob1> <blob2> ... -- [<git-log arguments ...>]\n"
        if not @ARGV;
    while ( @ARGV && $ARGV[0] ne '--' ) {
        my $arg = $ARGV[0];
        #print "Processing argument $arg\n";
        open my $rev_parse, '-|', git => 'rev-parse' => '--verify', $arg or die "Couldn't open pipe to git-rev-parse: $!\n";
        my $obj_name = <$rev_parse>;
        close $rev_parse or die "Couldn't expand passed blob.\n";
        chomp $obj_name;
        #$obj_name eq $ARGV[0] or print "($ARGV[0] expands to $obj_name)\n";
        print "($arg expands to $obj_name)\n";
        $BLOBS->{$obj_name} = $arg;
        shift @ARGV;
    }
    shift @ARGV; # drop the -- if present
    #print "BLOBS: " . Dumper($BLOBS) . "\n";
    foreach my $blob ( keys %{$BLOBS} ) {
        #print "Printing results for blob $blob:\n";
        open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
            or die "Couldn't open pipe to git-log: $!\n";
        while ( <$log> ) {
            chomp;
            my ( $tree, $commit, $subject ) = split " ", $_, 3;
            #print "Checking tree $tree\n";
            my $results = check_tree( $tree );
            #print "RESULTS: " . Dumper($results);
            if (%{$results}) {
                print "$commit $subject\n";
                foreach my $blob ( keys %{$results} ) {
                    print "\t" . (join ", ", @{$results->{$blob}}) . "\n";
                }
            }
        }
    }
}
sub check_tree {
    my ( $tree ) = @_;
    #print "Calculating hits for tree $tree\n";
    my @subtree;
    # results = { BLOB => [ FILENAME1 ] }
    my $results = {};
    {
        open my $ls_tree, '-|', git => 'ls-tree' => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";
        # example git ls-tree output:
        # 100644 blob 15d408e386400ee58e8695417fbe0f858f3ed424 filename.txt
        while ( <$ls_tree> ) {
            /\A[0-7]{6} (\S+) (\S+)\s+(.*)/
                or die "unexpected git-ls-tree output";
            #print "Scanning line '$_' tree $2 file $3\n";
            foreach my $blob ( keys %{$BLOBS} ) {
                if ( $2 eq $blob ) {
                    print "Found $blob in $tree:$3\n";
                    push @{$results->{$blob}}, $3;
                }
            }
            push @subtree, [$2, $3] if $1 eq 'tree';
        }
    }
    foreach my $st ( @subtree ) {
        # $st->[0] is tree, $st->[1] is dirname
        my $st_result = check_tree( $st->[0] );
        foreach my $blob ( keys %{$st_result} ) {
            foreach my $filename ( @{$st_result->{$blob}} ) {
                my $path = $st->[1] . '/' . $filename;
                #print "Generating subdir path $path\n";
                push @{$results->{$blob}}, $path;
            }
        }
    }
    #print "Returning results for tree $tree: " . Dumper($results) . "\n\n";
    return $results;
}
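The output looks like this:
<hash prefix> <oneline log message>
	path/to/file.txt
	path/to/file2.txt
	...
<hash prefix2> <oneline log msg...>
And so on. Every commit whose tree contains a large file is listed. If you grep for the lines that start with a tab and uniq the result, you get a list of all the paths that you can filter-branch to remove, or you can do something more complicated.
Let me reiterate: this process ran successfully on a 10GB repo with 108,000 commits. When run on a large number of blobs it took much longer than I predicted, over 10 hours; I will have to check whether the memoize bit is working...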
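The grep-and-uniq step described above could look roughly like this. It is a sketch that assumes the output was saved to large-file-paths.log as in the earlier command; the function name list_large_paths is made up for illustration. Because the Perl script joins several paths for one blob with ", ", the sketch also splits on that separator, which would mis-split a path that itself contains a comma followed by a space.

```shell
#!/bin/sh
# list_large_paths is a hypothetical helper name.
# Usage: list_large_paths [<log file>]   (defaults to large-file-paths.log)
list_large_paths() {
    # Path lines start with a tab; commit lines and "Found ..." lines do not.
    awk '/^\t/ {
        sub(/^\t/, "")                 # drop the leading tab
        n = split($0, paths, ", ")     # one line may hold several paths
        for (i = 1; i <= n; i++) print paths[i]
    }' "${1:-large-file-paths.log}" | sort -u
}
```

Using sort -u instead of a bare uniq avoids having to sort the paths first, since uniq only collapses adjacent duplicates.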
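Append -- --all to search every commit. (When removing large files from the repository history, it is important to find all commits across the whole repository.) - peterflynn
Use the value returned by git hash-object, or sha1("blob " + filesize + "\0" + data), not simply the sha1sum of the blob contents. - Ivan Hamilton
git log --follow filepath (you can use this if you need to speed up Aristotle's solution). - Zaz
Put the script in ~/.bin and name it git-find-object. You can then invoke it as git find-object. - Zaz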
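To illustrate the comment above about blob hashes: git hashes a header of the form "blob <size>" plus a NUL byte, followed by the file contents. Here is a minimal sketch, assuming a sha1sum binary is available (shasum -a 1 is a common substitute on macOS) and a repository that uses SHA-1 object names; blob_hash is a hypothetical helper name.

```shell
#!/bin/sh
# blob_hash is a hypothetical helper name.
# Usage: blob_hash <file>
blob_hash() {
    # Byte count of the file; tr strips the padding some wc versions add.
    size=$(wc -c < "$1" | tr -d ' ')
    # Header "blob <size>\0" followed by the raw contents, hashed together.
    { printf 'blob %s\000' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}
```

The result should match what git hash-object prints for the same file, which is the value the scripts in this thread expect as the blob argument.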