I have attempted to give a comparison of some of the methods presented here.

First I created a Perl script to generate the input files file1.txt and file2.txt. In order to compare some of the solutions, I made sure that the words from file1.txt could only appear in the second field of file2.txt. Also, to be able to use the join solution provided by @GeorgeVasiliou, I sorted file1.txt and file2.txt. Currently the input files are generated from only 75 random words (taken from https://www.randomlists.com/random-words). Only 5 of these 75 words are used in file1.txt; the remaining 70 words are used to fill the fields of file2.txt. It may be necessary to increase the number of words substantially to get realistic results (according to the OP, the original file1.txt contained 14000 words). In the tests below I used a file2.txt with 1000000 (1 million) lines. The script also generates the file regexp1.txt required by the grep solution of @BOC.

gen_input_files.pl:
use feature qw(say);
use strict;
use warnings;

use Data::Printer;
use Getopt::Long;

GetOptions ("num_lines=i" => \my $nlines )
  or die("Error in command line arguments\n");

my $word_filename        = 'words.txt';
my $num_match_words      = 5;
my $num_file2_lines      = $nlines || 1_000_000;
my $file2_words_per_line = 3;
my $file2_match_field_no = 2;
my $file1_filename       = 'file1.txt';
my $file2_filename       = 'file2.txt';
my $file1_regex_fn       = 'regexp1.txt';

say "generating $num_file2_lines lines..";
my ( $words1, $words2 ) = get_words( $word_filename, $num_match_words );

write_file1( $file1_filename, $words2 );
write_file2(
    $file2_filename, $words1, $words2, $num_file2_lines,
    $file2_words_per_line, $file2_match_field_no
);
write_BOC_regexp_file( $file1_regex_fn, $words2 );

sub write_BOC_regexp_file {
    my ( $fn, $words ) = @_;

    open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh '\\|' . (join "|", @$words) . '\\|';
    close $fh;
}

sub write_file2 {
    my ( $fn, $words1, $words2, $nlines, $words_per_line, $field_no ) = @_;

    my $nwords1 = scalar @$words1;
    my $nwords2 = scalar @$words2;
    my @lines;
    for (1..$nlines) {
        my @words_line;
        my $key;
        for (1..$words_per_line) {
            my $word;
            if ( $_ != $field_no ) {
                my $index = int (rand $nwords1);
                $word = @{ $words1 }[$index];
            }
            else {
                my $index = int (rand($nwords1 + $nwords2) );
                if ( $index < $nwords2 ) {
                    $word = @{ $words2 }[$index];
                }
                else {
                    $word = @{ $words1 }[$index - $nwords2];
                }
                $key = $word;
            }
            push @words_line, $word;
        }
        push @lines, [$key, (join "|", @words_line)];
    }
    @lines = map { $_->[1] } sort { $a->[0] cmp $b->[0] } @lines;
    open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh (join "\n", @lines);
    close $fh;
}

sub write_file1 {
    my ( $fn, $words ) = @_;

    open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
    print $fh (join "\n", sort @$words);
    close $fh;
}

sub get_words {
    my ( $fn, $N ) = @_;

    open( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
    my @words = map {chomp $_; $_} <$fh>;
    close $fh;

    my @words1 = @words[$N..$#words];
    my @words2 = @words[0..($N - 1)];
    return ( \@words1, \@words2 );
}
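To make the file layout concrete, here is what a generated data set might look like (the words shown are hypothetical; the actual words come from words.txt). file1.txt is a sorted word list, file2.txt has three pipe-separated fields per line and is sorted on field #2, and regexp1.txt holds the single regexp used by the @BOC solution:

$ head -3 file1.txt
apple
branch
cloud

$ head -3 file2.txt
grove|apple|stone
pillow|apple|mirror
carpet|basket|ladder

$ cat regexp1.txt
\|apple|branch|cloud|desk|ember\|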
Next I created a sub-folder solutions containing all the test cases:
$ tree solutions/
solutions/
├── BOC1
│ ├── out.txt
│ └── run.sh
├── BOC2
│ ├── out.txt
│ └── run.sh
├── codeforester
│ ├── out.txt
│ ├── run.pl
│ └── run.sh
[...]
Here the file out.txt is the output from the greps for each solution, and the script run.sh runs the solution for the given test case.
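For example, a run.sh could be as simple as the following hypothetical sketch for the BOC1 case (the relative paths to the data files are an assumption about the local layout):

#!/bin/bash
# Run the BOC1 solution and store its matches in out.txt
grep -E -f ../../regexp1.txt ../../file2.txt > out.txt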
Notes on the different solutions
BOC1: First solution presented by @BOC:

grep -E -f regexp1.txt file2.txt

BOC2: Second solution suggested by @BOC:

LC_ALL=C grep -E -f regexp1.txt file2.txt

codeforester: Accepted Perl solution by @codeforester (see source)

codeforester_orig: Original solution presented by @codeforester:

fgrep -f file1.txt file2.txt

dawg: Python solution using a dictionary and line splitting, proposed by @dawg (see source)

gregory1: Solution using GNU Parallel suggested by @gregory:

parallel -k --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt

See the note below regarding how to choose $block_size.
hakon1: Perl solution provided by @HåkonHægland (see source). This solution requires compilation of the C extension the first time the code is run. It does not require recompilation when file1.txt or file2.txt changes. Note: The time used to compile the C extension at the initial run is not included in the run times presented below.

ikegami: Solution using an assembled regexp with grep -P, as given by @ikegami. Note: The assembled regexp was written to a separate file regexp_ikegami.txt, so the runtime of generating the regexp is not included in the comparison below. This is the code used:

regexp=$(< "regexp_ikegami.txt")
grep -P "$regexp" file2.txt

inian1: First solution by @Inian using match():

awk 'FNR==NR{
    hash[$1]; next
}
{
    for (i in hash) if (match($0,i)) {print; break}
}' file1.txt FS='|' file2.txt
inian2: Second solution by @Inian using index():

awk 'FNR==NR{
    hash[$1]; next
}
{
    for (i in hash) if (index($0,i)) {print; break}
}' file1.txt FS='|' file2.txt

inian3: Third solution by @Inian checking only the $2 field:

awk 'FNR==NR{
    hash[$1]; next
}
$2 in hash' file1.txt FS='|' file2.txt

inian4: Fourth solution by @Inian (basically the same as codeforester_orig with LC_ALL):

LC_ALL=C fgrep -f file1.txt file2.txt

inian5: Fifth solution by @Inian (same as inian1 but with LC_ALL):

LC_ALL=C awk 'FNR==NR{
    hash[$1]; next
}
{
    for (i in hash) if (match($0,i)) {print; break}
}' file1.txt FS='|' file2.txt

inian6: Same as inian3 but with LC_ALL=C. Thanks to @GeorgeVasiliou for the suggestion.

jjoao: Compiled flex-generated C code as proposed by @JJoao (see source). Note: The executable must be recompiled each time file1.txt changes. The time used to compile the executable is not included in the run times presented below.
oliv: Python script provided by @oliv (see source)

Vasiliou: Using join as suggested by @GeorgeVasiliou:

join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 file1.txt file2.txt

Vasiliou2: Same as Vasiliou but with LC_ALL=C.

zdim: Perl script provided by @zdim (see source). Note: This uses the regexp search version (instead of the split-line solution).

zdim2: Same as zdim except that it uses the split function instead of a regexp search for the field in file2.txt.
Notes

I experimented a little bit with GNU Parallel (see the gregory1 solution above) to determine the optimal block size for my CPU. I have 4 cores, and currently it seems that the optimal choice is to divide the file (file2.txt) into 4 equal-sized chunks and run a single job on each of the 4 processors. More testing might be needed here. So for the first test case, where file2.txt is 20M, I set $block_size to 5M (see the gregory1 solution above), whereas for the more realistic case presented below, where file2.txt is 268M, a $block_size of 67M was used.
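A minimal sketch of how such a block size could be derived automatically is shown below. This is an assumption, not the exact helper used by run_all.pl, and it assumes GNU coreutils (stat, nproc) are available:

# Split file2.txt into one chunk per core and run the gregory1 command
ncores=$(nproc)
fsize=$(stat -c '%s' file2.txt)
block_size=$(( (fsize + ncores - 1) / ncores ))   # round up so we get exactly $ncores chunks
parallel -k --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt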
The solutions BOC1, BOC2, codeforester_orig, inian1, inian4, inian5, and gregory1 all used loose matching, meaning that the words from file1.txt did not have to match exactly in field #2 of file2.txt; a match anywhere on the line was accepted. Since this behavior made it more difficult to compare them with the other methods, some modified methods were also introduced. The first two, called BOC1B and BOC2B, used a modified regexp1.txt file. The lines in the original regexp1.txt were of the form \|foo1|foo2|...|fooN\|, which would match the words at any field boundary. The modified file, regexp1b.txt, anchored the match to field #2 exclusively, using the form ^[^|]*\|foo1|foo2|...|fooN\| instead.

The rest of the modified methods, codeforester_origB, inian1B, inian4B, inian5B, and gregory1B, used a modified file1.txt. Instead of a literal word per line, the modified file file1b.txt used one regex per line, of the form:

^[^|]*\|word1\|
^[^|]*\|word2\|
^[^|]*\|word3\|
[...]

and in addition, fgrep -f was replaced by grep -E -f for these methods.
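As a sketch of how the anchored variants can be produced from the original files (the exact commands used are not shown in the original post), file1b.txt and regexp1b.txt could be generated like this:

# Turn each literal word in file1.txt into a regex anchored to field #2
sed 's/.*/^[^|]*\\|&\\|/' file1.txt > file1b.txt

# Prefix the single-line BOC regexp with the same field-#2 anchor
sed 's/^/^[^|]*/' regexp1.txt > regexp1b.txt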
Running the tests

Here is the script used to run all the tests. It uses the Bash time command to record the time spent by each script. Note that the time command returns three different times: real, user, and sys. Initially I used user + sys, but I realized that this was incorrect when using GNU Parallel, so the times reported below are now the real part returned by time. See this question for more information about the different times returned by time.
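For illustration, the real value can be picked out of the POSIX time -p output as sketched below (run_all.pl does the equivalent parsing in Perl in its get_run_times helper; the awk one-liner here is just an assumption-free shell sketch):

# Run one case and keep only the "real" value from the `time -p` report
out=$( { time -p ./run.sh; } 2>&1 )   # run.sh is assumed to write its matches to out.txt itself
real=$(awk '/^real/ { print $2 }' <<< "$out")
echo "real: ${real}s"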
The first test uses a file1.txt with 5 lines and a file2.txt with 1000000 lines. Below are the first 52 lines of the run_all.pl script; the rest of the script is available here.

run_all.pl
use feature qw(say);
use strict;
use warnings;

use Cwd;
use Getopt::Long;
use Data::Printer;
use FGB::Common;
use List::Util qw(max shuffle);
use Number::Bytes::Human qw(format_bytes);
use Sys::Info;

GetOptions (
    "verbose"       => \my $verbose,
    "check"         => \my $check,
    "single-case=s" => \my $case,
    "expected=i"    => \my $expected_no_lines,
) or die("Error in command line arguments\n");

my $test_dir    = 'solutions';
my $output_file = 'out.txt';
my $wc_expected = $expected_no_lines;
my $tests       = get_test_names( $test_dir, $case );
my $file2_size  = get_file2_size();
my $num_cpus    = Sys::Info->new()->device( CPU => () )->count;

chdir $test_dir;
my $cmd = 'run.sh';
my @times;
for my $case (@$tests) {
    my $savedir = getcwd();
    chdir $case;
    say "Running '$case'..";
    my $arg    = get_cmd_args( $case, $file2_size, $num_cpus );
    my $output = `bash -c "{ time -p $cmd $arg; } 2>&1"`;
    my ($user, $sys, $real ) = get_run_times( $output );
    print_timings( $user, $sys, $real ) if $verbose;
    check_output_is_ok( $output_file, $wc_expected, $verbose, $check );
    print "\n" if $verbose;
    push @times, $real;
    chdir $savedir;
}

say "Done.\n";
print_summary( $tests, \@times );
Results

Here is the output from running the tests:
$ run_all.pl --verbose
Running 'inian3'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.00 )
..no of output lines: 66711
Running 'inian2'..
..finished in 0.73 seconds ( user: 0.73, sys: 0.00 )
..no of output lines: 66711
Running 'Vasiliou'..
..finished in 0.09 seconds ( user: 0.08, sys: 0.00 )
..no of output lines: 66711
Running 'codeforester_orig'..
..finished in 0.05 seconds ( user: 0.05, sys: 0.00 )
..no of output lines: 66711
Running 'codeforester'..
..finished in 0.45 seconds ( user: 0.44, sys: 0.01 )
..no of output lines: 66711
[...]
Summary

[Results obtained by @Vasiliou are shown in the middle column.]
                               |Vasiliou
My Benchmark                   |Results  | Details
-------------------------------|---------|------------------------
inian4             : 0.04s     |0.22s    | LC_ALL fgrep -f [loose]
codeforester_orig  : 0.05s     |         | fgrep -f [loose]
Vasiliou2          : 0.06s     |0.16s    | [LC_ALL join [requires sorted files]]
BOC1               : 0.06s     |         | grep -E [loose]
BOC2               : 0.07s     |15s      | LC_ALL grep -E [loose]
BOC2B              : 0.07s     |         | LC_ALL grep -E [strict]
inian4B            : 0.08s     |         | LC_ALL grep -E -f [strict]
Vasiliou           : 0.08s     |0.23s    | [join [requires sorted files]]
gregory1B          : 0.08s     |         | [parallel + grep -E -f [strict]]
ikegami            : 0.1s      |         | grep -P
gregory1           : 0.11s     |0.5s     | [parallel + fgrep -f [loose]]
hakon1             : 0.14s     |         | [perl + c]
BOC1B              : 0.14s     |         | grep -E [strict]
jjoao              : 0.21s     |         | [compiled flex generated c code]
inian6             : 0.26s     |0.7s     | [LC_ALL awk + split+dict]
codeforester_origB : 0.28s     |         | grep -E -f [strict]
dawg               : 0.35s     |         | [python + split+dict]
inian3             : 0.44s     |1.1s     | [awk + split+dict]
zdim2              : 0.4s      |         | [perl + split+dict]
codeforester       : 0.45s     |         | [perl + split+dict]
oliv               : 0.5s      |         | [python + compiled regex + re.search()]
zdim               : 0.61s     |         | [perl + regexp+dict]
inian2             : 0.73s     |1.7s     | [awk + index($0,i)]
inian5             : 18.12s    |         | [LC_ALL awk + match($0,i) [loose]]
inian1             : 19.46s    |         | [awk + match($0,i) [loose]]
inian5B            : 42.27s    |         | [LC_ALL awk + match($0,i) [strict]]
inian1B            : 85.67s    |         | [awk + match($0,i) [strict]]
Vasiliou results: 2 x CPU Intel 2 Duo T6570 @ 2.10GHz, 2 GB RAM, Debian Testing 64-bit, kernel 4.9.0.1, no CPU frequency scaling.
A more realistic test case

I then created a more realistic case where file1.txt has 100 words and file2.txt has 10 million lines (268 MB file size). I extracted 1000 random words from the dictionary at /usr/share/dict/american-english using shuf -n1000 /usr/share/dict/american-english > words.txt, then selected 100 of these words for file1.txt and constructed file2.txt in the same way as described above for the first test case. Note that the dictionary file is UTF-8 encoded, and I stripped all non-ASCII characters from words.txt.
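One possible way to do that filtering (a sketch; the exact command used is not shown here, and this variant drops any word containing a non-ASCII character rather than editing it):

# Keep only words consisting entirely of printable ASCII characters
LC_ALL=C grep -v '[^ -~]' words.txt > words_ascii.txt && mv words_ascii.txt words.txt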
Then I ran the tests, but without the three slowest methods from the previous case: inian1, inian2, and inian5. Here are the new results:
gregory1           : 0.86s  | [parallel + fgrep -f [loose]]
Vasiliou2          : 0.94s  | [LC_ALL join [requires sorted files]]
inian4B            : 1.12s  | LC_ALL grep -E -f [strict]
BOC2B              : 1.13s  | LC_ALL grep -E [strict]
BOC2               : 1.15s  | LC_ALL grep -E [loose]
BOC1               : 1.18s  | grep -E [loose]
ikegami            : 1.33s  | grep -P
Vasiliou           : 1.37s  | [join [requires sorted files]]
hakon1             : 1.44s  | [perl + c]
inian4             : 2.18s  | LC_ALL fgrep -f [loose]
codeforester_orig  : 2.2s   | fgrep -f [loose]
inian6             : 2.82s  | [LC_ALL awk + split+dict]
jjoao              : 3.09s  | [compiled flex generated c code]
dawg               : 3.54s  | [python + split+dict]
zdim2              : 4.21s  | [perl + split+dict]
codeforester       : 4.67s  | [perl + split+dict]
inian3             : 5.52s  | [awk + split+dict]
zdim               : 6.55s  | [perl + regexp+dict]
gregory1B          : 45.36s | [parallel + grep -E -f [strict]]
oliv               : 60.35s | [python + compiled regex + re.search()]
BOC1B              : 74.71s | grep -E [strict]
codeforester_origB : 75.52s | grep -E -f [strict]
Note

The grep-based solutions that looked for a match anywhere on the whole line produced some false matches: the methods codeforester_orig, BOC1, BOC2, gregory1, inian4, and oliv extracted 1,087,609 lines from file2.txt, whereas the other methods extracted the correct 997,993 lines.
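As a hypothetical illustration of such a false match (the words here are made up): if file1.txt contains the word apple, a loose search selects a line where apple only appears in field #3, while the field-anchored pattern does not:

$ printf 'pear|plum|apple\n' | grep -F 'apple'           # loose: line is (wrongly) selected
pear|plum|apple
$ printf 'pear|plum|apple\n' | grep -E '^[^|]*\|apple\|'  # strict: no output, field #2 is "plum"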