Improving the performance of Perl code

Question

3

The following program works, but it takes forever on large amounts of data.
INPUT.txt: in reality I have up to 1000 lines, each with 1 to 100 elements.

10  
6  
9  
7  
9 11  
3 4  
1 9  
5 12  
1 11  
5 11  
9 12  
10 5 8  
7 4 1
and so on...  
last: 1 2 3 4 5 6 7 . . .any number of elements (100 in my case).

matrix.txt (TAB-separated)

1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   1   1   1   
1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   
1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   0   1   1   1   1   0   0   1   1   1   1   1   1   
1   1   1   1   1   1   0   1   1   1   1   1   1   1   0   1   1   0   1   1   1   1   0   1   0   0   1   1   
1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   0   
1   0   1   1   1   1   0   1   1   1   1   0   1   1   0   1   1   0   1   1   1   1   0   1   0   1   1   1   
1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   1   1   1   
1   0   1   1   1   1   0   1   1   1   1   0   1   1   0   0   1   0   1   1   1   1   1   1   0   0   1   1   
1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   0   1   0   1   0   1   1   1   1   1   1   1   0   
and so on....upto 25000 lines

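output.txt
Each value is the sum of the elements of a matrix.txt row taken at the index positions listed on one line of INPUT.txt.
The actual sums may differ from this hypothetical example output.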

1   1   1   1   1   0   1   1   1   2   2   2   2   2 . . .columns upto number of lines in input.txt
1   1   1   1   1   1   1   1   1   2   2   2   2   2
1   0   0   1   1   1   1   1   1   2   2   2   2   2
1   1   1   0   1   0   0   1   1   2   2   2   2   2
1   1   1   1   1   1   1   1   0   2   2   2   2   2
1   1   1   0   1   0   1   1   1   1   2   2   2   2
1   1   1   1   1   0   1   1   1   2   2   2   2   2
1   1   1   1   1   0   0   1   1   1   2   2   2   2
0   1   1   1   1   1   1   1   0   2   2   2   2   2

Code: take a look at the code; it will help you understand what is going on.

use List::Util 'sum';
my @indexes = do {
    open my $fh, '<', "INPUT.txt";
    map { [map {$_ - 1} split ' '] } <$fh>
};
open my $infh, '<', "matrix.txt";
open OUT, '>', "output.txt";
while (<$infh>) {
    my @vals = split ' ';
    print OUT join('    ', map {sum(@vals[@$_])} @indexes), "\n";
}
close OUT;

Is there another way to get this done in less time?

File availability:
Input file: https://www.dropbox.com/s/48ikhnfs7gzk8vm/input.txt?dl=0
Matrix file: https://www.dropbox.com/s/ebxi608eday9z1e/matrix.txt?dl=0

- BioDeveloper

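Could you provide links to your input.txt (and matrix.txt)? - osirisgothra

@osirisgothra Input: https://www.dropbox.com/s/48ikhnfs7gzk8vm/input.txt?dl=0 Matrix: https://www.dropbox.com/s/ebxi608eday9z1e/matrix.txt?dl=0 - BioDeveloper

Have you profiled the program? If not, try Devel::NYTProf to find where the bottleneck is. Also, your problem looks like a task that PDL is well suited to. It is hard to get into, but if you expect to run into similar tasks it may be worth learning. It depends on Module::Compile, whose latest version fails its tests on many systems. Try installing it like this: cpanm Module::Compile@0.30 to get a version that works on most systems. - Patrick J. S.

Are the lines in matrix.txt of fixed size? - Patrick J. S.

@PatrickJ.S. Yes, fixed size. - BioDeveloper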

3 Answers

2

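I made a PDL version that relies on the fact that you are essentially doing a matrix multiplication with selection vectors. This version assumes each matrix row always contains 100 elements. If that is not the case, you will have to change the zeroes calls accordingly.

For inputs of size (1 000 x 100) and (25 000 x 100), it runs about twice as fast. Reading the whole matrix into memory and then processing it gives the same run time, although it might be faster with parallel processing enabled. If you want an approximate lower bound on the run time, an optimized C version is about 4 times faster than this version (8 times faster than the original). All timings are of course relative to my machine, but I would expect similar ratios on most computers. I also don't claim that my PDL is optimal, since I used this as an excuse to learn it.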

use strict;
use warnings;

use PDL;

my $indexes = PDL::long(do {
    open(my $fh, '<', 'INPUT.txt') or die;
    # The first map is if you allow duplicates in the index list (i.e. 2 2 is a valid row)
    # map { my $p = zeroes(100); $p->slice($_)++ foreach (map {$_ - 1} split /\t/); $p } <$fh>
    map { zeroes(100)->dice([map {$_ - 1} split /\t/])++ } <$fh>
})->xchg(0, 1);

open(my $input, '<', 'matrix.txt') or die;
open(my $output, '>', 'output.txt') or die;

while(<$input>) {
    my $vals = PDL::long(split(/\t/));
    print $output join("\t", ($vals x $indexes)->list) . "\n";
}
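
The "selection vector" idea can be illustrated without PDL. The following is only a minimal sketch in plain Perl, not part of the PDL answer: the helper names direct_sum and dot_sum and the sample row are made up for illustration. It shows that slicing a row by an index list and summing gives the same number as a dot product between the row and a 0/1 (or count) vector built from that index list.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Form used in the question: take a slice of the row, then sum it.
sub direct_sum {
    my ($vals, $index_row) = @_;
    return sum(0, @{$vals}[@$index_row]);
}

# Selection-vector form: build a count vector as wide as the row,
# then take the dot product with the row.
sub dot_sum {
    my ($vals, $index_row) = @_;
    my @select = (0) x @$vals;
    $select[$_]++ for @$index_row;      # duplicates simply count twice
    my $dot = 0;
    $dot += $vals->[$_] * $select[$_] for 0 .. $#$vals;
    return $dot;
}

# Hypothetical data: the input line "10 5 8" becomes 0-based (9, 4, 7).
my @index_row = (9, 4, 7);
my @vals      = (1, 1, 1, 1, 0, 1, 0, 1, 1, 1);

printf "direct=%d dot=%d\n",
    direct_sum(\@vals, \@index_row), dot_sum(\@vals, \@index_row);
# prints "direct=2 dot=2"
```

Stacking one such vector per INPUT.txt line gives a matrix, and multiplying a row of matrix.txt by it yields the whole output line at once, which is what the x operator does in the PDL code.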

- Tim Tom

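A few memory (and possibly space) changes: the first PDL::long should be indx, and for the second you can actually use short. - Patrick J. S.

Thanks for the suggestion! I didn't see indx in the online docs, but now I know it is the right type. As for short, I wasn't sure what range his matrix data covers (I suppose I should have used double given my lack of knowledge, but since everything in the example is a single bit, I defaulted to an integer type). - Tim Tom

Actually, the two data types should match. After the array is passed to dice, what is really used is a counter (which I know is at most one byte, but since I multiply it with the second data set, the types should match to cut down on type conversions). - Tim Tom

Yes, I tried solving this with PDL under the assumption that the matrix only holds 1s and 0s, but didn't get much of a speedup (I guess because of type conversion). The only other thing I found is that you can replace <$input> with a read, or set $/ to a reference to an integer; since you know the exact size, this saves some scanning and lets you read several lines at once. - Patrick J. S.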
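
The $/ suggestion in the last comment can be sketched as follows. This is an illustration under assumptions, not code from the thread: the helper name read_fixed_records and its parameters are hypothetical, and it assumes every line of the file has exactly the same byte length (newline included), which holds for matrix.txt only if all values have the same width.

```perl
use strict;
use warnings;

# Hypothetical helper: read a file whose lines all have $line_len bytes
# (including the newline), fetching $lines_per_read lines per read.
sub read_fixed_records {
    my ($path, $line_len, $lines_per_read) = @_;
    open my $fh, '<', $path or die "$path: $!";
    my $bytes = $line_len * $lines_per_read;
    local $/ = \$bytes;    # a reference to an integer means "fixed-size records"
    my @rows;
    while (my $block = <$fh>) {
        push @rows, split /\n/, $block;    # the last block may be shorter
    }
    close $fh;
    return \@rows;
}
```

With 100 single-digit, TAB-separated values per line, $line_len would be 200 if each line ends in a single newline, but that figure should be measured on the real file rather than assumed.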

0

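Do you have a better idea of which "bit" is hurting performance?

The reason I ask is that there is a holy trinity of performance bottlenecks:

CPU - the operations actually executed on the processor.
"Active" memory - the size of your memory footprint compared with available RAM, and how much you are shuffling around.
IO - transferring data to and from disk.

You can often trade one of these off against another, for example gaining CPU efficiency by building lookup tables.

Operations like map are where I would start looking closely: map / sort / grep are very powerful, but may use less-than-optimal algorithms.

If you are CPU bound, you could try threading or forking to get access to more CPUs. At first glance, your "matrix.txt" processing looks like it has no dependencies (each line is independent), so it is probably a good candidate for parallel processing.

I would look at wrapping that while loop with Parallel::ForkManager. The downside is that the output order becomes non-deterministic, which you would need to deal with.

So a starting point might be: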

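use strict;
use warnings;
use List::Util 'sum';
use Fcntl qw(:flock);

use Parallel::ForkManager;

my $mgr = Parallel::ForkManager->new(10);

my @indexes = do {
    open my $fh, '<', "INPUT.txt" or die "INPUT.txt: $!";
    map {
        [ map { $_ - 1 } split ' ' ]
    } <$fh>;
};
open my $infh, '<', "matrix.txt" or die "matrix.txt: $!";

# Truncate the output once; each child then appends through its own handle,
# because an flock on a handle inherited across fork is shared by all children.
open my $trunc_fh, '>', "output.txt" or die "output.txt: $!";
close $trunc_fh;

while (<$infh>) {
    $mgr->start and next;    # parent: move on to the next line
    my @vals = split ' ';
    my $output_line
        = join( '    ', map { sum( @vals[@$_] ) } @indexes ) . "\n";
    open my $out_fh, '>>', "output.txt" or die "output.txt: $!";
    flock( $out_fh, LOCK_EX ) or die "flock: $!";
    print {$out_fh} $output_line;
    close $out_fh;           # flushes the line and releases the lock
    $mgr->finish;            # child exits here instead of continuing the loop
}
$mgr->wait_all_children;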

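Note - this works, but the output order is random, which is almost certainly not what you want. It will, however, use up to 10 processors at once for the "join/map/sum" work.

(Of course, if you are IO bound, this doesn't help much.)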
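
One way to keep forked workers while still getting the lines back in order is sketched below. It is a different technique from the Parallel::ForkManager loop: it uses plain fork from core Perl, gives each child one contiguous block of rows, has the child write to its own part file, and lets the parent read the parts back in block order. The helper names row_sums and ordered_parallel_sums are hypothetical, and the sketch assumes the rows fit in memory.

```perl
use strict;
use warnings;
use List::Util qw(sum);
use File::Temp qw(tempdir);
use POSIX ();

# Sum the selected columns of one matrix row for every index list.
sub row_sums {
    my ($row, $indexes) = @_;
    my @vals = split ' ', $row;
    return join("\t", map { sum(0, @vals[@$_]) } @$indexes);
}

# Hypothetical helper: process @$rows with up to $workers forked children
# and return the output lines in the original row order.
sub ordered_parallel_sums {
    my ($rows, $indexes, $workers) = @_;
    my $dir   = tempdir(CLEANUP => 1);
    my $chunk = int((@$rows + $workers - 1) / $workers) || 1;
    my @jobs;
    for my $w (0 .. $workers - 1) {
        my $first = $w * $chunk;
        last if $first > $#$rows;
        my $last = $first + $chunk - 1;
        $last = $#$rows if $last > $#$rows;
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {    # child: write its block to its own part file
            open my $out, '>', "$dir/part.$w" or POSIX::_exit(1);
            print {$out} row_sums($_, $indexes), "\n"
                for @{$rows}[$first .. $last];
            close $out or POSIX::_exit(1);
            POSIX::_exit(0);    # skip END blocks and temp-dir cleanup in the child
        }
        push @jobs, [$pid, $w];
    }
    my @lines;
    for my $job (@jobs) {       # parent: collect the parts in block order
        my ($pid, $w) = @$job;
        waitpid($pid, 0);
        die "worker $w failed" if $?;
        open my $in, '<', "$dir/part.$w" or die "part.$w: $!";
        push @lines, <$in>;
        close $in;
    }
    chomp @lines;
    return \@lines;
}
```

Forking once per block rather than once per matrix line also avoids paying the process start-up cost 25000 times.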

But for synchronising IO, I find threads a pretty good choice:

 use warnings;
 use strict;

use List::Util 'sum';

use threads; 
use Thread::Queue;

my $line_q = Thread::Queue -> new(); 
my $output_q = Thread::Queue -> new(); 


    my @indexes = do {
        open my $fh, '<', "INPUT.txt";
        map {
            [ map { $_ - 1 } split ' ' ]
        } <$fh>;
};


sub generate_output {
   while ( my $item = $line_q -> dequeue() ) {
   print "processing $item \n";
       my ( $line_num, @vals ) = split ( ' ', $item );           
       $output_q -> enqueue($line_num.":". join('    ', map {sum(@vals[@$_])} @indexes ). "\n");
   }
}

sub coalesce_output {
    open my $out_fh, '>', "output.txt";
    my $current_line = 1;    # $. starts counting at 1
    my %lines;
    while ( my $item = $output_q -> dequeue ) {
        my ( $line_num, $output_line ) = split ( ":", $item );
        if ( $line_num == $current_line ) { 
            print {$out_fh} $output_line;
            $current_line++; 
        }
        else {
           $lines{$line_num} = $output_line; 
        }
        while ( defined $lines{$current_line} ) {
            print {$out_fh} $lines{$current_line};
            delete $lines{$current_line};
            $current_line++;
        }
    }
}




open my $infh,   '<', "matrix.txt";

my @workers;
for ( 1..10 ) {
  push ( @workers, threads -> create ( \&generate_output ) ); 
}

threads -> create ( \&coalesce_output );

while (my $line = <$infh>) {
    $line_q -> enqueue ( "$. $line" );
}

$line_q -> end();
foreach my $thr ( @workers ) {
  $thr -> join(); 
}

$output_q -> end();
foreach my $thr ( threads -> list ) { $thr -> join(); }

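For example, you can start 10 "worker threads" in parallel to do the summing, and one "output thread" to write the data in the correct order.

So something like this: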

use warnings;
use strict;

use List::Util 'sum';

use threads;
use Thread::Queue;

my $line_q   = Thread::Queue->new();
my $output_q = Thread::Queue->new();

my @indexes = do {
    open my $fh, '<', "INPUT.txt";
    map {
        [ map { $_ - 1 } split ' ' ]
    } <$fh>;
};


sub generate_output {
    while ( my $item = $line_q->dequeue() ) {

        #print "processing $item \n";
        my ( $line_num, @vals ) = split( ' ', $item );
        $output_q->enqueue( $line_num . ":"
                . join( '    ', map { sum( @vals[@$_] ) } @indexes )
                . "\n" );
    }
}

sub coalesce_output {
    open my $out_fh, '>', "output.txt";
    my $current_line = 1;
    my %lines;
    while ( my $item = $output_q->dequeue ) {

        my ( $line_num, $output_line ) = split( ":", $item );

        #     print "Got $line_num ($current_line) $item\n";
        if ( $line_num == $current_line ) {

            #   print "printing $current_line = $output_line\n";
            print {$out_fh} $output_line;
            $current_line++;
        }
        else {
            $lines{$line_num} = $output_line;
        }
        while ( defined $lines{$current_line} ) {

    #   print "printing  (while) $current_line = $lines{$current_line}\n";
            print {$out_fh} $lines{$current_line};
            delete $lines{$current_line};
            $current_line++;
        }
    }
}


open my $infh, '<', "matrix.txt";

my @workers;
for ( 1 .. 40 ) {
    push( @workers, threads->create( \&generate_output ) );
}

threads->create( \&coalesce_output );

while ( my $line = <$infh> ) {
    $line_q->enqueue("$. $line");
}

$line_q->end();
foreach my $thr (@workers) {
    $thr->join();
}

$output_q->end();
foreach my $thr ( threads -> list ) { $thr -> join(); }

Produces (+ more):

 1    1    1    1    1    1    1    1    1    1    1    1    2    2    2    2
 2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2
 2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2
 2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2
 2    2    2    2    2    2    2    2    2    2    2    2    2    2    3    3
 3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3
 3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3

Ultimately, it depends on what your limiting factor is.

A quick and dirty test gives:

Started at 1417007048,
finished at 1417007064
Took:16s

versus

Started at 1417007118
finished at 1417007161
Took:43s

(I have not fully verified the output of either.)
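
Timings in this "Started at / finished at / Took" shape can be collected with a small wrapper like the one below. It is only a sketch of how such a measurement might be taken, not the harness actually used for the numbers above; the helper name timed is made up, and it measures whole seconds of wall-clock time with the built-in time function.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code and report wall-clock start, finish and
# elapsed time in whole seconds (epoch seconds from the built-in time()).
sub timed {
    my ($code) = @_;
    my $start = time;
    $code->();
    my $end = time;
    print "Started at $start\nfinished at $end\nTook:", $end - $start, "s\n";
    return $end - $start;
}

# Usage: wrap whichever variant is being measured in a code reference.
timed( sub { my $x = 0; $x += $_ for 1 .. 1_000_000; } );
```

For sub-second resolution, Time::HiRes exports a drop-in replacement for time.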

- Sobrique


- Doggerel · Accepted Answer

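One thing to try is the maths and matrix modules on CPAN. Some of them use native code (i.e. C extensions to Perl) and should in theory be faster. Here is a (somewhat dated) introduction: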

http://www.perlmonks.org/?node_id=284324