Counting non-printable characters with Perl

Question

perl ascii non-ascii-characters non-printable

4

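I have hundreds of thousands of files to analyze, and I want to calculate the percentage of printable characters in an arbitrarily sized sample of each file. The files come from a variety of platforms (mainframe, Windows, Unix, and so on), so they are likely to contain binary and control characters.

I started out with the Linux "file" command, but it does not give enough detail for my purposes. The code below does what I want, but it does not always work.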

    #!/usr/bin/perl -n

    use strict;
    use warnings;

    my $cnt_n_print = 0;
    my $cnt_print = 0;
    my $cnt_total = 0;
    my $prc_print = 0;

    #Count the number of non-printable characters
    while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++};

    #Count the number of printable characters
    while ($_ =~ m/[[:print:]]/g) {$cnt_print++};

    $cnt_total = $cnt_n_print + $cnt_print;
    $prc_print = $cnt_print/$cnt_total;

    #Print the # total number of bytes read followed by the % printable
    print "$cnt_total|$prc_print\n"

Here is a test invocation that works:

    echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl

This is how I intend to call it, and it works for a single file:

    find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

This one does not work correctly:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

Neither does this:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl

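Instead of running the script once for each file that find returns, it runs only once over all of the results combined.

Thanks.

Research so far:

Pipes, xargs, and separators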

http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html

http://en.wikipedia.org/wiki/Xargs#The_separator_problem

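Clarifications:
1.) Desired output: if a directory contains 932 files, the output would be a list of 932 file names, each with the total number of bytes read and the % of printable characters (932 lines in all).
2.) Many of the files are binary. The script needs to handle embedded binary EOL or EOF sequences.
3.) Many of the files are large, so I only want to read the first/last xx bytes. I have been trying head -c 256 and tail -c 128 to read the first 256 bytes or the last 128 bytes, respectively. The solution can either work in a pipeline or limit the number of bytes inside the Perl script.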
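The third clarification (limiting the sample to the first or last bytes inside Perl rather than with head or tail) could be sketched roughly as follows. This is a minimal illustration, not code from the thread: the helper names sample_bytes and printable_fraction, the argument order, and the use of the byte range \x20-\x7E as the definition of "printable" are all assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return up to $n bytes from the start ('first')
# or the end ('last') of the file at $path, read in binary mode.
sub sample_bytes {
    my ($path, $where, $n) = @_;

    open(my $fh, '<:raw', $path) or die "Can't open $path: $!";

    if ($where eq 'last') {
        # Position the handle $n bytes before the end (or at the start
        # if the file is shorter than $n bytes).
        my $size  = -s $fh;
        my $start = $size > $n ? $size - $n : 0;
        seek($fh, $start, 0) or die "Can't seek in $path: $!";
    }

    my $data = '';
    my $got  = read($fh, $data, $n);
    die "Can't read $path: $!" unless defined $got;
    close($fh);
    return $data;
}

# Hypothetical helper: fraction of bytes in the ASCII range \x20-\x7E.
# Returns 0 for an empty sample to avoid dividing by zero.
sub printable_fraction {
    my ($data) = @_;
    return 0 unless length $data;
    my $printable = ($data =~ tr/\x20-\x7E//);
    return $printable / length $data;
}

# Assumed usage: perl sample.pl first 256 FILE...
#            or: perl sample.pl last 128 FILE...
unless (caller) {
    my ($where, $n, @files) = @ARGV;
    for my $file (@files) {
        my $data = sample_bytes($file, $where, $n);
        printf "%s|%d|%.4f\n", $file, length $data, printable_fraction($data);
    }
}
```

If this were saved under a hypothetical name such as sample.pl, it could be called as find /fct/inbound/trans/ -type f -exec perl sample.pl first 256 {} +. The mode and byte count are placed before the file names because find requires {} to come immediately before the closing +.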

- Stan

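while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++}; would be better written as $cnt_n_print += ( () = m/[^[:print:]]/g ); (or, better still, with tr///, except that it does not support POSIX classes). - ysth

"Better" = faster and more concise, but it uses more memory. Potentially a lot more. (Every matched character needs a full string scalar!) - ikegami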
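The three counting styles mentioned in these comments can be compared side by side. The sketch below is only an illustration: the sample string is made up, and the tr/// variant swaps the POSIX class for an explicit ASCII range (\x20-\x7E), which is an assumption about what "printable" should mean for byte data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: 10 bytes, of which "\x00", "\x01" and "\n" are not printable.
my $data = "abc\x00\x01 def\n";

# 1. Loop over the matches, as in the question: no list of matches is built.
my $by_loop = 0;
$by_loop++ while $data =~ /[^[:print:]]/g;

# 2. Count-of idiom: assigning the match list to () and taking that
#    assignment in scalar context yields the number of matches, but the
#    whole list of matches is created first.
my $by_list = () = $data =~ /[^[:print:]]/g;

# 3. tr/// with the c (complement) flag counts every byte outside the
#    range \x20-\x7E without creating any list.
my $by_tr = ($data =~ tr/\x20-\x7E//c);

print "$by_loop $by_list $by_tr\n";   # prints "3 3 3"
```

For a sample of a few hundred or a few thousand bytes the memory difference is unlikely to matter; it becomes relevant mainly when whole large files are slurped.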

Yikes! -n on the shebang line! - Borodin

3 Answers

1

You can have find pass one argument at a time.

find /fct/inbound/trans/ -type f -exec perl script.pl {} \;

But I would keep passing multiple files at once, either through xargs or with GNU find's -exec +.

find /fct/inbound/trans/ -type f -exec perl script.pl {} +

The following snippets support both approaches.

You can keep reading line by line:

#!/usr/bin/perl

use strict;
use warnings;

my $cnt_total   = 0;
my $cnt_n_print = 0;

while (<>) {
    $cnt_total += length;
    ++$cnt_n_print while /[^[:print:]]/g;
} continue {
    if (eof) {
        my $cnt_print = $cnt_total - $cnt_n_print;
        my $prc_print = $cnt_print/$cnt_total;

        print "$ARGV: $cnt_total|$prc_print\n";

        $cnt_total   = 0;
        $cnt_n_print = 0;
    }
}

Or you can read each whole file at once:

#!/usr/bin/perl

use strict;
use warnings;

local $/;
while (<>) {
    my $cnt_n_print = 0;
    ++$cnt_n_print while /[^[:print:]]/g;

    my $cnt_total = length;
    my $cnt_print = $cnt_total - $cnt_n_print;
    my $prc_print = $cnt_print/$cnt_total;

    print "$ARGV: $cnt_total|$prc_print\n";
}

- ikegami

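Very close, but I think it will have problems with binary files, and I need to read only the first X bytes (see the clarifications above). Also, I could only get the GNU -exec form to work. Could you help update the script so that it can be used either in a Linux pipeline with head/tail, for example: a) find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl or like this: b) find /path/to/analyze/ -type f -exec perl script.pl {} first 264 + c) find /path/to/analyze/ -type f -exec perl script.pl {} last 128 + - Stan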

1

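readline (<>) is not well suited to binary files; read would be a better choice. You can loop over the files with for (@ARGV) and open them yourself. - ikegami

Could you provide an example that is compatible with the output of the find command? I found this reference, but I am still having trouble: Perl file handling: opening, reading, writing, and closing files. - Stan

Sorry, I can't help you with a problem you haven't described. - ikegami

Thanks for your help. I have posted my working solution based on your suggestions. - Stan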

0

Here is my working solution, based on the feedback.

I would appreciate any further feedback on form or on more efficient approaches:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # This program receives a file path and name.
    # The program attempts to read the first 2000 bytes.
    # The output is a list of files, the number of bytes
    # actually read and the percent of the bytes that are
    # ASCII "printable" aka [\x20-\x7E].

    my ($data, $n_bytes, $file_name, $cnt_n_print, $cnt_print, $prc_print);

    # loop through each file
    foreach(@ARGV) {
       $file_name = shift or die "Pass the file name on the command line.\n";

       # open the file read only with "<" in "<$file_name"
       open(FILE, "<$file_name") or die "Can't open $file_name: $!";

       # open each file in binary mode to handle non-printable characters
       binmode FILE;

       # try to read 2000 bytes from FILE, save the results in $data and the
       # actual number of bytes read in $n_bytes
       $n_bytes = read FILE, $data, 2000;

       $cnt_n_print = 0;
       $cnt_print = 0;

       # count the number of non-printable characters
       ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);

       $cnt_print = $n_bytes - $cnt_n_print;
       $prc_print = $cnt_print/$n_bytes;

       print "$file_name|$n_bytes|$prc_print\n";
       close(FILE);
    }

Here is an example of how to call the script above:

    find /some/path/to/files/ -type f -exec perl this_script.pl {} +

Here is a list of references I found useful:

POSIX bracket expressions
 Opening files in binary mode
 The read function
 Opening a file read-only

- Stan

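In further testing, I found that when this solution is called with the "find" command listed above, it skips some of the files in the directory. For example, one directory contains 39 files, but the script only outputs information for 20 of them. If I run the script on each file individually, it also works fine for the 19 files that were skipped under "find". Any idea how to make the script process every file in the directory? - Stan

If I call the Perl script from a shell loop, it processes all of the files: find /some/path/to/files/ -type f -print | while read filename; do perl /path/to/this_script.pl "$filename"; done. So what is the right way to do this? @mob @ikegami @ysth - Stan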
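The numbers reported here (20 of 39) are what one would expect from calling shift inside foreach(@ARGV): each pass removes the first element of the array being iterated, so the loop runs out of elements about halfway through. With one file per invocation the effect is invisible, which would explain why the shell loop works. The sketch below reproduces the pattern on a plain array; the helper names are hypothetical, and since perlsyn warns that modifying an array inside foreach is unsupported, the exact count should be read as observed behavior rather than a guarantee.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper mimicking the loop in the answer: iterate over the
# array while shift removes its first element on every pass.
sub names_seen_with_shift {
    my @args = @_;
    my @seen;
    foreach (@args) {
        my $name = shift @args;
        push @seen, $name;
    }
    return @seen;
}

# Iterating with a named loop variable and no shift visits every element.
sub names_seen_plain {
    my @args = @_;
    my @seen;
    foreach my $name (@args) {
        push @seen, $name;
    }
    return @seen;
}

my @files = map { "file$_" } 1 .. 39;
printf "with shift: %d of %d\n", scalar(names_seen_with_shift(@files)), scalar @files;
printf "plain loop: %d of %d\n", scalar(names_seen_plain(@files)),      scalar @files;
```

In the script above, the corresponding change would be to write foreach my $file_name (@ARGV) { ... } and drop the shift line, leaving the find ... -exec perl this_script.pl {} + invocation as it is.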


- mob · Accepted Answer

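The -n option wraps your entire code in a while(defined($_=<ARGV>)) { ... } block. This means that my $cnt_print and the other variable declarations are executed again for every line of input, which in effect resets all of the variables each time.

The workaround is to use global variables (declare them with our if you want to keep use strict), and not to initialize them to 0, since they would be reinitialized on every line of input. You could say: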

our $cnt_print //= 0;

if you don't want $cnt_print and its friends to be undefined on the first line of input. See also this recent question about a similar problem.
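An alternative to global variables is to drop -n and write the loop out explicitly, so that the counters are declared once before it and the result is printed once after it. The sketch below is an illustration under stated assumptions, not the accepted answer's code: it reads from an in-memory filehandle that stands in for the piped input, and it counts printable bytes with tr/// over the range \x20-\x7E instead of the [:print:] class.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the piped input: an in-memory filehandle over three lines.
# In a real script this would be STDIN or the files named in @ARGV.
my $input = "first line\nsecond\x00line\n\x01\x02\n";
open(my $in, '<', \$input) or die "Can't open in-memory handle: $!";

# Declared once, before the loop, so the counts survive from line to line.
my ($cnt_total, $cnt_print) = (0, 0);

# This is the loop that -n would otherwise wrap around the whole script.
while (defined(my $line = <$in>)) {
    $cnt_total += length $line;
    $cnt_print += ($line =~ tr/\x20-\x7E//);
}
close($in);

# Report once, after all input has been read; guard against empty input.
my $prc_print = $cnt_total ? $cnt_print / $cnt_total : 0;
print "$cnt_total|$prc_print\n";
```

For the sample input this counts 26 bytes in total, 20 of them printable.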