Command-line utility to print statistics of numbers in Linux

84
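I often end up with a file that has one number per line, and I finish by importing it into Excel to look at things like the median, standard deviation, and so on.

Is there a command-line utility in Linux that does the same thing? I usually need the mean, median, minimum, maximum, and standard deviation.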


1
This may be relevant: https://dev59.com/c3VC5IYBdhLWcg3wsTRi. - Oliver Charlesworth
Voting to close as a tool recommendation. http://stats.stackexchange.com/questions/24934/command-line-tool-to-calculate-basic-statistics-for-stream-of-values || http://serverfault.com/questions/548322/tool-to-do-statistics-in-the-linux-command-line - Ciro Santilli OurBigBook.com
http://unix.stackexchange.com/questions/13731/is-there-a-way-to-get-the-min-max-median-and-average-of-a-list-of-numbers-in - Ciro Santilli OurBigBook.com
Anyone interested in this question may also be interested in jp, a CLI tool for plotting charts. - Matt Parker
https://unix.stackexchange.com/a/202889/44236 - arun
18 Answers

8

data_hacks is a Python-based command-line utility for basic statistical analysis.

The first example on its page gives the result you are after:

$ cat /tmp/data | histogram.py
# NumSamples = 29; Max = 10.00; Min = 1.00
# Mean = 4.379310; Variance = 5.131986; SD = 2.265389
# each * represents a count of 1
    1.0000 -     1.9000 [     1]: *
    1.9000 -     2.8000 [     5]: *****
    2.8000 -     3.7000 [     8]: ********
    3.7000 -     4.6000 [     3]: ***
    4.6000 -     5.5000 [     4]: ****
    5.5000 -     6.4000 [     2]: **
    6.4000 -     7.3000 [     3]: ***
    7.3000 -     8.2000 [     1]: *
    8.2000 -     9.1000 [     1]: *
    9.1000 -    10.0000 [     1]: *
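
The bucket ranges in that output are ten equal-width slices between Min and Max (width (10 - 1) / 10 = 0.9). As a rough illustration only, the bucketing can be sketched in a few lines of Python. The function name `histogram`, the sample data, and the rule that a value on a boundary goes into the higher bucket are assumptions of this sketch, not taken from data_hacks itself.

```python
def histogram(values, buckets=10):
    """Return (low, high, count) triples for equal-width buckets over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1.0  # avoid a zero width when all values are equal
    counts = [0] * buckets
    for v in values:
        # The maximum would otherwise land one past the end, so clamp it
        # into the last bucket.
        index = min(int((v - lo) / width), buckets - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(buckets)]


sample = [1, 2, 2, 3, 10]  # made-up data, not the /tmp/data file from the answer
for low, high, count in histogram(sample, buckets=3):
    print(f"{low:10.4f} - {high:10.4f} [{count:6d}]: {'*' * count}")
```

With the made-up sample, four values fall into the first bucket and the maximum is counted in the last one.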

3

There is also simple-r, which can do almost everything R can, but with fewer keystrokes:

https://code.google.com/p/simple-r/

To compute basic descriptive statistics, type any one of the following:

r summary file.txt
r summary - < file.txt
cat file.txt | r summary -

For the mean, median, minimum, maximum, and standard deviation, the commands are:

seq 1 100 | r mean - 
seq 1 100 | r median -
seq 1 100 | r min -
seq 1 100 | r max -
seq 1 100 | r sd -

R does not get any simpler than this!


Interesting: this is a Perl wrapper around R. R is not a programming language! >:-) - Ciro Santilli OurBigBook.com

3

Using xsv:

$ echo '3 1 4 1 5 9 2 6 5 3 5 9' |tr ' ' '\n' > numbers-one-per-line.csv

$ xsv stats -n < numbers-one-per-line.csv 
field,type,sum,min,max,min_length,max_length,mean,stddev
0,Integer,53,1,9,1,1,4.416666666666667,2.5644470922381863

# mode/median/cardinality are not shown by default since they require holding the full file in memory:
$ xsv stats -n --everything < numbers-one-per-line.csv | xsv table
field  type     sum  min  max  min_length  max_length  mean               stddev              median  mode  cardinality
0      Integer  53   1    9    1           1           4.416666666666667  2.5644470922381863  4.5     5     7
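
The reason the default columns can be streamed is that count, sum, min, max, mean, and standard deviation can all be updated one value at a time, whereas an exact median or mode needs every value kept around. A minimal sketch of the streaming part, using Welford's one-pass update (this is an illustration of the idea, not xsv's actual implementation; `running_stats` is a name made up here):

```python
import math


def running_stats(numbers):
    """One pass over an iterable; returns (count, mean, population std deviation)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in numbers:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    # Dividing by count (not count - 1) gives the population form, which is
    # what reproduces the 2.5644... stddev printed for these numbers above.
    stddev = math.sqrt(m2 / count) if count else 0.0
    return count, mean, stddev


digits = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 9]  # same numbers as the xsv example
print(running_stats(digits))
```

Only three numbers are held in memory regardless of input size, so the same function works on a generator that reads a large file line by line.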

1
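Installing this with brew pulls in a lot of dependencies, which is a bit "heavyweight" for this feature. - Alex Moore-Niemi
So don't use brew? https://github.com/BurntSushi/xsv/releases has precompiled binaries for macOS, so there is no reason to install the full Rust toolchain or whatever else brew does. - unhammer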

3
#!/usr/bin/perl
#
# stdev - figure N, min, max, median, mode, mean, & std deviation
#
# pull out all the real numbers in the input
# stream and run standard calculations on them.
# they may be intermixed with other text, need
# not be on the same or different lines, and 
# can be in scientific notation (avogadro=6.02e23).
# they also admit a leading + or -.
#
# Tom Christiansen
# tchrist@perl.com

use strict;
use warnings;

use List::Util qw< min max >;

#
my $number_rx = qr{

  # leading sign, positive or negative
    (?: [+-] ? )

  # mantissa
    (?= [0123456789.] )
    (?: 
        # "N" or "N." or "N.N"
        (?:
            (?: [0123456789] +     )
            (?:
                (?: [.] )
                (?: [0123456789] * )
            ) ?
      |
        # ".N", no leading digits
            (?:
                (?: [.] )
                (?: [0123456789] + )
            ) 
        )
    )

  # exponent
    (?:
        (?: [Ee] )
        (?:
            (?: [+-] ? )
            (?: [0123456789] + )
        )
        |
    )
}x;

my $n = 0;
my $sum = 0;
my @values = ();

my %seen = ();

while (<>) {
    while (/($number_rx)/g) {
        $n++;
        my $num = 0 + $1;  # 0+ is so numbers in alternate form count as same
        $sum += $num;
        push @values, $num;
        $seen{$num}++;
    } 
} 

die "no values" if $n == 0;

my $mean = $sum / $n;

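# population variance: mean of the squares minus the square of the mean
my $sqsum = 0;
for (@values) {
    $sqsum += ( $_ ** 2 );
} 
$sqsum /= $n;
$sqsum -= ( $mean ** 2 );
$sqsum = 0 if $sqsum < 0;  # rounding can leave a tiny negative value
my $stdev = sqrt($sqsum);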

my $max_seen_count = max values %seen;
my @modes = grep { $seen{$_} == $max_seen_count } keys %seen;

my $mode = @modes == 1 
            ? $modes[0] 
            : "(" . join(", ", @modes) . ")";
$mode .= ' @ ' . $max_seen_count;

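my $median;
# the median needs the values in numeric order, not input order
my @sorted = sort { $a <=> $b } @values;
my $mid = int @sorted/2;
if (@sorted % 2) {
    $median = $sorted[ $mid ];
} else {
    $median = ($sorted[$mid-1] + $sorted[$mid])/2;
} 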

my $min = min @values;
my $max = max @values;

printf "n is %d, min is %g, max is %g\n", $n, $min, $max;
printf "mode is %s, median is %g, mean is %g, stdev is %g\n", 
    $mode, $median, $mean, $stdev;
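
For anyone who would rather not read the expanded regex, the number-extraction and mode steps of the script above can be sketched in Python roughly as follows. This is a loose counterpart under my own naming (`NUMBER_RX`, `extract_numbers`, `modes` are hypothetical), not a line-by-line port, and the sample sentence is made up.

```python
import re
from collections import Counter

NUMBER_RX = re.compile(r"""
    [+-]?                                      # optional leading sign
    (?: [0-9]+ (?: \. [0-9]* )? | \. [0-9]+ )  # N, N., N.N or .N
    (?: [Ee] [+-]? [0-9]+ )?                   # optional exponent
""", re.VERBOSE)


def extract_numbers(text):
    """Return every number found in free-form text, as floats."""
    return [float(token) for token in NUMBER_RX.findall(text)]


def modes(values):
    """Return (sorted list of the most frequent values, how often each occurs)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top


numbers = extract_numbers("took 3 tries, then 3 more; avogadro=6.02e23, delta -.5")
print(numbers, modes(numbers))
```

Converting each token with `float` plays the same role as the `0 + $1` in the Perl version: numbers written in different forms count as the same value when tallying the mode.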

2
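There is also the self-written stats (bundled with 'scut'), a Perl utility that does this. Feed it a stream of numbers on STDIN and it will try to reject non-numbers and emit the following: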
$ ls -lR | scut -f=4 | stats
Sum       3.10271e+07
Number    452
Mean      68643.9
Median    4469.5
Mode      4096
NModes    6
Min       2
Max       1.01171e+07
Range     1.01171e+07
Variance  3.03828e+11
Std_Dev   551206
SEM       25926.6
95% Conf  17827.9 to 119460
          (for a normal distribution - see skew)
Skew      15.4631
          (skew = 0 for a symmetric dist)
Std_Skew  134.212
Kurtosis  258.477
          (K=3 for a normal dist)

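It can also apply a number of transformations to the input stream, and it will emit just the undecorated value if you ask for it; i.e. 'stats --mean' returns the mean as an unlabeled floating-point number.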
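
In case the SEM and "95% Conf" rows in the output above look opaque: they are consistent with the usual normal-approximation formulas, SEM = Std_Dev / sqrt(Number) and mean ± 1.96 × SEM. That the tool uses exactly the factor 1.96 is my inference from the printed figures, not something checked in its source. A small sketch (the function name `sem_and_ci` is made up):

```python
import math


def sem_and_ci(mean, std_dev, n, z=1.96):
    """Standard error of the mean and a normal-approximation confidence interval."""
    sem = std_dev / math.sqrt(n)
    return sem, (mean - z * sem, mean + z * sem)


# Figures copied from the stats output above (already rounded by the tool).
sem, (low, high) = sem_and_ci(mean=68643.9, std_dev=551206, n=452)
print(f"SEM {sem:.1f}, 95% conf {low:.1f} to {high:.1f}")
```

The results agree with the printed SEM and interval to within the rounding of the inputs. As the output itself warns, the interval only means much for roughly normal data, which a skew of 15 is not.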

2

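Another tool: tsv-summarize, from eBay's TSV Utilities. It supports summary statistics such as minimum, maximum, mean, median, and standard deviation, and is intended for large datasets. Example: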

$ seq 10 | tsv-summarize --min 1 --max 1 --median 1 --stdev 1
1    10    5.5    3.0276503541

Disclaimer: I am the author.

0
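The selected answer uses R. With the same tool, I find a script easier to use than a one-liner, because it can be modified more comfortably to add specific statistics or to change the output format.
Given this file data.txt: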
1
2
3
4
5
6
7
8
9
10

and this basic-stats script on the $PATH:
#!/usr/bin/env Rscript

# Build a numeric vector.
x <- as.numeric(readLines("stdin"))

# Custom basic statistics.
basic_stats <- data.frame(
    N = length(x), min = min(x), mean = mean(x), median = median(x), stddev = sd(x),
    percentile_95 = quantile(x, c(.95)), percentile_99 = quantile(x, c(.99)),
    max = max(x))

# Print output.
print(round(basic_stats, 3), row.names = FALSE, right = FALSE)

running basic-stats < data.txt prints the following to stdout:
 N  min mean median stddev percentile_95 percentile_99 max
 10 1   5.5  5.5    3.028  9.55          9.91          10 

Replacing the last two lines of the script with the following makes the formatting look a bit nicer:
# Print output. Tabular formatting is done by the `column` command.
temp_file <- tempfile("basic_stats_", fileext = ".csv")
write.csv(round(basic_stats, 3), file = temp_file, row.names = FALSE, quote = FALSE)
system(paste("column -s, -t", temp_file))
. <- file.remove(temp_file)

This is the resulting output, with two spaces between columns (instead of one):
N   min  mean  median  stddev  percentile_95  percentile_99  max
10  1    5.5   5.5     3.028   9.55           9.91           10
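
If R is not installed, roughly the same summary can be sketched with Python's standard library. This is an approximation of the script above rather than an exact equivalent; `basic_stats` here is my own function, and `statistics.quantiles` needs Python 3.8 or later and at least two data points.

```python
import io
import statistics


def basic_stats(stream):
    """Summarize a text stream that holds one number per line."""
    x = sorted(float(line) for line in stream if line.strip())
    # method="inclusive" interpolates between order statistics, which
    # reproduces the 9.55 and 9.91 shown above for the values 1..10.
    percentiles = statistics.quantiles(x, n=100, method="inclusive")
    return {
        "N": len(x), "min": x[0], "mean": statistics.mean(x),
        "median": statistics.median(x), "stddev": statistics.stdev(x),
        "percentile_95": percentiles[94], "percentile_99": percentiles[98],
        "max": x[-1],
    }


data = io.StringIO("\n".join(str(i) for i in range(1, 11)))  # stand-in for data.txt
for name, value in basic_stats(data).items():
    print(f"{name:14s}{round(value, 3)}")
```

Passing `sys.stdin` instead of the `StringIO` stand-in would let the same function read piped input.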

0
Not enough solutions? :-) I'd like to add the gnuplot stats command. Gnuplot is a very fast data analysis tool: plotting, regression...
seq 10 | gnuplot -e "stats '-' u 1"

* FILE: 
  Records:           10
  Out of range:       0
  Invalid:            0
  Header records:     0
  Blank:              0
  Data Blocks:        1

* COLUMN: 
  Mean:               5.5000
  Std Dev:            2.8723
  Sample StdDev:      3.0277
  Skewness:           0.0000
  Kurtosis:           1.7758
  Avg Dev:            2.5000
  Sum:               55.0000
  Sum Sq.:          385.0000

  Mean Err.:          0.9083
  Std Dev Err.:       0.6423
  Skewness Err.:      0.7746
  Kurtosis Err.:      1.5492

  Minimum:            1.0000 [ 0]
  Maximum:           10.0000 [ 9]
  Quartile:           3.0000 
  Median:             5.5000 
  Quartile:           8.0000 
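
Note that gnuplot prints two standard deviations. The difference is only the divisor: "Std Dev" divides the sum of squared deviations by n, "Sample StdDev" by n - 1. A quick way to check which one a given tool reports is to compare against both forms, sketched here with Python's standard library on the same input:

```python
import statistics

values = list(range(1, 11))  # same input as `seq 10`

population_sd = statistics.pstdev(values)  # divides by n
sample_sd = statistics.stdev(values)       # divides by n - 1

print(f"Std Dev (population): {population_sd:.4f}")  # 2.8723
print(f"Sample StdDev:        {sample_sd:.4f}")      # 3.0277
```

The 3.0276503541 from tsv-summarize and the 3.028 from the R script in the other answers are the sample form.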
