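I often come across files with just one number per line, and I end up importing them into Excel to look at things like the median and standard deviation.
Is there a command-line utility in Linux that can do the same? I usually need the mean, median, minimum, maximum, and standard deviation.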
data_hacks is a Python-based command-line utility for basic statistical analysis.
The first example on that page produces the desired results:
$ cat /tmp/data | histogram.py
# NumSamples = 29; Max = 10.00; Min = 1.00
# Mean = 4.379310; Variance = 5.131986; SD = 2.265389
# each * represents a count of 1
1.0000 - 1.9000 [ 1]: *
1.9000 - 2.8000 [ 5]: *****
2.8000 - 3.7000 [ 8]: ********
3.7000 - 4.6000 [ 3]: ***
4.6000 - 5.5000 [ 4]: ****
5.5000 - 6.4000 [ 2]: **
6.4000 - 7.3000 [ 3]: ***
7.3000 - 8.2000 [ 1]: *
8.2000 - 9.1000 [ 1]: *
9.1000 - 10.0000 [ 1]: *
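Note that the histogram.py output above does not include a median. If Python is available anyway, a few lines with the standard library's statistics module cover the whole list from the question. This is only a minimal sketch and not part of data_hacks; the function name summarize is made up for illustration, and it assumes one number per line.

```python
import statistics


def summarize(lines):
    """Return basic statistics for an iterable of text lines, one number per line."""
    # Blank lines are skipped; anything else must parse as a float.
    values = [float(line) for line in lines if line.strip()]
    if not values:
        raise ValueError("no values")
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Sample standard deviation (n - 1 denominator); needs at least two values.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Example with the numbers 1..10, one per line.
example = summarize(f"{i}\n" for i in range(1, 11))
for name, value in example.items():
    print(f"{name}\t{value:g}")
```

To use it on real input, pass sys.stdin (after import sys) instead of the generated lines.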
There is also simple-r, which can do almost everything R can, but with fewer keystrokes:
https://code.google.com/p/simple-r/
To compute basic descriptive statistics, type one of the following:
r summary file.txt
r summary - < file.txt
cat file.txt | r summary -
For the mean, median, minimum, maximum, and standard deviation individually, the commands are:
seq 1 100 | r mean -
seq 1 100 | r median -
seq 1 100 | r min -
seq 1 100 | r max -
seq 1 100 | r sd -
R doesn't get any simpler than this!
Using xsv:
$ echo '3 1 4 1 5 9 2 6 5 3 5 9' |tr ' ' '\n' > numbers-one-per-line.csv
$ xsv stats -n < numbers-one-per-line.csv
field,type,sum,min,max,min_length,max_length,mean,stddev
0,Integer,53,1,9,1,1,4.416666666666667,2.5644470922381863
# mode/median/cardinality not shown by default since it requires storing full file in memory:
$ xsv stats -n --everything < numbers-one-per-line.csv | xsv table
field  type     sum  min  max  min_length  max_length  mean               stddev              median  mode  cardinality
0      Integer  53   1    9    1           1           4.416666666666667  2.5644470922381863  4.5     5     7
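The comment in the xsv example says that median, mode, and cardinality need the whole file in memory. The other columns do not, because they can be updated one value at a time. Here is a rough Python sketch of that idea, using Welford's online update for the variance. It is only an illustration: I am not claiming this is how xsv is implemented, and RunningStats is a made-up name.

```python
import math


class RunningStats:
    """Count, sum, min, max, mean and population stddev in a single pass."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.minimum = math.inf
        self.maximum = -math.inf
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        self.total += x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        # Welford's update: adjust the mean first, then accumulate the deviation.
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def pstdev(self):
        # Population standard deviation (divides by n, not n - 1).
        return math.sqrt(self._m2 / self.n) if self.n else math.nan


running = RunningStats()
for value in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 9]:
    running.add(value)
print(running.total, running.minimum, running.maximum, running.mean, running.pstdev())
```

On the twelve numbers from the example, the population form (dividing by n) is the one that agrees with the stddev column xsv printed. A median has no update rule like this, which is why it needs all the values.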
Installing this with brew pulls in a lot of dependencies, which makes it a bit "heavyweight" for this job. - Alex Moore-Niemi
#!/usr/bin/perl
#
# stdev - figure N, min, max, median, mode, mean, & std deviation
#
# pull out all the real numbers in the input
# stream and run standard calculations on them.
# they may be intermixed with other text, need
# not be on the same or different lines, and
# can be in scientific notation (avogadro=6.02e23).
# they also admit a leading + or -.
#
# Tom Christiansen
# tchrist@perl.com
use strict;
use warnings;
use List::Util qw< min max >;
#
my $number_rx = qr{
    # leading sign, positive or negative
    (?: [+-] ? )
    # mantissa
    (?= [0123456789.] )
    (?:
        # "N" or "N." or "N.N"
        (?:
            (?: [0123456789] + )
            (?:
                (?: [.] )
                (?: [0123456789] * )
            ) ?
        |
        # ".N", no leading digits
            (?:
                (?: [.] )
                (?: [0123456789] + )
            )
        )
    )
    # exponent, optional
    (?:
        (?: [Ee] )
        (?:
            (?: [+-] ? )
            (?: [0123456789] + )
        )
    |
    )
}x;
my $n = 0;
my $sum = 0;
my @values = ();
my %seen = ();
while (<>) {
    while (/($number_rx)/g) {
        $n++;
        my $num = 0 + $1; # 0+ is so numbers in alternate form count as same
        $sum += $num;
        push @values, $num;
        $seen{$num}++;
    }
}
die "no values" if $n == 0;
my $mean = $sum / $n;
# population standard deviation: sqrt(mean of squares - square of mean)
my $sqsum = 0;
for (@values) {
    $sqsum += ( $_ ** 2 );
}
$sqsum /= $n;
$sqsum -= ( $mean ** 2 );
$sqsum = 0 if $sqsum < 0; # guard against tiny negative rounding errors
my $stdev = sqrt($sqsum);
my $max_seen_count = max values %seen;
my @modes = grep { $seen{$_} == $max_seen_count } keys %seen;
my $mode = @modes == 1
    ? $modes[0]
    : "(" . join(", ", @modes) . ")";
$mode .= ' @ ' . $max_seen_count;
# the median is only meaningful on values sorted numerically
my @sorted = sort { $a <=> $b } @values;
my $median;
my $mid = int(@sorted / 2);
if (@sorted % 2) {
    $median = $sorted[ $mid ];
} else {
    $median = ($sorted[$mid-1] + $sorted[$mid])/2;
}
my $min = min @values;
my $max = max @values;
# %g for max as well, so non-integer maxima are not truncated
printf "n is %d, min is %g, max is %g\n", $n, $min, $max;
printf "mode is %s, median is %g, mean is %g, stdev is %g\n",
    $mode, $median, $mean, $stdev;
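For comparison, the same idea (pull every number out of arbitrary text, then compute median and mode) can be sketched in Python. This is not a port of the Perl script, just a rough equivalent under my own assumptions: the pattern below is a simplified version of the regex above, and extract_numbers is a made-up helper name.

```python
import re
import statistics

# Optional sign, then "N", "N.", "N.N" or ".N", then an optional exponent.
NUMBER_RX = re.compile(r"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[Ee][+-]?[0-9]+)?")


def extract_numbers(text):
    """Return every number found in text as a float, in order of appearance."""
    return [float(match) for match in NUMBER_RX.findall(text)]


sample_text = "size 4096, then 2 and 4096 again; avogadro=6.02e23"
numbers = extract_numbers(sample_text)

# statistics.median sorts internally, so the input order does not matter.
print("median:", statistics.median(numbers))
# multimode returns every value tied for the highest count.
print("modes:", statistics.multimode(numbers))
```

Converting each match with float() plays the same role as the 0 + $1 in the Perl version: 1, 1.0 and 1e0 all count as the same value for the mode.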
$ ls -lR | scut -f=4 | stats
Sum 3.10271e+07
Number 452
Mean 68643.9
Median 4469.5
Mode 4096
NModes 6
Min 2
Max 1.01171e+07
Range 1.01171e+07
Variance 3.03828e+11
Std_Dev 551206
SEM 25926.6
95% Conf 17827.9 to 119460
(for a normal distribution - see skew)
Skew 15.4631
(skew = 0 for a symmetric dist)
Std_Skew 134.212
Kurtosis 258.477
(K=3 for a normal dist)
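Another tool: tsv-summarize, from eBay's tsv utilities. It supports summary statistics such as min, max, mean, median, and standard deviation, and it is suitable for large data sets. Example: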
$ seq 10 | tsv-summarize --min 1 --max 1 --median 1 --stdev 1
1 10 5.5 3.0276503541
data.txt:
1
2
3
4
5
6
7
8
9
10
#!/usr/bin/env Rscript
# Build a numeric vector.
x <- as.numeric(readLines("stdin"))
# Custom basic statistics.
basic_stats <- data.frame(
N = length(x), min = min(x), mean = mean(x), median = median(x), stddev = sd(x),
percentile_95 = quantile(x, c(.95)), percentile_99 = quantile(x, c(.99)),
max = max(x))
# Print output.
print(round(basic_stats, 3), row.names = FALSE, right = FALSE)
Running basic-stats < data.txt prints the following to standard output (stdout):
N min mean median stddev percentile_95 percentile_99 max
10 1 5.5 5.5 3.028 9.55 9.91 10
# Print output. Tabular formatting is done by the `column` command.
temp_file <- tempfile("basic_stats_", fileext = ".csv")
write.csv(round(basic_stats, 3), file = temp_file, row.names = FALSE, quote = FALSE)
system(paste("column -s, -t", temp_file))
. <- file.remove(temp_file)
N min mean median stddev percentile_95 percentile_99 max
10 1 5.5 5.5 3.028 9.55 9.91 10
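If you want to sanity-check the percentile columns outside of R, the Python standard library can reproduce them. A small sketch, assuming that the "inclusive" method of statistics.quantiles uses the same linear interpolation as R's default quantile() (it does give the same numbers for this input):

```python
import statistics

values = [float(i) for i in range(1, 11)]

# n=100 yields 99 cut points; the k-th percentile is at index k - 1.
cuts = statistics.quantiles(values, n=100, method="inclusive")
p95 = cuts[94]
p99 = cuts[98]

print(f"percentile_95 = {p95:.2f}")  # 9.55
print(f"percentile_99 = {p99:.2f}")  # 9.91
```

The default method of statistics.quantiles is "exclusive", which would give different values here, so the method argument matters.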
seq 10 | gnuplot -e "stats '-' u 1"
* FILE:
Records: 10
Out of range: 0
Invalid: 0
Header records: 0
Blank: 0
Data Blocks: 1
* COLUMN:
Mean: 5.5000
Std Dev: 2.8723
Sample StdDev: 3.0277
Skewness: 0.0000
Kurtosis: 1.7758
Avg Dev: 2.5000
Sum: 55.0000
Sum Sq.: 385.0000
Mean Err.: 0.9083
Std Dev Err.: 0.6423
Skewness Err.: 0.7746
Kurtosis Err.: 1.5492
Minimum: 1.0000 [ 0]
Maximum: 10.0000 [ 9]
Quartile: 3.0000
Median: 5.5000
Quartile: 8.0000
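The gnuplot output lists two standard deviations, which explains why the tools in this thread do not all agree. "Std Dev" divides by n (population) and "Sample StdDev" divides by n - 1. A quick check in Python, just as an illustration:

```python
import statistics

values = list(range(1, 11))

population_sd = statistics.pstdev(values)  # divides by n
sample_sd = statistics.stdev(values)       # divides by n - 1

print(f"population: {population_sd:.4f}")  # 2.8723
print(f"sample:     {sample_sd:.4f}")      # 3.0277
```

Judging by the outputs shown above, tsv-summarize and R's sd() report the sample value (3.0277 for 1..10), while the Perl script divides by n and so reports the population value.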
You might also be interested in jp, a CLI tool for plotting charts. - Matt Parker