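BEGIN {
    # w holds space-separated weights, e.g. "60 20 20".
    split(w, weight)
    total = 0
    # Turn the weights into cumulative sums. Walk the indices in numeric
    # order, because "for (i in weight)" visits them in unspecified order.
    for (i = 1; i in weight; i++) {
        weight[i] += total
        total = weight[i]
    }
}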
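# At the first line of every file after the first, flush the previous file.
FNR == 1 {
    if (NR != 1) {
        write_partitioned_files(weight, a)
        split("", a, ":")    # clear the line buffer
    }
    name = FILENAME
}
# Buffer every line of the current file.
{ a[FNR] = $0 }
# Flush the last file.
END {
    write_partitioned_files(weight, a)
}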
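function write_partitioned_files(weight, a,    threshold, size, i, l, part) {
    size = length(a)
    # threshold[i] is the first line number that no longer belongs to part i.
    for (i = 1; i in weight; i++) {
        threshold[i] = int((size * weight[i] / total) + 0.5) + 1
    }
    l = 1
    part = 0
    for (i = 1; i in threshold; i++) {
        close(out)
        out = name ".part" ++part
        for (; l < threshold[i]; l++) {
            # Dry run: prints the line and its target file to stdout.
            print a[l] " > " out
        }
    }
}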
Invoke it like this:
awk -v w="60 20 20" -f above_script.awk file_to_split1 file_to_split2 ...
Replace `" > "` with `>` in the script to actually write the partitioned files.
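The difference between the two forms can be checked in isolation. The sketch below uses a throwaway directory and a made-up output name (`demo.part1`), neither of which comes from the script. With `" > "` the arrow is ordinary text inside the printed string, whereas a bare `>` is awk's output redirection.

```sh
demo_dir=$(mktemp -d)
out_file="$demo_dir/demo.part1"   # hypothetical target, for illustration only

# Dry run: " > " is an ordinary string, so awk prints to stdout and
# creates no file.
dry=$(printf 'a\nb\n' | awk -v out="$out_file" '{ print $0 " > " out }')
echo "$dry"

# Real run: > redirects the output of print into the file named by out.
printf 'a\nb\n' | awk -v out="$out_file" '{ print $0 > out }'
cat "$out_file"
```

The dry run shows each line followed by the file it would go to, which makes it a cheap way to check the partition boundaries before anything is written.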
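The variable `w` expects space-separated numbers. Each file is partitioned in that proportion. For example, `"2 1 1 3"` splits a file into four parts whose line counts are in the ratio 2:1:1:3. Any sequence of numbers that adds up to 100 can be used as percentages.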
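To make the proportions concrete, here is a minimal sketch that reproduces only the threshold arithmetic from the script, for a hypothetical 14-line file. The helper name `show_ranges` and its arguments are illustrative and not part of the script.

```sh
# show_ranges SIZE WEIGHTS  (hypothetical helper, not part of the script)
show_ranges() {
    awk -v size="$1" -v w="$2" 'BEGIN {
        n = split(w, weight)
        for (i = 1; i <= n; i++) total += weight[i]
        first = 1
        for (i = 1; i <= n; i++) {
            cum += weight[i]
            # Same rounding as the script: first line NOT in part i.
            limit = int(size * cum / total + 0.5) + 1
            printf "part%d: lines %d-%d\n", i, first, limit - 1
            first = limit
        }
    }'
}

show_ranges 14 "2 1 1 3"
```

For 14 lines and weights `2 1 1 3` this reports lines 1-4, 5-6, 7-8 and 9-14, that is, parts of 4, 2, 2 and 6 lines. When the file size is not a multiple of the weight total, the rounding decides which part receives the extra lines.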
For large files, the array `a` may consume too much memory. If that is a problem, here is an alternative `awk` script:
BEGIN {
    split(w, weight)
    # Cumulative weights, visited in numeric index order.
    for (i = 1; i in weight; i++) {
        total += weight[i]
        weight[i] = total
    }
}
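FNR == 1 {
    # Quote the file name for the shell; gensub() is a gawk extension.
    name = gensub("'", "'\"'\"'", "g", FILENAME)
    # Read from stdin so that wc prints only the count, and close the
    # pipe so that it is not left open for every input file.
    cmd = "wc -l < '" name "'"
    cmd | getline size
    close(cmd)
    split("", threshold, ":")
    for (i = 1; i in weight; i++) {
        threshold[i] = int((size * weight[i] / total) + 0.5) + 1
    }
    part = 1
    close(out)
    out = FILENAME ".part" part
}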
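{
    # Advance to the next part once its threshold is reached. A loop is
    # used so that empty parts are skipped, and the last part absorbs any
    # final line that wc -l did not count (no trailing newline).
    while ((part + 1) in threshold && FNR >= threshold[part]) {
        close(out)
        out = FILENAME ".part" ++part
    }
    # Dry run: prints the line and its target file to stdout.
    print $0 " > " out
}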
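This approach makes two passes over each file: one to count the lines (via `wc -l`) and one to write the partitioned files. Invocation and effect are the same as for the first script.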
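The two-pass idea can also be sketched without the external `wc -l` call, by naming the same file twice on the command line so that awk counts the lines itself on the first pass. This is a substitute technique and not the method above. It assumes one input file per invocation, and the helper name `split_two_pass` is hypothetical. Unlike the scripts above, it writes the part files directly instead of doing a dry run.

```sh
# split_two_pass WEIGHTS FILE  (hypothetical helper; one file per call)
split_two_pass() {
    awk -v w="$1" '
        BEGIN {
            n = split(w, weight)
            for (i = 1; i <= n; i++) { total += weight[i]; weight[i] = total }
        }
        NR == FNR { size++; next }      # first pass: only count lines
        FNR == 1 {                      # second pass starts: fix thresholds
            for (i = 1; i <= n; i++)
                threshold[i] = int(size * weight[i] / total + 0.5) + 1
            part = 1
            out = FILENAME ".part" part
        }
        {
            while (part < n && FNR >= threshold[part]) {
                close(out)
                out = FILENAME ".part" ++part
            }
            print > out
        }
    ' "$2" "$2"
}

# Usage on a throwaway 10-line file.
work=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 10; i++) print "line " i }' > "$work/data"
split_two_pass "60 20 20" "$work/data"
wc -l "$work"/data.part*
```

With weights `60 20 20` the 10-line file should end up as parts of 6, 2 and 2 lines, and concatenating the parts in order should reproduce the input, which can be checked with `cat` and `cmp`.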