How to split a file by percentage of the number of lines?

Question

11

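How can I split a file by a percentage of its number of lines?

Say I want to split my file into three portions (60%/20%/20%). I could do this manually, -_- :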

$ wc -l brown.txt 
57339 brown.txt

$ bc <<< "57339 / 10 * 6"
34398
$ bc <<< "57339 / 10 * 2"
11466
$ bc <<< "34398 + 11466"
45864
$ bc <<< "34398 + 11466 + 11475"
57339

$ head -n 34398 brown.txt > part1.txt
$ sed -n 34399,45864p brown.txt > part2.txt
$ sed -n 45865,57339p brown.txt > part3.txt
$ wc -l part*.txt
   34398 part1.txt
   11466 part2.txt
   11475 part3.txt
   57339 total

But I'm sure there's a better way!

- alvas

2

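Could you elaborate on the requirement for "credible and/or official sources"? Why are the high-quality answers you have already received not sufficient? - Dario

Wrong bounty message; it should have been "draw attention". - alvas

Do the percentages need to be very precise? I'm guessing you have a lot of lines, is that right? - Mark Setchell

@marksetchell, as precise as possible would be best. But if 1-2 lines fall off because of floating-point rounding, that is acceptable. And yes, my real data is large, in the millions of lines. - alvas

As long as it doesn't need compiling and runs easily in a Unix shell, it should be fine. - alvas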

6 Answers

9

$ cat file
a
b
c
d
e

$ cat tst.awk
BEGIN {
    split(pcts,p)
    nrs[1]
    for (i=1; i in p; i++) {
        pct += p[i]
        nrs[int(size * pct / 100) + 1]
    }
}
NR in nrs{ close(out); out = "part" ++fileNr ".txt" }
{ print $0 " > " out }

$ awk -v size=$(wc -l < file) -v pcts="60 20 20" -f tst.awk file
a > part1.txt
b > part1.txt
c > part1.txt
d > part2.txt
e > part3.txt

Change " > " to > to actually write to the output files.

- Ed Morton - SO stop bullying

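What do pct and nrs mean? - hek2mgl

pct stands for percent. nrs is the list of line/record numbers (NRs) at which the output file number changes. - Ed Morton

Short and sweet, but it has some problems with small percentages/files. Consider a file with 2 lines and pcts="10 90". The script will write both lines to part1.txt. - Socowi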
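
The boundary arithmetic behind that remark can be traced with a small shell sketch. The function name first_lines is hypothetical and not part of the answer; it only mirrors the int(size * pct / 100) + 1 expression from tst.awk using integer arithmetic, and prints the line numbers at which the output file would change.

```shell
#!/bin/bash
# first_lines SIZE PCT...   (hypothetical helper, for illustration only)
# Prints the line numbers at which tst.awk switches output files:
# line 1, then int(SIZE * cumulative_pct / 100) + 1 for each percentage.
first_lines() {
    local size=$1 pct=0 p
    shift
    local out="1"
    for p in "$@"; do
        pct=$((pct + p))
        out+=" $((size * pct / 100 + 1))"
    done
    printf '%s\n' "$out"
}

first_lines 5 60 20 20   # prints: 1 4 5 6
first_lines 2 10 90      # prints: 1 1 3
```

In the second call the first two boundaries coincide at line 1, and the last one lies past the end of the file, so only one output file is ever opened and both lines land in part1.txt.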

3

Usage

The following bash script lets you specify the percentages, for example:

./split.sh brown.txt 60 20 20

You can also use the placeholder . to fill up the percentages to 100%.

./split.sh brown.txt 60 20 .

The split files are written to:

part1-brown.txt
part2-brown.txt
part3-brown.txt

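The script always generates as many part files as there are numbers specified. If the percentages sum to 100, cat part* will always reproduce the original file (no duplicated or missing lines).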
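
One way to check that claim is to concatenate the parts in numeric order and compare the result with the original. The following is a minimal sketch; the function name check_parts is hypothetical, and it assumes the partN-FILE naming used by split.sh. The parts are listed explicitly by number because a plain part* glob would sort part10 before part2 once there are ten or more parts.

```shell
#!/bin/bash
# check_parts FILE COUNT   (hypothetical helper, for illustration only)
# Succeeds if part1-FILE .. partCOUNT-FILE, concatenated in numeric
# order, are byte-for-byte identical to FILE.
check_parts() {
    local file=$1 count=$2 i
    for ((i = 1; i <= count; i++)); do
        cat "part$i-$file"
    done | cmp -s - "$file"
}
```

After running ./split.sh brown.txt 60 20 20, a call such as check_parts brown.txt 3 && echo "parts match the original" would confirm the round trip.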

Bash script: split.sh

#! /bin/bash

file="$1"
fileLength=$(wc -l < "$file")
shift

part=1
percentSum=0
currentLine=1
for percent in "$@"; do
        [ "$percent" == "." ] && ((percent = 100 - percentSum)) 
        ((percentSum += percent))
        if ((percent < 0 || percentSum > 100)); then
                echo "invalid percentage" 1>&2
                exit 1
        fi
        ((nextLine = fileLength * percentSum / 100))
        if ((nextLine < currentLine)); then
                printf "" # create empty file
        else
                sed -n "$currentLine,$nextLine"p "$file"
        fi > "part$part-$file"
        ((currentLine = nextLine + 1))
        ((part++))
done

- Socowi

1

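I simply followed your lead and turned what you do manually into a script. It may not be the fastest or "best" solution, but if you understand what it is doing now and can "scriptify" it, you may be better off if you ever need to maintain it.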

#!/bin/bash

#  thisScript.sh  yourfile.txt  20 50 10 20

YOURFILE=$1
shift

# changed to cat | wc so I don't have to remove the filename which comes from
# wc -l
LINES=$(cat "$YOURFILE" | wc -l)

startpct=0;
PART=1;
for pct in "$@"
do
  # I am assuming that each parameter is on top of the last
  # so   10 30 10   would become 10, 10+30 = 40, 10+30+10 = 50, ...
  endpct=$( echo "$startpct + $pct" | bc)  

  # your math but changed parts of 100 instead of parts of 10.
  #  change bc <<< to echo "..." | bc 
  #  so that one can capture the output into a bash variable.
  FIRSTLINE=$( echo "$LINES * $startpct / 100 + 1" | bc )
  LASTLINE=$( echo "$LINES * $endpct / 100" | bc )

  # use sed every time because the special case for head
  # doesn't really help performance.
  sed -n "${FIRSTLINE},${LASTLINE}p" "$YOURFILE" > "part${PART}.txt"
  ((PART++))
  startpct=$endpct
done

# get the rest if the percentages don't add up to 100%
if [[ $( echo "$startpct < 100" | bc ) -gt 0 ]] ; then
   sed -n "$((LASTLINE + 1)),\$p" "$YOURFILE" > "part${PART}.txt"
fi

wc -l part*.txt

- Mike Wodarczyk

1

BEGIN {
    # iterate by index: the order of "for (i in array)" is unspecified in awk
    n = split(w, weight)
    total = 0
    for (i = 1; i <= n; i++) {
        weight[i] += total   # cumulative sum
        total = weight[i]
    }
}
FNR == 1 {
    if (NR!=1) {
        # a new input file starts: write out the previous one
        write_partitioned_files(weight,a)
        split("",a,":") #empty a portably
    }
    name=FILENAME
}
{a[FNR]=$0}
END {
    write_partitioned_files(weight,a)
}
function write_partitioned_files(weight, a) {
    split("",threshold,":")
    size = length(a)
    # threshold[i] is the first line that no longer belongs to part i
    for (i = 1; i <= n; i++) {
        threshold[i] = int((size * weight[i] / total)+0.5)+1
    }
    l=1
    part=0
    for (i = 1; i <= n; i++) {
        close(out)
        out = name ".part" ++part
        for (;l<threshold[i];l++) {
            print a[l] " > " out 
        }
    }
}

Invoke as:

awk -v w="60 20 20" -f above_script.awk file_to_split1 file_to_split2 ...

Replace " > " with > in the script to actually write the partitioned files.

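The variable w expects space-separated numbers. Files are partitioned in proportion to those numbers. For example, "2 1 1 3" partitions a file into four parts whose line counts are in the ratio 2:1:1:3. Any sequence of numbers that adds up to 100 can be used as percentages.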
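
To make the proportional partitioning concrete, here is a small sketch of the same cumulative rounding done with shell integer arithmetic. The function name ratio_counts is hypothetical; it illustrates the arithmetic only, is not part of the awk script, and prints how many lines each part would receive.

```shell
#!/bin/bash
# ratio_counts SIZE WEIGHT...   (hypothetical helper, for illustration only)
# Prints the number of lines each part receives when SIZE lines are
# split in proportion to the given weights.
ratio_counts() {
    local size=$1 total=0 cum=0 prev=0 end w out=""
    shift
    for w in "$@"; do total=$((total + w)); done
    for w in "$@"; do
        cum=$((cum + w))
        # last line of this part: size * cum / total, rounded to nearest
        end=$(( (2 * size * cum + total) / (2 * total) ))
        out+="${out:+ }$((end - prev))"
        prev=$end
    done
    printf '%s\n' "$out"
}

ratio_counts 700 2 1 1 3      # prints: 200 100 100 300
ratio_counts 57339 60 20 20   # prints: 34403 11468 11468
```

Because each boundary is rounded from the cumulative sum rather than from the individual weight, the counts always add up to the total number of lines.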

For large files the array a may consume too much memory. If that is a problem, here is an alternative awk script:

BEGIN {
    # iterate by index: the order of "for (i in array)" is unspecified in awk
    n = split(w, weight)
    for (i = 1; i <= n; i++) {
        total += weight[i]; weight[i] = total #cumulative sum
    }
}
FNR == 1 {
    #get number of lines. take care of single quotes in filename.
    #(gensub is a GNU awk extension)
    name = gensub("'", "'\"'\"'", "g", FILENAME)
    cmd = "wc -l < '" name "'"
    cmd | getline size
    close(cmd)

    # threshold[i] is the first line of part i+1
    for (i = 1; i <= n; i++) {
        threshold[i] = int((size * weight[i] / total)+0.5)+1
    }

    part=1; close(out); out = FILENAME ".part" part
}
{
    # skip over any parts that end before the current line
    while (part < n && FNR>=threshold[part]) {
        close(out); out = FILENAME ".part" ++part
    }
    print $0 " > " out 
}

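This makes two passes over each file: the first counts the lines (via wc -l), and the second writes the partitioned files. Invocation and effect are the same as for the first method.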

- pii_ke

1

I like Benjamin W.'s csplit solution, but it's so long...

#!/bin/bash
# usage ./splitpercs.sh file 60 20 20
n=`wc -l <"$1"` || exit 1
echo $* | tr ' ' '\n' | tail -n+2 | head -n`expr $# - 1` |
  awk -v n=$n 'BEGIN{r=1} {r+=n*$0/100; if(r > 1 && r < n){printf "%d\n",r}}' |
  uniq | xargs csplit -sfpart "$1"

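(The if(r > 1 && r < n) and uniq bits are there to prevent creating empty files or producing strange behavior for small percentages, files with few lines, or percentages that add up to more than 100.)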

- webb

- Benjamin W. · Accepted Answer

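There is a utility that takes line numbers as arguments, each specifying the first line of a new file: csplit. This is a wrapper around its POSIX version: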

#!/bin/bash

usage () {
    printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
}

# Collect csplit options
while getopts "ksf:n:" opt; do
    case "$opt" in
        k|s) args+=(-"$opt") ;;           # k: no remove on error, s: silent
        f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in number
        *) usage; exit 1 ;;
    esac
done
shift $(( OPTIND - 1 ))

fname=$1
shift
ratios=("$@")

len=$(wc -l < "$fname")

# Sum of ratios and array of cumulative ratios
for ratio in "${ratios[@]}"; do
    (( total += ratio ))
    cumsums+=("$total")
done

# Don't need the last element (quoted so the shell cannot glob-expand it)
unset 'cumsums[-1]'

# Array of numbers of first line in each split file
for sum in "${cumsums[@]}"; do
    linenums+=( $(( sum * len / total + 1 )) )
done

csplit "${args[@]}" "$fname" "${linenums[@]}"

After the name of the file to split, it takes the ratios of the sizes of the split files relative to their sum, i.e.,

percsplit brown.txt 60 20 20
percsplit brown.txt 6 2 2
percsplit brown.txt 3 1 1

are all equivalent.

Usage similar to the case in the question is as follows:

$ percsplit -s -f part -n 1 brown.txt 60 20 20
$ wc -l part*
 34403 part0
 11468 part1
 11468 part2
 57339 total

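The numbering starts at zero, and there is no txt extension. The GNU version supports a --suffix-format option that would allow for a .txt extension and could be added to the accepted arguments, but that would require something more elaborate than getopts to parse them.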
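
When the wrapper is not needed, the GNU option can be passed to csplit directly. The sketch below assumes GNU coreutils (-b is the short form of --suffix-format and is not in POSIX); the function name split_txt is hypothetical. For the 60 20 20 run above, the first lines of the second and third parts are 34404 and 45872.

```shell
#!/bin/bash
# split_txt FILE LINENUM...   (hypothetical helper; requires GNU csplit)
# Writes part0.txt, part1.txt, ... where each LINENUM is the first
# line of a new part.
split_txt() {
    local file=$1
    shift
    # -s: silent, -f: filename prefix, -b: printf-style suffix (GNU extension)
    csplit -s -f part -b '%d.txt' "$file" "$@"
}
```

A call such as split_txt brown.txt 34404 45872 would then produce part0.txt, part1.txt and part2.txt.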

This solution plays nicely with very short files (a file of two lines is split into two), and the heavy lifting is done by csplit itself.