Split one file into multiple files based on delimiter

Question

107

I have one file with -| as a delimiter after each section. I need to create a separate file for each section using Unix tools.

Example of input file

wertretr
ewretrtret
1212132323
000232
-|
ereteertetet
232434234
erewesdfsfsfs
0234342343
-|
jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|

Expected result in File 1

wertretr
ewretrtret
1212132323
000232
-|

Expected result in File 2

ereteertetet
232434234
erewesdfsfsfs
0234342343
-|

Expected result in File 3

jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|

- user1499178

1

Are you writing a program, or do you want to do this with command line utilities? - rkyser

2

Command line utilities would be preferable. - user1499178

You could use awk; it would be easy to write a 3 or 4 line program to do it. Unfortunately I am out of practice. - ctrl-alt-delor

12 Answers

46

awk '{f="file" NR; print $0 " -|"> f}' RS='-\\|'  input-file

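Explanation (edited):

RS is the record separator. This solution uses a GNU awk extension that allows it to be more than one character. NR is the record number.

The print statement writes a record followed by " -|" into a file whose name contains the record number.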

- William Pursell

1

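RS is the record separator, and this solution uses a GNU awk extension that allows it to be more than one character. NR is the record number. The print statement writes a record followed by " -|" into a file whose name contains the record number. - William Pursell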

1

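@rzetterbeg This should work well for large files. awk processes the file one record at a time, so it only reads as much as it needs to. If the first occurrence of the record separator shows up very late in the file, memory may become a problem, since one whole record must fit into memory. Also, note that using more than one character in RS is not standard awk, but it works in GNU awk. - William Pursell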

4

For me, it split a 3.3 GB file in 31.728 seconds. - Cleankod

4

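The filename is just the string on the right-hand side of the >, so you can construct it however you like. For example, print $0 "-|" > "file" NR ".txt" writes the current record followed by "-|" to a file named "file" NR ".txt". - William Pursell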

1

@AGrush That is version dependent. You can do awk '{f="file" NR; print $0 " -|" > f}'. - William Pursell
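For readers without GNU awk, the record-separator idea described above can be sketched in Python. This is only an illustration under stated assumptions, not the answer author's method: the names iter_records and split_file, the chunk size, and the choice to drop the newline that follows each separator are all made up for this sketch. Like the awk version, it holds at most one record (plus one chunk) in memory.

```python
def iter_records(stream, separator="-|", chunk_size=64 * 1024):
    """Yield the text between occurrences of `separator`, reading in chunks."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete record currently held in the buffer.
        while True:
            index = buffer.find(separator)
            if index == -1:
                break
            yield buffer[:index]
            buffer = buffer[index + len(separator):]
    if buffer.strip():
        # Text after the last separator; skipped when it is only whitespace.
        yield buffer


def split_file(input_path, prefix="file", separator="-|"):
    """Write record N to '<prefix>N' and re-append the separator on its own line."""
    written = []
    with open(input_path) as source:
        for number, record in enumerate(iter_records(source, separator), start=1):
            name = f"{prefix}{number}"
            with open(name, "w") as out:
                # Assumption: leading newlines left over from the previous
                # separator line are not part of the section.
                out.write(record.lstrip("\n") + separator + "\n")
            written.append(name)
    return written
```

Calling split_file("input-file") would write file1, file2, and so on into the current directory. Unlike the awk one-liner, this sketch does not put a space before the re-appended separator.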

7

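Debian has csplit, but I don't know if that's common to all/most/other distributions. If not, though, it shouldn't be too hard to track down the source and compile it...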

- twalberg

1

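I agree. My Debian box says that csplit is part of GNU coreutils, so any GNU operating system, such as all the GNU/Linux distros, will have it. Wikipedia also mentions "The Single UNIX® Specification, Issue 7" on the csplit page, so I suspect you have it. - ctrl-alt-delor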

3

Since csplit is in POSIX, I would expect it to be available on essentially all Unix-like systems. - Jonathan Leffler

1

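Although csplit is POSIX, the problem (it seems, from testing on the Ubuntu system in front of me) is that there is no obvious way to make it use a more modern regex syntax. Compare: csplit --prefix gold-data - "/^==*$/" vs csplit --prefix gold-data - "/^=+$/". At least GNU grep has "-e". - new123456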
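The two patterns in the comment above describe the same set of lines: "==*" is the older spelling of "one or more equals signs" for tools that lack the + operator. A small sketch to check that, with the caveat that Python's re module is neither POSIX BRE nor ERE, so this only illustrates the equivalence of the two patterns and says nothing about what csplit itself accepts. The names below are made up for the example.

```python
import re

bre_style = re.compile(r"^==*$")  # one "=" followed by zero or more "="
ere_style = re.compile(r"^=+$")   # one or more "="


def same_verdict(line):
    """Return True when both patterns agree on whether the line matches."""
    return bool(bre_style.match(line)) == bool(ere_style.match(line))


samples = ["=", "=====", "", "==x", "x=="]
matches = [line for line in samples if bre_style.match(line)]
```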

5

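I solved a slightly different problem, where the file contains a line with the name of the file that the following text should go into. This Perl code does the trick for me: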

#!/path/to/perl -w

#comment the line below for UNIX systems
use Win32::Clipboard;

# Get command line flags

#print ($#ARGV, "\n");
if($#ARGV == 0) {
    print STDERR "usage: ncsplit.pl --mff -- filename.txt [...] \n\nNote that no space is allowed between the '--' and the related parameter.\n\nThe mff is found on a line followed by a filename.  All of the contents of filename.txt are written to that file until another mff is found.\n";
    exit;
}

# this package sets the ARGV count variable to -1;

use Getopt::Long;
my $mff = "";
GetOptions('mff' => \$mff);

# set a default $mff variable
if ($mff eq "") {$mff = "-#-"};
print ("using file switch=", $mff, "\n\n");

while($_ = shift @ARGV) {
    if(-f "$_") {
    push @filelist, $_;
    } 
}

# Could be more than one file name on the command line, 
# but this version throws away the subsequent ones.

$readfile = $filelist[0];

open SOURCEFILE, "<$readfile" or die "File not found...\n\n";
#print SOURCEFILE;

while (<SOURCEFILE>) {
  /^$mff (.*$)/o;
    $outname = $1;
#   print $outname;
#   print "right is: $1 \n";

if (/^$mff /) {

    open OUTFILE, ">$outname" ;
    print "opened $outname\n";
    }
    else {print OUTFILE "$_"};
  }

- John David Smith

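Could you please explain why this code works? I have a similar situation to the one you describe here: the required output file names are embedded inside the file. But I'm not a regular Perl user, so I can't quite make sense of this code. - shiri

The real beef is in the final while loop. If it finds the mff regex at the beginning of a line, it uses the rest of the line as the name of the file to open and start writing to. It never closes anything, so it will run out of file handles after a few dozen. - tripleee

The script would actually be improved by removing most of the code before the final while loop and switching to while (<>). - tripleee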
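For readers who do not use Perl, the loop described in the comments above can be sketched in Python. This is an illustration rather than a translation of the Perl script: the function name split_by_marker is hypothetical, the default marker "-#-" is taken from the script's default, and, unlike the script, this sketch closes each output file before opening the next one.

```python
import os


def split_by_marker(lines, out_dir, marker="-#-"):
    """Route each line to the file named on the most recent marker line."""
    opened = []
    out = None
    prefix = marker + " "
    try:
        for line in lines:
            if line.startswith(prefix):
                # Marker line: close the previous output before opening the next.
                if out is not None:
                    out.close()
                name = line[len(prefix):].strip()
                out = open(os.path.join(out_dir, name), "w")
                opened.append(name)
            elif out is not None:
                out.write(line)
            # Lines before the first marker have no destination and are skipped.
    finally:
        if out is not None:
            out.close()
    return opened
```

Because lines can be any iterable of strings, an open file object can be passed in directly, so the input is read one line at a time. File names are taken verbatim from the input, so this should only be run on input you trust.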

4

The following command works for me. Hope it helps.

awk 'BEGIN{file = 0; filename = "output_" file ".txt"}
    /-\|/ {getline; file ++; filename = "output_" file ".txt"}
    {print $0 > filename}' input

- Thanh

1

This will typically run out of file handles after a few dozen files. The fix is to explicitly close the old file when you start a new one. - tripleee

@tripleee How do you close it (beginner awk question)? Can you provide an updated example? - Jesper Rønn-Jensen

1

@JesperRønn-Jensen This box is probably too small for a useful example, but basically use if (file) close(filename); before assigning a new filename value. - tripleee

Found out how to close it: ; close(filename). Really simple, but it does fix the example above. - Jesper Rønn-Jensen

1

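@JesperRønn-Jensen I rolled back your edit because you provided a broken script. Significant edits to other people's answers should probably be avoided; feel free to post a new answer of your own (perhaps as a community wiki) if you think a separate answer is merited. - tripleee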
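The close-before-reopen idea from the comments above can be sketched in Python as follows. This is not a fix of the awk script itself. The function name split_sections is hypothetical, the output_N.txt naming is borrowed from the answer, and, unlike the awk script, this sketch keeps the delimiter line at the end of each section (as the question asks). Only one output handle is open at any time.

```python
import os


def split_sections(input_path, out_dir, delimiter="-|"):
    """Write each delimiter-terminated section to its own file."""
    paths = []
    out = None
    try:
        with open(input_path) as source:
            for line in source:
                if out is None:
                    # Open lazily so a trailing delimiter creates no empty file.
                    path = os.path.join(out_dir, f"output_{len(paths)}.txt")
                    out = open(path, "w")
                    paths.append(path)
                out.write(line)
                if line.rstrip("\n") == delimiter:
                    # Release the handle before the next section starts.
                    out.close()
                    out = None
    finally:
        if out is not None:
            # Input ended without a final delimiter (or an error occurred).
            out.close()
    return paths
```

Opening the next file only when there is a line to write also avoids the empty trailing file that appears when the input ends with a delimiter.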

3

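You can also use awk. I'm not very familiar with awk, but the following did seem to work for me. It generated part1.txt, part2.txt, part3.txt, and part4.txt. Note that the last partn.txt file it generates is empty. I'm not sure how to fix that, but I'm sure it could be done with a little tweaking. Any suggestions?

awk_pattern file: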

BEGIN{ fn = "part1.txt"; n = 1 }
{
   print > fn
   if (substr($0,1,2) == "-|") {
       close (fn)
       n++
       fn = "part" n ".txt"
   }
}

bash command:

awk -f awk_pattern input.file

- rkyser

2

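If you have the csplit tool, use it.

If you don't, but you have Python... don't use Perl.

Lazy reading of the file

Your file may be too large to hold in memory all at once, so reading line by line may be preferable. Assume the input file is named "samplein":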

$ python3 -c "from itertools import count
with open('samplein') as file:
    for i in count():
        firstline = next(file, None)
        if firstline is None:
            break
        with open(f'out{i}', 'w') as out:
            out.write(firstline)
            for line in file:
                out.write(line)
                if line == '-|\n':
                    break"

- Russia Must Remove Putin

This will read the entire file into memory, which means it will be inefficient or even fail for large files. - tripleee

1

@tripleee I have updated the answer to handle very large files. - Russia Must Remove Putin

2

Here is a Python 3 script that splits a file into multiple files based on the delimiters provided. Example input file:

# Ignored

######## FILTER BEGIN foo.conf
This goes in foo.conf.
######## FILTER END

# Ignored

######## FILTER BEGIN bar.conf
This goes in bar.conf.
######## FILTER END

Here is the script:

#!/usr/bin/env python3

import os
import argparse

# global settings
start_delimiter = '######## FILTER BEGIN'
end_delimiter = '######## FILTER END'

# parse command line arguments
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input-file", required=True, help="input filename")
parser.add_argument("-o", "--output-dir", required=True, help="output directory")

args = parser.parse_args()

# read the input file
with open(args.input_file, 'r') as input_file:
    input_data = input_file.read()

# iterate through the input data by line
input_lines = input_data.splitlines()
while input_lines:
    # discard lines until the next start delimiter
    while input_lines and not input_lines[0].startswith(start_delimiter):
        input_lines.pop(0)

    # corner case: no delimiter found and no more lines left
    if not input_lines:
        break

    # extract the output filename from the start delimiter
    output_filename = input_lines.pop(0).replace(start_delimiter, "").strip()
    output_path = os.path.join(args.output_dir, output_filename)

    # open the output file
    print("extracting file: {0}".format(output_path))
    with open(output_path, 'w') as output_file:
        # while we have lines left and they don't match the end delimiter
        while input_lines and not input_lines[0].startswith(end_delimiter):
            output_file.write("{0}\n".format(input_lines.pop(0)))

        # remove end delimiter if present
        if input_lines:
            input_lines.pop(0)

Finally, here is how you run it:

$ python3 script.py -i input-file.txt -o ./output-folder/

- ctrlc-root

1

cat file| ( I=0; echo -n "">file0; while read line; do echo $line >> file$I; if [ "$line" == '-|' ]; then I=$[I+1]; echo -n "" > file$I; fi; done )

And the formatted version:

#!/bin/bash
cat FILE | (
  I=0;
  echo -n "" > file0;
  while read line; 
  do
    echo $line >> file$I;
    if [ "$line" == '-|' ];
    then I=$[I+1];
      echo -n "" > file$I;
    fi;
  done;
)

- mbonnin

4

As always, the cat is useless: http://www.iki.fi/era/unix/award.html - tripleee

1

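@Reishin The linked page explains in much more detail how to avoid cat on a single file in every situation. There is also a Stack Overflow question with more discussion (though I think the accepted answer is wrong); https://dev59.com/pmgt5IYBdhLWcg3w7BuL - tripleee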

1

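The shell is typically very inefficient at this sort of thing anyway; if you can't use csplit, an Awk solution is probably preferable to this one (even if you were to fix the problems reported by http://shellcheck.net/; note that it doesn't currently find all the bugs in this). - tripleee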

@tripleee But what if the task is to do it without awk, csplit, etc., using only bash? - Reishin

1

Then the cat is still useless, and the rest of the script could be simplified and corrected a good deal; but it will still be slow. See e.g. https://dev59.com/Q2Yr5IYBdhLWcg3wVYwg - tripleee

This only outputs one file, identical to the original but without the .txt extension. - AGrush

0

Try this Python script:

import os
import argparse

delimiter = '-|'

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input-file", required=True, help="input txt")
parser.add_argument("-o", "--output-dir", required=True, help="output directory")

args = parser.parse_args()

counter = 1
output_filename = 'part-'+str(counter)
with open(args.input_file, 'r') as input_file:
    for line in input_file.read().split('\n'):
        if delimiter in line:
            counter = counter+1
            output_filename = 'part-'+str(counter)
            print('Section '+str(counter)+' Started')
        else:
            #skips empty lines (change the condition if you want empty lines too)
            if line.strip() :
                output_path = os.path.join(args.output_dir, output_filename+'.txt')
                with open(output_path, 'a') as output_file:
                    output_file.write("{0}\n".format(line))

Example:

python split.py -i ./to-split.txt -o ./output-dir

- Mehdi Nazari

- ctrl-alt-delor · Accepted Answer

A one-liner, no programming (except the regexp etc.).

csplit --digits=2  --quiet --prefix=outfile infile "/-|/+1" "{*}"

Tested on: csplit (GNU coreutils) 8.30

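Notes about usage on Apple Mac

"For macOS users, note that the version of csplit that ships with the OS doesn't work. You'll want the version from coreutils (installable via Homebrew), which is called gcsplit." - @Danial

"Just to add, you can get the macOS version to work (at least on High Sierra). You just need to tweak the arguments a bit: csplit -k -f=outfile infile "/-\|/+1" "{3}". The feature that doesn't work is "{*}"; I had to be specific about the number of separators, and I needed to add -k to stop it from deleting all the output files if it can't find a final separator. Also, if you want --digits, you need to use -n instead." - @Pebbl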