Concatenate several files and remove the header lines

Question

python unix awk

3

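What is a good way to concatenate several files while removing their header lines (the number of header lines is not known in advance) and keeping the first file's header as the header of the new concatenated file?

I would like to do this in Python, but awk or another language would also work, as long as I can call the Unix command through subprocess.

Note: all header lines start with #.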

- Dnaiel

1

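Please provide some examples (e.g. file1 and file2 with their respective headers). - user1786283

@AnsgarWiechers The header lines start with #. - user1786283

Yes, I just noticed that. Sorry for the noise. - Ansgar Wiechers

Do you know that mode 'a' lets you write at the end of a file? I ask because, if you do not need to keep the first file as it is, you could write all the other files into the first one, one after another. - eyquem

1

Could one of the files be very large? How the files should be processed depends on whether each one can easily be read into RAM in a single chunk. - eyquem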
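
The two comments above can be combined into a rough sketch: open the first file in mode 'a' and stream the other files into it line by line, so nothing has to fit in RAM at once. The function name `append_without_headers` and its parameters are made up for illustration, and the sketch assumes the first file already ends with a newline.

```python
def append_without_headers(first_path, other_paths):
    """Append the non-header lines of other_paths to first_path in place."""
    # Mode "a" places every write at the end of the existing file,
    # so the first file's header and body are left untouched.
    with open(first_path, "a") as out:
        for path in other_paths:
            with open(path) as src:
                # Iterating over a file object yields one line at a time,
                # so a large input is never loaded into memory as a whole.
                for line in src:
                    if not line.startswith("#"):
                        out.write(line)
```

Note that this modifies the first file, so it only fits when the original does not need to be preserved.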

7 Answers

5

Something like this in Python:

files = ["file1","file2","file3"]

with open("output_file","w") as outfile:
    with open(files[0]) as f1:
        for line in f1:        #keep the header from file1
            outfile.write(line)

    for x in files[1:]:
        with open(x) as f1:
            for line in f1:
                if not line.startswith("#"):
                    outfile.write(line)

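You can also use the fileinput module here:

This module implements a helper class and functions to quickly write a loop over standard input or a list of files.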

import fileinput
header_over = False
with open("out_file","w") as outfile:
    for line in fileinput.input():
        if line.startswith("#") and not header_over:
            outfile.write(line)
        elif not line.startswith("#"):
            outfile.write(line)
            header_over = True

Usage: $ python so.py file1 file2 file3

Input:

file1:

#header file1
foo
bar

file2:

#header file2
spam
eggs

file3:

#header file3
python
file

Output:

#header file1
foo
bar

spam
eggs

python
file

- Ashwini Chaudhary

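Thanks! Will this be fast? In my experience Python sometimes takes a long time to read long files. Do you think there is a faster option, or is this good enough? - Dnaiel

Could any lines other than the header lines start with "#"? - chepner

@Dnaiel, I have added another solution that uses the fileinput module. - Ashwini Chaudhary

@AshwiniChaudhary. This is great. - PDash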
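
Regarding chepner's question: if lines in the body may also start with "#", filtering every such line would lose data. A possible variation, sketched below, drops only the leading run of "#" lines in each file after the first, using itertools.dropwhile. The names `is_header` and `concat_leading_headers_only` are invented for this example, and it assumes the header is always a contiguous block at the top of each file.

```python
from itertools import dropwhile


def is_header(line):
    """Return True for lines that look like header lines."""
    return line.startswith("#")


def concat_leading_headers_only(paths, out_path):
    """Concatenate paths into out_path, keeping only the first file's header."""
    with open(out_path, "w") as out:
        for index, path in enumerate(paths):
            with open(path) as src:
                # First file: copy everything, header included.
                # Other files: skip "#" lines only until the first body line,
                # so later "#" lines inside the body are preserved.
                lines = src if index == 0 else dropwhile(is_header, src)
                out.writelines(lines)
```

Both branches stream the input lazily, so memory use stays flat even for long files.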

1

Try this:

def combine(*files):
    with open("result.txt", "w") as result:
        for index, name in enumerate(files):
            with open(name) as f:
                for line in f:
                    # Keep every line of the first file; skip "#" headers elsewhere.
                    if index == 0 or not line.strip().startswith("#"):
                        result.write(line)



combine("file1.txt", "file2.txt")

file1.txt:

#header2
body2

file2.txt:

#header2
body2

result.txt:

#header2
body2
body2

- user1786283

1

Using GNU awk:

awk '
    ARGIND == 1 { print; next } 
    /^[[:space:]]*#/ { next }
    { print }
' *.txt

- Birei

Your regular expression needs to be /^[[:space:]]*#/: named character classes must be placed inside a bracket expression. - glenn jackman

@Birei ARGIND only exists in gawk. - jaypal singh
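
Since ARGIND is gawk-only, here is a rough Python counterpart of the same logic for the case where gawk is not available: the first file is copied verbatim, and in the remaining files any line matching optional whitespace followed by "#" is skipped. The function name `concat_first_file_verbatim` and the `HEADER` pattern name are illustrative only, and the sketch assumes the first path does not appear again later in the list.

```python
import fileinput
import re

# Optional leading whitespace, then "#", similar to /^[[:space:]]*#/ in awk.
HEADER = re.compile(r"\s*#")


def concat_first_file_verbatim(paths, out_path):
    """Copy paths[0] as is, then the non-header lines of the other files."""
    first = paths[0]
    with open(out_path, "w") as out, fileinput.input(files=paths) as stream:
        for line in stream:
            # stream.filename() names the file the current line came from,
            # which plays the role of the ARGIND == 1 test.
            if stream.filename() == first or not HEADER.match(line):
                out.write(line)
```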

1

You can invoke a shell pipeline by passing shell=True to subprocess.Popen.

cat f.1 ;  grep -v -h '^#' f.2 f.3 f.4 f.5

Quick example

import sys, subprocess
p = subprocess.Popen('''cat f.1 ;  grep -v -h '^#' f.2 f.3 f.4 f.5''', shell=True,
stdout=sys.stdout)
p.wait()

- iruvar
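
If the file names come from user input, the same two commands can probably be run without shell=True by passing argument lists, which avoids shell quoting problems. The sketch below is one way to do that and writes straight to an output file. The function name `concat_with_unix_tools` is made up, and it assumes `cat` and a `grep` that supports `-h` are on the PATH.

```python
import subprocess


def concat_with_unix_tools(first_path, other_paths, out_path):
    """Run cat on the first file and grep -v '^#' on the rest, into out_path."""
    with open(out_path, "wb") as out:
        # Both child processes write to the same file descriptor,
        # so grep's output lands directly after cat's.
        subprocess.run(["cat", "--", first_path], stdout=out, check=True)
        if other_paths:
            result = subprocess.run(
                ["grep", "-v", "-h", "-e", "^#", "--", *other_paths],
                stdout=out,
            )
            # grep exits with 1 when it selects no lines, which is not an
            # error here; anything above 1 signals a real failure.
            if result.returncode > 1:
                raise subprocess.CalledProcessError(result.returncode, result.args)
```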

1

I would probably do it like this:

#!/usr/bin/env python3

import sys

for i in range(1, len(sys.argv)):
    with open(sys.argv[i]) as infile:
        for line in infile:
            # Keep everything from the first file; drop "#" headers from the rest.
            if i == 1 or not line.startswith("#"):
                print(line.rstrip('\n'))

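Run the script with the files as arguments and redirect the output to the result file:

$ ./combine.py foo.txt bar.txt baz.txt > result.txt

The header is taken from the first file in the argument list (foo.txt in the example above).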

- Ansgar Wiechers

0

Another awk version:

awk 'NR == FNR || !/^#/' f1 f2 f3

- jaypal singh

Content provided by Stack Overflow.

- Alper · Accepted Answer

I would do it like this:

(cat file1; sed '/^#/d' file2 file3 file4) > newFile