通过解析 .txt 文件创建制表符分隔的输出文件

4

[MacOS, Python 2.7]

我正在尝试解析一个 .txt 文件,并提取我想要的字符串,以创建一个制表符分隔的表格。我将不得不为许多文件做到这一点,但我在选择一些字符串时遇到了麻烦。

以下是输入文件示例:

# Assembly name:  ASM1844v1
# Organism name:  Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name:  strain=ACICU
# Taxid:          405416
# BioSample:      SAMN02603140
# BioProject:     PRJNA17827
# Submitter:      CNR - National Research Council
# Date:           2008-4-15
# Assembly type:  n/a
# Release type:   major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession   RefSeq Unit Accession   Assembly-Unit name
## GCA_000018455.1  GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role   Assigned-Molecule   Assigned-Molecule-Location/Type GenBank-Accn    Relationship    RefSeq-Accn Assembly-Unit   Sequence-Length UCSC-style-name
ANONYMOUS   assembled-molecule  na  Chromosome
CP000863.1  =   NC_010611.1 Primary Assembly    3904116 na
pACICU1 assembled-molecule  pACICU1 Plasmid CP000864.1  =   NC_010605.1 Primary Assembly    28279   na
pACICU2 assembled-molecule  pACICU2 Plasmid CP000865.1  =   NC_010606.1 Primary Assembly    64366   na

到目前为止,我的代码如下所示,headstring表示列标题:

# Open the input file for reading 
InFile = open(InFileName, 'r')
#f = open(InFileName, 'r')

# Write the header
Headstring= "GenBank_Assembly_ID    RefSeq_Assembly_ID  Assembly_level  Chromosome Plasmid  Refseq_chromosome   Refseq_plasmid1 Refseq_plasmid2 Refseq_plasmid3 Refseq_plasmid4 Refseq_plasmid5"

# Set up chromosome and plasmid count
ccount = 0
pcount = 0

# Look for corresponding data from each file
with open(InFileName, 'r') as searchfile:
    for line in searchfile:
        if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
            GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
            print GCA.group(1)
            GCA = GCA.group(1)
        if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
            GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
            print GCF.group(1)
            GCF = GCF.group(1) 
        if re.search ( r'level: (.+$)', line, re.M|re.I):
            assembly = re.search( r'level: (.+$)', line, re.M|re.I)
            print assembly.group(1)
            assembly = assembly.group(1)
        if "Chromosome" in line:
            ccount += 1
            print ccount
        if "Plasmid" in line:
            pcount += 1
            print pcount
    
        

OutputString = "%s\t%s\t%s\t%s\t%s\t" % (GCA, GCF, assembly, ccount, pcount)


OutFile=open(OutFileName, 'w')
OutFile.write(Headstring+'\n'+OutputString)


InFile.close()
OutFile.close()

主要问题是我想提取字符串NC_010611.1NC_010605.1NC_010606.1,并在同一行上它们之间有制表符,以便它们最终出现在Refseq_chromosome,Refseq_plasmid1Refseq_plasmid2标头下。但是,我只希望脚本在assembly = "Chromosome""Complete Genome"时搜索这些内容。我不知道如何仅在此条件为true时搜索字符串。
我知道获取这些字符串的正则表达式可以是=\t(\w+..),但这就是我的限度。
我对Python非常陌生,因此解释会很好。
2个回答

3

看看这个例子:

import re

InFileName  = 'YOUR_INPUT_FILE_NAME'
OutFileName = 'YOUR_OUTPUT_FILE_NAME'

# Write the header
Headstring= "GenBank_Assembly_ID\tRefSeq_Assembly_ID\tAssembly_level\tChromosome\tPlasmid\tRefseq_chromosome\tRefseq_plasmid1\tRefseq_plasmid2\tRefseq_plasmid3\tRefseq_plasmid4\tRefseq_plasmid5"

# Look for corresponding data from each file
with open(InFileName, 'r') as InFile, open(OutFileName, 'w') as OutFile:
    chromosomes = []
    plasmids = []
    for line in InFile:
        if line.lstrip()[0] == '#':
            # Process header part of the file differently from the data part
            if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
                GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
                print GCA.group(1)
                GCA = GCA.group(1)
            if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
                GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
                print GCF.group(1)
                GCF = GCF.group(1)
            if re.search ( r'level: (.+$)', line, re.M|re.I):
                assembly = re.search( r'level: (.+$)', line, re.M|re.I)
                print assembly.group(1)
                assembly = assembly.group(1)
        elif assembly in ['Chromosome', 'Complete Genome']:
            # Process each data line separately
            split_line = line.split()
            Type = split_line[3]
            RefSeq_Accn = split_line[6]
            if Type == "Chromosome":
                chromosomes.append(RefSeq_Accn)
            if Type == "Plasmid":
                plasmids.append(RefSeq_Accn)

    # Merge names of up to N chromosomes
    N = 1
    cstr = ''
    for i in range(N):
        if i < len(chromosomes):
            nextChromosome = chromosomes[i]
        else:
            nextChromosome = ''
        cstr += '\t' + nextChromosome

    # Merge names of up to M plasmids
    M = 5
    pstr = ''
    for i in range(M):
        if i < len(plasmids):
            nextPlasmid = plasmids[i]
        else:
            nextPlasmid = ''
        pstr += '\t' + nextPlasmid

    OutputString = "%s\t%s\t%s\t%s\t%s" % (GCA, GCF, assembly, len(chromosomes), len(plasmids))
    OutputString += cstr
    OutputString += pstr

    OutFile.write(Headstring+'\n'+OutputString)

输入:

# Assembly name:  ASM1844v1
# Organism name:  Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name:  strain=ACICU
# Taxid:          405416
# BioSample:      SAMN02603140
# BioProject:     PRJNA17827
# Submitter:      CNR - National Research Council
# Date:           2008-4-15
# Assembly type:  n/a
# Release type:   major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession   RefSeq Unit Accession   Assembly-Unit name
## GCA_000018455.1  GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role   Assigned-Molecule   Assigned-Molecule-Location/Type GenBank-Accn     Relationship    RefSeq-Accn Assembly-Unit   Sequence-Length UCSC-style-name
ANONYMOUS   assembled-molecule  na  Chromosome CP000863.1  =   NC_010611.1 Primary Assembly    3904116 na
pACICU1 assembled-molecule  pACICU1 Plasmid CP000864.1  =   NC_010605.1 Primary Assembly    28279   na
pACICU2 assembled-molecule  pACICU2 Plasmid CP000865.1  =   NC_010606.1 Primary Assembly    64366   na

输出:

GenBank_Assembly_ID  RefSeq_Assembly_ID      Assembly_level  Chromosome  Plasmid Refseq_chromosome  Refseq_plasmid1 Refseq_plasmid2  Refseq_plasmid3 Refseq_plasmid4  Refseq_plasmid5
GCA_000018445.1      GCF_000018445.1         Complete Genome 1           2       NC_010611.1        NC_010605.1     NC_010606.1

与您的脚本相比,主要差异如下:

  • 我使用条件if line.lstrip()[0] == '#'来处理“标题”行(以井号字符开头的行)和底部的“表格行”(实际包含每个序列数据的行)不同。
  • 我使用条件if assembly in ['Chromosome', 'Complete Genome'] - 这是您在问题中指定的条件。
  • 我将每个表格行分成值,例如split_line = line.split()。之后,我通过Type = split_line[3](这是表格数据中的第四列)获取类型,RefSeq_Accn = split_line[6]给出了表格中的第七列。

谢谢,这正是我要找的输出!但是我认为有些东西出了问题——我的输出给了我0条染色体和0个质粒,并且在输出文件中,“完整基因组”后面有一个“¿”。有没有可能是它不适用于我? - D.Parker
@D.Parker,我没有看到任何可能导致你的问题的东西。我建议你使用我的代码来运行我的输入,看看会发生什么。如果你在处理其他输入时遇到了这个问题,那么它可能只是其他格式,而不是你的问题中指定的格式。 - Andriy Makukha
啊,原来是我使用的输入文件有问题,格式不完全一样,我道歉!感谢你的帮助! - D.Parker
抱歉,我还有一个问题——实际上,在我的输入文件中,表格行是通过制表符进行解析的(而不是空格)——我以为添加split_line = line.split('\t')就可以解决这个问题,但事实并非如此……你有什么建议可以尝试一下,或者我是否犯了语法错误? - D.Parker
@D.Parker,line.split()适用于制表符和空格。如果没有看到导致问题的输入,我不知道发生了什么。您可以通过添加导致问题的输入文件来更新您的问题。 - Andriy Makukha
啊,我只需要将输入文件指定为“rU”-因为我有一台Mac,所以换行符不同吗?但无论如何,非常感谢你! - D.Parker

0
你可以先将所有数据读入pandas dataframe中开始处理。 然后,您可以以另一列为条件来处理一个列(其中包含“NC_010611.1”)。请参见此处的示例:Pandas conditional creation of a series/dataframe column
可能可以通过一次遍历数据来获得所需结果,但如果您通过两次遍历数据来实现,则可能更容易编写和阅读。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接