[MacOS, Python 2.7]
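I am trying to parse a .txt file and pull out the strings I want in order to build a tab-delimited table. I will have to do this for many files, but I am having trouble selecting some of the strings.
Here is an example input file: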
# Assembly name: ASM1844v1
# Organism name: Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name: strain=ACICU
# Taxid: 405416
# BioSample: SAMN02603140
# BioProject: PRJNA17827
# Submitter: CNR - National Research Council
# Date: 2008-4-15
# Assembly type: n/a
# Release type: major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession RefSeq Unit Accession Assembly-Unit name
## GCA_000018455.1 GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
ANONYMOUS assembled-molecule na Chromosome CP000863.1 = NC_010611.1 Primary Assembly 3904116 na
pACICU1 assembled-molecule pACICU1 Plasmid CP000864.1 = NC_010605.1 Primary Assembly 28279 na
pACICU2 assembled-molecule pACICU2 Plasmid CP000865.1 = NC_010606.1 Primary Assembly 64366 na
This is my code so far; Headstring holds the column headers:
import re

# InFileName and OutFileName are assumed to be defined earlier in the script
# Column headers, joined with tabs
Headstring = "\t".join([
    "GenBank_Assembly_ID", "RefSeq_Assembly_ID", "Assembly_level",
    "Chromosome", "Plasmid", "Refseq_chromosome", "Refseq_plasmid1",
    "Refseq_plasmid2", "Refseq_plasmid3", "Refseq_plasmid4", "Refseq_plasmid5"])
# Defaults so the output line can still be built if a field is missing
GCA = GCF = assembly = ''
# Set up chromosome and plasmid count
ccount = 0
pcount = 0
# Look for corresponding data from each file
with open(InFileName, 'r') as searchfile:
    for line in searchfile:
        # Run each search once and reuse the match object
        match = re.search(r': (GCA_[\d\.]+)', line, re.I)
        if match:
            GCA = match.group(1)
            print(GCA)
        match = re.search(r': (GCF_[\d\.]+)', line, re.I)
        if match:
            GCF = match.group(1)
            print(GCF)
        match = re.search(r'level: (.+)$', line, re.I)
        if match:
            assembly = match.group(1).strip()
            print(assembly)
        # Count only data rows, so "# Assembly level: Chromosome" is not counted
        if not line.startswith('#'):
            if "Chromosome" in line:
                ccount += 1
                print(ccount)
            if "Plasmid" in line:
                pcount += 1
                print(pcount)
OutputString = "%s\t%s\t%s\t%s\t%s\t" % (GCA, GCF, assembly, ccount, pcount)
# Write the header and the data row
with open(OutFileName, 'w') as OutFile:
    OutFile.write(Headstring + '\n' + OutputString)
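The main problem is that I want to extract the strings NC_010611.1, NC_010605.1 and NC_010606.1 and put them on the same line, separated by tabs, so that they end up under the Refseq_chromosome, Refseq_plasmid1 and Refseq_plasmid2 headers. However, I only want the script to search for them when assembly is "Chromosome" or "Complete Genome". I don't know how to search for a string only when that condition is true. I know the regex to capture these strings could be =\t(\w+..), but that is as far as I can get. I am very new to Python, so an explanation would be great.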
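One possible way to make the search conditional is to collect the candidate accessions while reading the file, then keep them only if the assembly level turns out to be one of the wanted values. The sketch below is not from the original script: the helper name extract_refseq and the constant WANTED_LEVELS are hypothetical, and it assumes the data rows are tab-separated with the molecule type in the 4th column and the RefSeq accession in the 7th, as the header line of the sample suggests.

```python
import re

# Hypothetical names; the column positions are an assumption based on the
# "# Sequence-Name ... RefSeq-Accn ..." header line in the sample file.
WANTED_LEVELS = ("Chromosome", "Complete Genome")

def extract_refseq(lines):
    """Return (assembly_level, chromosome_accessions, plasmid_accessions)."""
    level = None
    chromosomes, plasmids = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            # Metadata line: remember the assembly level if this is that line
            m = re.search(r"Assembly level:\s*(.+)$", line)
            if m:
                level = m.group(1).strip()
            continue
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # not a complete data row
        kind, refseq = fields[3], fields[6]
        if kind == "Chromosome":
            chromosomes.append(refseq)
        elif kind == "Plasmid":
            plasmids.append(refseq)
    # Discard the accessions unless the assembly level is one we want
    if level not in WANTED_LEVELS:
        return level, [], []
    return level, chromosomes, plasmids

sample = [
    "# Assembly level: Complete Genome",
    "ANONYMOUS\tassembled-molecule\tna\tChromosome\tCP000863.1\t=\tNC_010611.1\tPrimary Assembly\t3904116\tna",
    "pACICU1\tassembled-molecule\tpACICU1\tPlasmid\tCP000864.1\t=\tNC_010605.1\tPrimary Assembly\t28279\tna",
]
level, chroms, plasmids = extract_refseq(sample)
refseq_columns = "\t".join(chroms + plasmids)  # ready to append to the output row
```

Splitting on tabs and indexing by column avoids relying on a regex such as =\t(\w+..), which would stop short on accessions of a different length. The joined refseq_columns string could be appended after the trailing tab of OutputString.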
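line.split() works on both tabs and spaces. Without seeing the input that causes the problem, I can't tell what is going on. You can update your question by adding the input file that causes the problem. - Andriy Makukha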
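To illustrate the point about line.split(), here is a small sketch using one row from the sample input, assuming the columns in the real file are separated by tabs. It shows that splitting on any whitespace still finds the RefSeq accession at the same index, but breaks "Primary Assembly" into two fields, whereas splitting on tabs keeps it whole.

```python
# One data row from the sample report, assuming tab-separated columns
row = ("pACICU1\tassembled-molecule\tpACICU1\tPlasmid\tCP000864.1\t=\t"
       "NC_010605.1\tPrimary Assembly\t28279\tna")

by_whitespace = row.split()    # splits on any run of tabs or spaces
by_tab = row.split("\t")       # splits on tabs only

refseq_ws = by_whitespace[6]   # RefSeq accession, whitespace split
refseq_tab = by_tab[6]         # RefSeq accession, tab split
assembly_unit = by_tab[7]      # stays in one piece only with the tab split
```

Either call reaches the accession here, but any column after "Primary Assembly" shifts by one position under the whitespace split, so the tab split is the safer choice when indexing later columns.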