解析多重fasta文件以提取序列

Question

解析多重fasta文件以提取序列

3

我正在尝试用Python编写脚本来解析一个大型fasta文件，但我不想使用biopython，因为我正在学习脚本编写。该脚本需要将访问号、序列长度和GC含量打印到控制台。我已经能够提取出访问号，但无法提取序列，因为它们被读取为行，这阻止了我计算序列长度和GC含量。

有人可以帮帮我吗？我已经尝试将这些行分组成一个列表，但那样会在一个列表内创建多个列表，而且我也不确定如何将它们连接起来。

seq=""
seqcount=0
seqlen=0
gc=0

#prompt user for file name
infile=input("Enter the name of your designated .fasta file: ")

with open(infile, "r") as fasta:
    print("\n")
    print ("Accession Number \t Sequence Length \t GC content (%)")
    for line in fasta:
       line.strip()
       if line[0]==">":
           seqcount+=1 #counts number sequences in file
           accession=line.split("|")[3] #extract accession
           seq=""
       else: 
           seq+=line[:-1]
           seqlen=len(seq)
           print(accession, "\t \t", seqlen)

print("\n")           
print("There are a total of", seqcount, "sequences in this file.")

- k.smith

1

什么是fasta文件？它有什么格式？我想不多的人知道它。 - Eugene Lisitsky

我认为你所在的领域有点小众，正如@EugeneLisitsky所说，很少有人知道fasta文件的样子，也无法帮助你。如果你想自己学习如何做到这一点，最好的方法是阅读biopython方法。https://github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py - Dalvenjia

1

一个 .fasta 文件是一种基于文本的格式，通常用于表示核苷酸或肽。头部或基因名称通常长这样：

基因|访问号 ACTGACTAGGGACTGADEA

- k.smith

这种情况很有趣 - 我熟悉Python和遗传学（密码子三联体等），但我从未使用过fasta或biopython。因此，良好的定义非常有用。谢谢;) - Eugene Lisitsky

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Poshi · Accepted Answer

您离正确的代码并不远：

seq=""
seqcount=0

#prompt user for file name
infile=input("Enter the name of your designated .fasta file: ")

def pct_gc(s):
    gc = s.count('G') + s.count('C') + s.count('g') + s.count('c')
    total = len(s)

    return gc*100.0/total

with open(infile, "r") as fasta:
    print("\n")
    print ("Accession Number\tSequence Length\tGC content (%)")
    for line in fasta:
        line = line.strip()
        if line[0]==">":
            if seq != "":
                print("{}\t{}\t{}".format(accession, pct_gc(seq), len(seq)))

            seqcount+=1 #counts number sequences in file
            accession=line.split("|")[3] #extract accession
            seq=""
        else: 
            seq+=line[:-1]

print("{}\t{}\t{}".format(accession, pct_gc(seq), len(seq)))
print("\n")           
print("There are a total of " + str(seqcount) + " sequences in this file.")

需要留意的事项：

在每次迭代中不需要更新长度，只需在最后计算即可。
str.strip() 不会修改对象，而是返回一个剥离空格的对象。
您必须利用已知下一个序列时已读取完整个序列且序列不为空的事实。在此时，您必须编写输出。
最后一个序列没有以新的访问标识符结束，因此您必须在循环后独立处理它。
使用字符串格式或连接字符串。如果您只是将字符串和变量用逗号隔开放置，则会得到元组表示形式的输出。