在Python中对齐DNA序列

Question

在Python中对齐DNA序列

4

我有成千上万个长度在100到5000 bp之间的DNA序列，需要对指定的序列对进行比对并计算相似度得分。 Biopython pairwise2可以很好地处理短序列，但当序列大小超过2kb时，它会出现严重的内存泄漏，导致'MemoryError'错误，即使使用了'score_only'和'one_alignment_only'选项也无济于事！

whole_coding_scores={}
from Bio import pairwise2
for genes in whole_coding: # whole coding is a <25Mb dict providing DNA sequences
   alignment=pairwise2.align.globalxx(whole_coding[genes][3],whole_coding[genes][4],score_only=True,one_alignment_only=True)
   whole_coding_scores[genes]=alignment/min(len(whole_coding[genes][3]),len(whole_coding[genes][4]))

超级计算机返回的结果：

Max vmem         = 256.114G  #Memory usage of the script
failed assumedly after job because:
job 4945543.1 died through signal XCPU (24)

我知道有其他的对齐工具，但它们主要只能在输出文件中写入分数，需要重新读取和解析以检索和使用对齐分数。

是否有任何工具可以在Python环境中对齐序列并返回对齐分数，就像pairwise2一样，但没有内存泄漏？

- Masih

3个回答

2

对于全局比对，可以尝试使用NWalign https://pypi.python.org/pypi/nwalign/。我没有使用过，但似乎可以在脚本中恢复比对分数。

否则，也许EMBOSS工具可以帮助：http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html

- jpalmer

1

Biopython现在可以了。Biopython版本1.68中的pairwise2模块速度更快，可以处理更长的序列。以下是新旧pairwise2的比较（在32位Python 2.7.11上，有2 GByte内存限制，64位Win7，Intel Core i5，2.8 GHz）：

from Bio import pairwise2

length_of_sequence = ...
seq1 = '...'[:length_of_sequence]  # Two very long DNA sequences, e.g. human and green monkey
seq2 = '...'[:length_of_sequence]  # dystrophin mRNAs (NM_000109 and XM_007991383, >13 kBp)
aln = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -1)

旧的pairwise2
- 最大长度/时间：约1,900个字符/10秒
新的pairwise2
- 最大长度/时间：约7,450个字符/12秒
- 1,900个字符的时间：1秒

当score_only设置为True时，新的pairwise2可以在6秒内处理两个约8400个字符的序列。

- Markus

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Ward · Accepted Answer

首先，我使用了BioPython的needle。可以在这里找到一个不错的教程（忽略遗留设计 :-)）。

其次：也许您可以通过使用生成器来避免将整个集合读入内存？我不知道您的'whole_coding'对象来自哪里。但是，如果它是一个文件，请确保不要读取整个文件，然后迭代内存对象。例如：

whole_coding = open('big_file', 'rt').readlines() # Will consume memory

但是

for gene in open('big_file', 'rt'):     # will not read the whole thing into memory first
    process(gene)

如果您需要进行处理，可以编写生成器函数：

def gene_yielder(filename):
    for line in open('filename', 'rt'):
        line.strip()   # Here you preprocess your data
        yield line     # This will return

那么

for gene in  gene_yielder('big_file'):
    process_gene(gene)

基本上，你希望你的程序像一条管道一样工作：物品流经它，并得到处理。在制作肉汤时不要将其用作炊具：将所有东西添加进去并加热。我希望这个比喻不会太离谱 :-)