Python: How to count the frequency of three-nucleotide codons

Question

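My translation code runs fine, but when I run the assertion checks it does not pass and reports an error: the dictionary keys should be strings, not tuples. I understand the problem, but I don't know how to fix it.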

AssertionError:

     <class 'tuple'> != <class 'str'>

def codon_usage(dna_sequence):
    '''
    takes a DNA sequence (in string format) as input, parses it into codons using parse_sequence(),
    counts each type of codon and returns the codons' frequency as a dictionary of counts;
    the keys of the dictionary must be in string format
    '''
    codon_freq = dict()

    # split string with parse_sequence()
    parsed = parse_sequence(dna_sequence)  # defined earlier; it currently returns a list of one-element tuples, one per codon

    # count each type of codons in DNA sequence
    from collections import Counter
    codon_freq = Counter(parsed)

    return codon_freq

codon_freq1 = codon_usage(dna_sequence1)
print("Sequence 1 Codon Frequency:\n{0}".format(codon_freq1))

codon_freq2 = codon_usage(dna_sequence2)
print("\nSequence 2 Codon Frequency:\n{0}".format(codon_freq2))

Assertion checks

assert_equal(codon_usage('ATATTAAAGAATAATTTTATAAAAATATGT'), 
             {'AAA': 1, 'AAG': 1, 'AAT': 2, 'ATA': 3, 'TGT': 1, 'TTA': 1, 'TTT': 1})
assert_equal(type((list(codon_freq1.keys()))[0]), str)

About parse_sequence:

def parse_sequence(dna_sequence):
    codons = []

    if len(dna_sequence) % 3 == 0:
        for i in range(0,len(dna_sequence),3):
            codons.append((dna_sequence[i:i + 3],))

    return codons

- colbyjackson

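Sample data? Could you edit this into a minimal, complete, and verifiable example (MCVE)? - Stedy

Please read [mcve] - there is not enough information in your question. Providing a minimal example of parsed and the expected result would probably help. - wwii

I made some changes, if that makes it any better. - colbyjackson

Could you provide the full stack trace of the error? - Daniel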

2 Answers

You are parsing the sequence correctly, but the results are tuples rather than the required strings, for example:

>>> s = "ATATTAAAGAATAATTTTATAAAAATATGT"
>>> parse_sequence(s)
[('ATA',),
 ('TTA',),
 ('AAG',),
 ('AAT',),
 ('AAT',),
 ('TTT',),
 ('ATA',),
 ('AAA',),
 ('ATA',),
 ('TGT',)]

Simply remove the trailing comma from this line, so that a string is appended instead of a one-element tuple:

    ...
    codons.append((dna_sequence[i:i + 3],))
    ...
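
As a minimal sketch of that fix (not part of the original answer), the version below keeps the asker's parse_sequence logic and only drops the trailing comma. The wrapper name codon_usage is taken from the asker's assertions; everything else is assumed unchanged.

```python
from collections import Counter


def parse_sequence(dna_sequence):
    """Split a DNA string into a list of 3-letter codon strings."""
    codons = []
    # Same guard as in the question: only parse lengths divisible by 3.
    if len(dna_sequence) % 3 == 0:
        for i in range(0, len(dna_sequence), 3):
            # No trailing comma, so a str is appended rather than a 1-tuple.
            codons.append(dna_sequence[i:i + 3])
    return codons


def codon_usage(dna_sequence):
    """Count how often each codon occurs."""
    return Counter(parse_sequence(dna_sequence))


counts = codon_usage("ATATTAAAGAATAATTTTATAAAAATATGT")
print(type(next(iter(counts))))  # <class 'str'>
```

Because Counter is a dict subclass, comparing it against the plain dictionary in the assertion works once the keys are strings.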

FYI, a sliding window is a technique that can be applied to codon matching. Here is a complete, simplified example using the third-party tool more_itertools.windowed:

import collections as ct

import more_itertools as mit


def parse_sequence(dna_sequence):
    """Return a generator of codons."""
    return ("".join(codon) for codon in mit.windowed(dna_sequence, 3, step=3))

def frequency(dna_sequence):
    """Return a Counter of codon frequency."""
    parsed = parse_sequence(dna_sequence)
    return ct.Counter(parsed)

Test

s = "ATATTAAAGAATAATTTTATAAAAATATGT"
expected = {'AAA': 1, 'AAG': 1, 'AAT': 2, 'ATA': 3, 'TGT': 1, 'TTA': 1, 'TTT': 1}
assert frequency(s) == expected
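
To illustrate what "sliding window" means here without a third-party dependency, the sketch below uses a hypothetical helper named windows (it is not from more_itertools or from the answer above). With step=1 the windows overlap; with step=3 they line up with codons. Dropping an incomplete trailing window is a choice made for this sketch.

```python
def windows(sequence, size, step=1):
    """Yield slices of `size` items, advancing `step` items each time."""
    # Only full windows are produced; a short tail is dropped.
    for start in range(0, len(sequence) - size + 1, step):
        yield sequence[start:start + size]


s = "ATATTAAAG"
print(list(windows(s, 3)))          # overlapping windows: ATA, TAT, ATT, ...
print(list(windows(s, 3, step=3)))  # ['ATA', 'TTA', 'AAG']
```

Since slicing a string yields strings, no join step is needed in this variant.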

- pylang

- jq170727 · Accepted Answer

You can simply use a Counter with a comprehension, which is a bit easier. For example:

>>> s = 'ATATTAAAGAATAATTTTATAAAAATATGT'
>>> [s[3*i:3*i+3] for i in range(len(s)//3)]
['ATA', 'TTA', 'AAG', 'AAT', 'AAT', 'TTT', 'ATA', 'AAA', 'ATA', 'TGT']
>>> from collections import Counter
>>> Counter([s[3*i:3*i+3] for i in range(len(s)//3)])
Counter({'ATA': 3, 'AAT': 2, 'TTA': 1, 'AAG': 1, 'TTT': 1, 'AAA': 1, 'TGT': 1})
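
The same idea can be wrapped in a function and checked against the asker's two assertions. This is only a sketch written for Python 3: the name codon_usage comes from the question, plain assert stands in for assert_equal, and ignoring a trailing partial codon is an assumption rather than something the question specifies.

```python
from collections import Counter


def codon_usage(dna_sequence):
    """Count non-overlapping 3-letter codons with a single comprehension."""
    # Integer division (//) ignores any trailing partial codon.
    n_codons = len(dna_sequence) // 3
    return Counter(dna_sequence[3 * i:3 * i + 3] for i in range(n_codons))


codon_freq1 = codon_usage("ATATTAAAGAATAATTTTATAAAAATATGT")
expected = {'AAA': 1, 'AAG': 1, 'AAT': 2, 'ATA': 3, 'TGT': 1, 'TTA': 1, 'TTT': 1}

# Both checks from the question, expressed with plain assert.
assert codon_freq1 == expected
assert type(list(codon_freq1.keys())[0]) is str
```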