我正在使用Python和正则表达式来查找一个ORF(开放阅读框架)。
查找一个子字符串,它仅由字母ATGC
(没有空格或换行符)组成,该子字符串应满足以下条件:
以ATG
开头,以TAG
、TAA
或TGA
结尾,并且应从第一个字符、第二个字符和第三个字符考虑序列:
Seq= "CCTCAGCGAGGACAGCAAGGGACTAGCCAGGAGGGAGAACAGAAACTCCAGAACATCTTGGAAATAGCTCCCAGAAAAGC
AAGCAGCCAACCAGGCAGGTTCTGTCCCTTTCACTCACTGGCCCAAGGCGCCACATCTCCCTCCAGAAAAGACACCATGA
GCACAGAAAGCATGATCCGCGACGTGGAACTGGCAGAAGAGGCACTCCCCCAAAAGATGGGGGGCTTCCAGAACTCCAGG
CGGTGCCTATGTCTCAGCCTCTTCTCATTCCTGCTTGTGGCAGGGGCCACCACGCTCTTCTGTCTACTGAACTTCGGGGT
GATCGGTCCCCAAAGGGATGAGAAGTTCCCAAATGGCCTCCCTCTCATCAGTTCTATGGCCCAGACCCTCACACTCAGAT
CATCTTCTCAAAATTCGAGTGACAAGCCTGTAGCCCACGTCGTAGCAAACCACCAAGTGGAGGAGCAGCTGGAGTGGCTG
AGCCAGCGCGCCAACGCCCTCCTGGCCAACGGCATGGATCTCAAAGACAACCAACTAGTGGTGCCAGCCGATGGGTTGTA
CCTTGTCTACTCCCAGGTTCTCTTCAAGGGACAAGGCTGCCCCGACTACGTGCTCCTCACCCACACCGTCAGCCGATTTG
CTATCTCATACCAGGAGAAAGTCAACCTCCTCTCTGCCGTCAAGAGCCCCTGCCCCAAGGACACCCCTGAGGGGGCTGAG
CTCAAACCCTGGTATGAGCCCATATACCTGGGAGGAGTCTTCCAGCTGGAGAAGGGGGACCAACTCAGCGCTGAGGTCAA
TCTGCCCAAGTACTTAGACTTTGCGGAGTCCGGGCAGGTCTACTTTGGAGTCATTGCTCTGTGAAGGGAATGGGTGTTCA
TCCATTCTCTACCCAGCCCCCACTCTGACCCCTTTACTCTGACCCCTTTATTGTCTACTCCTCAGAGCCCCCAGTCTGTA
TCCTTCTAACTTAGAAAGGGGATTATGGCTCAGGGTCCAACTCTGTGCTCAGAGCTTTCAACAACTACTCAGAAACACAA
GATGCTGGGACAGTGACCTGGACTGTGGGCCTCTCATGCACCACCATCAAGGACTCAAATGGGCTTTCCGAATTCACTGG
AGCCTCGAATGTCCATTCCTGAGTTCTGCAAAGGGAGAGTGGTCAGGTTGCCTCTGTCTCAGAATGAGGCTGGATAAGAT
CTCAGGCCTTCCTACCTTCAGACCTTTCCAGATTCTTCCCTGAGGTGCAATGCACAGCCTTCCTCACAGAGCCAGCCCCC
CTCTATTTATATTTGCACTTATTATTTATTATTTATTTATTATTTATTTATTTGCTTATGAATGTATTTATTTGGAAGGC
CGGGGTGTCCTGGAGGACCCAGTGTGGGAAGCTGTCTTCAGACAGACATGTTTTCTGTGAAAACGGAGCTGAGCTGTCCC
CACCTGGCCTCTCTACCTTGTTGCCTCCTCTTTTGCTTATGTTTAAAACAAAATATTTATCTAACCCAATTGTCTTAATA
ACGCTGATTTGGTGACCAGGCTGTCGCTACATCACTGAACCTCTGCTCCCCACGGGAGCCGTGACTGTAATCGCCCTACG
GGTCATTGAGAGAAATAA"
我尝试过的方法:
# finding the stop codon here
def stop_codon(seq_0):
for i in range(0,len(seq_0),3):
if (seq_0[i:i+3]== "TAA" and i%3==0) or (seq_0[i:i+3]== "TAG" and i%3==0) or (seq_0[i:i+3]== "TGA" and i%3==0) :
a =i+3
break
else:
a = None
# finding the start codon here
startcodon_find =[m.start() for m in re.finditer('ATG', seq_0)]
我该如何找到检查起始密码子并找到第一个终止密码子的方法。然后找到下一个起始密码子和下一个终止密码子。
我希望将此运行三个框架。如前所述,三个框架将考虑序列的第一个、第二个和第三个字符作为起始点。
此外,序列需要被分成小的3部分。因此应该像这样:
ATG TTT AAA ACA AAA TAT TTA TCT AAC CCA ATT GTC TTA ATA ACG CTG ATT TGA
任何帮助都将不胜感激。
我最终的答案:def orf_find(st0):
seq_0=""
for i in range(0,len(st0),3):
if len(st0[i:i+3])==3:
seq_0 = seq_0 + st0[i:i+3]+ " "
ms_1 =[m.start() for m in re.finditer('ATG', seq_0)]
ms_2 =[m.start() for m in re.finditer('(TAA)|(TAG)|(TGA)', seq_0)]
def get_next(arr,value):
for a in arr:
if a > value:
return a
return -1
codons = []
start_codon=ms_1[0]
while (True):
stop_codon = get_next(ms_2,start_codon)
if stop_codon == -1:
break
codons.append((start_codon,stop_codon))
start_codon = get_next(ms_1,stop_codon)
if start_codon==-1:
break
max_val = 0
selected_tupple = ()
for i in codons:
k=i[1]-i[0]
if k > max_val:
max_val = k
selected_tupple = i
print "selected tupple is ", selected_tupple
final_seq=seq_0[selected_tupple[0]:selected_tupple[1]+3]
print final_seq
print "The longest orf length is " + str(max_val)
output_file = open('Longorf.txt','w')
output_file.write(str(orf_find(st0)))
output_file.close()
上述的写入函数对我在将内容写入文本文件时没有帮助。我在那里得到的只是 NONE。为什么会出现这种错误?有人能帮忙吗?