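I wrote a piece of code that finds the common IDs in line[1] of two different files. My input file is huge (2 million lines). If I split it into many small files, it gives me more intersecting IDs, whereas if I run the whole file I get far fewer intersections. I cannot figure out why. Can you suggest what is wrong and how to improve this code to avoid the problem?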
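# Index file1 by the ID in column 1 (a repeated ID overwrites the earlier row).
dictA = dict()
with open("file1.txt", 'r') as fileA:
    for line1 in fileA:
        listA = line1.rstrip('\n').split('\t')
        dictA[listA[1]] = listA

# Index file2 the same way.
dictB = dict()
with open("file2.txt", 'r') as fileB:
    for line1 in fileB:
        listB = line1.rstrip('\n').split('\t')
        dictB[listB[1]] = listB

# Write one row for every ID that is present in both files.
with open("result.txt", 'w') as output:
    for key in dictB:
        if key in dictA:
            output.write(dictA[key][0]+'\t'+dictA[key][1]+'\t'+dictB[key][4]+'\t'+dictB[key][5]+'\t'+dictB[key][9]+'\t'+dictB[key][10]+'\n')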
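One thing that may be worth checking (an assumption, not something the data shown here confirms): each dictionary keeps only one row per ID, so IDs that repeat within a file are collapsed when the whole file is processed, while separate small files can each report the same ID again. The sketch below counts repeated IDs in column 1. The name count_duplicate_ids and the third sample row are hypothetical and exist only for illustration.

```python
from collections import Counter


def count_duplicate_ids(lines, column=1):
    """Return {id: occurrences} for IDs that appear more than once in `column`."""
    counts = Counter(
        line.rstrip("\n").split("\t")[column]
        for line in lines
        if line.strip()  # skip blank lines
    )
    return {key: n for key, n in counts.items() if n > 1}


# Tiny in-memory sample; the last row is made up to show a repeated ID.
# With real data, pass an open file object instead of a list.
sample = [
    "contig17\tGRMZM2G052619_P03\t98\n",
    "contig33\tAT2G41790.1\t98\n",
    "contig40\tGRMZM2G052619_P03\t87\n",
]
duplicates = count_duplicate_ids(sample)
print(duplicates)
```

If this reports many repeated IDs for either file, the difference between the split runs and the single run would be consistent with rows being overwritten in the dictionaries.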
My file1 is sorted by line[0] and has columns 0-15.
contig17 GRMZM2G052619_P03 98 109 2 0 15 67 78.8 0 127 5 420 0 304 45
contig33 AT2G41790.1 98 420 2 0 21 23 78.8 1 127 5 420 2 607 67
contig98 GRMZM5G888620_P01 87 470 1 0 17 28 78.8 1 127 7 420 2 522 18
contig102 GRMZM5G886789_P02 73 115 1 0 34 45 78.8 0 134 5 421 0 456 50
contig123 AT3G57470.1 83 201 2 1 12 43 78.8 0 134 9 420 0 305 50
My file2 is not sorted and has columns 0-10.
GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07525 1
GRMZM5G888620 GRMZM5G888620_P01 1 2367 GO:0011551 DNA binding "Any molecular function by which a gene product interacts selectively and non-covalently with DNA" [GOC:jl] molecular_function PF07589 4
GRMZM5G886789 GRMZM5G886789_P02 1 4567 GO:0055516 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07526 0
My expected output:
contig17 GRMZM2G052619_P03 GO:0043531 ADP binding molecular_function PF07525
contig98 GRMZM5G888620_P01 GO:0011551 DNA binding molecular_function PF07589
contig102 GRMZM5G886789_P02 GO:0055516 ADP binding molecular_function PF07526
grep is not sufficient; I had not noticed that the output contains columns from both files. You could use join, but the input is not sorted: join -j 2 -o 1.1 2.2 2.3 file1 file2. - devnull
csv module. - dilbert
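As a rough illustration of the csv module suggestion, here is a minimal sketch that reads both tab-separated inputs with csv.reader and keeps every row per ID instead of only the last one. The function names read_rows_by_id and join_on_id are hypothetical, the default column indices are simply copied from the code in the question, and the sample assumes file2 really has 11 tab-separated columns with the quoted description and [GOC:jl] in separate fields.

```python
import csv
import io
from collections import defaultdict


def read_rows_by_id(handle, key_column=1):
    """Group tab-separated rows by the ID in `key_column`, keeping duplicates."""
    rows_by_id = defaultdict(list)
    # QUOTE_NONE treats the double quotes in the description as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) > key_column:
            rows_by_id[row[key_column]].append(row)
    return rows_by_id


def join_on_id(handle_a, handle_b, out_handle,
               cols_a=(0, 1), cols_b=(4, 5, 9, 10), key_column=1):
    """Write one line per matching (row_a, row_b) pair; return the line count."""
    rows_a = read_rows_by_id(handle_a, key_column)
    needed = max(max(cols_b), key_column)
    written = 0
    # The second file is streamed, so only the first file is held in memory.
    reader_b = csv.reader(handle_b, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row_b in reader_b:
        if len(row_b) <= needed:
            continue  # skip short or blank lines
        for row_a in rows_a.get(row_b[key_column], ()):
            fields = [row_a[i] for i in cols_a] + [row_b[i] for i in cols_b]
            out_handle.write("\t".join(fields) + "\n")
            written += 1
    return written


# In-memory stand-ins for file1.txt and file2.txt (shortened rows).
file1_text = (
    "contig17\tGRMZM2G052619_P03\t98\t109\n"
    "contig33\tAT2G41790.1\t98\t420\n"
)
file2_text = (
    "GRMZM2G052619\tGRMZM2G052619_P03\t4\t2345\tGO:0043531\tADP binding\t"
    "\"Interacting selectively and non-covalently with ADP\"\t[GOC:jl]\t"
    "molecular_function\tPF07525\t1\n"
)
result = io.StringIO()
matches = join_on_id(io.StringIO(file1_text), io.StringIO(file2_text), result)
print(matches)
print(result.getvalue(), end="")
```

With real files, the StringIO objects would be replaced by handles opened with newline="", and cols_b can be changed if the wanted columns sit at different positions.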