Efficiently finding the intersection of two huge dictionaries

3
I wrote some code that looks for common IDs in line[1] of two different files. My input files are large (2 million lines). If I split the input into many small files, I get more intersecting IDs, whereas if I run the whole file at once the intersection is much smaller. I can't figure out why — can you suggest what is wrong, and how to improve this code to avoid the problem?
fileA = open("file1.txt",'r')
fileB = open("file2.txt",'r')
output = open("result.txt",'w')

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in dictB:
    if key in dictA:
        output.write(dictA[key][0]+'\t'+dictA[key][1]+'\t'+dictB[key][4]+'\t'+dictB[key][5]+'\t'+dictB[key][9]+'\t'+dictB[key][10])

My file1 is sorted by line[0] and has fields 0-15.

contig17    GRMZM2G052619_P03  98 109 2 0 15 67 78.8 0 127 5 420 0 304 45
contig33    AT2G41790.1        98 420 2 0 21 23 78.8 1 127 5 420 2 607 67
contig98    GRMZM5G888620_P01  87 470 1 0 17 28 78.8 1 127 7 420 2 522 18  
contig102   GRMZM5G886789_P02  73 115 1 0 34 45 78.8 0 134 5 421 0 456 50  
contig123   AT3G57470.1        83 201 2 1 12 43 78.8 0 134 9 420 0 305 50

My file2 is not sorted and has fields 0-10.

GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07525  1        
GRMZM5G888620 GRMZM5G888620_P01 1 2367 GO:0011551 DNA binding "Any molecular function by which a gene product interacts selectively and non-covalently with DNA" [GOC:jl] molecular_function PF07589  4    
GRMZM5G886789 GRMZM5G886789_P02 1 4567 GO:0055516 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07526 0    

My expected output:

contig17    GRMZM2G052619_P03  GO:0043531 ADP binding molecular_function PF07525
contig98    GRMZM5G888620_P01  GO:0011551 DNA binding molecular_function PF07589 
contig102   GRMZM5G886789_P02  GO:0055516 ADP binding molecular_function PF07526  

@user3224522 grep is not enough here — I don't see how its output would contain columns from both files. You could use join, but the input is not sorted: join -j 2 -o 1.1 2.2 2.3 file1 file2 - devnull
@user3224522 You are asking the same question a second time: http://stackoverflow.com/questions/23385685/finding-common-ids-intersection-in-two-dictionaries — don't do that. - Jan Vlcinsky
1
@JanVlcinsky, it is my question — I changed it because some users complained that a new question should be self-contained, so I had to rewrite the whole context. - user3224522
1
If the input data is tab-separated, consider using Python's csv module. - dilbert
My line[0] corresponds to column 1, not to a line, and line[1] to column 2. Can you suggest a better way to parse my files so that no IDs are lost? Thanks. - user3224522
2 Answers

2
I suggest you use pandas for this kind of problem. Here is a simple proof of concept with pandas:
import pandas as pd  #install this, and read the docs
from StringIO import StringIO #you don't need this; it only simulates the files here

#simulating reading the first file
first_file = """contig17 GRMZM2G052619_P03 x
contig33 AT2G41790.1 x
contig98 GRMZM5G888620_P01 x
contig102 GRMZM5G886789_P02 x
contig123 AT3G57470.1 x"""

#simulating reading the second file
second_file = """y GRMZM2G052619_P03 y
y GRMZM5G888620_P01 y
y GRMZM5G886789_P02 y"""

#here is how you open the files. Instead of using StringIO
#you will simply give the file path. Give the correct separator,
#sep="\t" (for tabular data). Here I'm using a space.
#In names, put some relevant names for your columns
f_df = pd.read_table(StringIO(first_file), 
                     header=None, 
                     sep=" ", 
                     names=['a', 'b', 'c'])
s_df = pd.read_table(StringIO(second_file), 
                     header=None, 
                     sep=" ", 
                     names=['d', 'e', 'f'])
#this is the hard bit. Here I am using a bit of my experience with pandas.
#Basically it selects the rows of the second data frame whose second-column
#values appear ("isin") in the second column of the first data frame.
my_df = s_df[s_df.e.isin(f_df.b)]

Output:

Out[180]:
    d   e                   f
0   y   GRMZM2G052619_P03   y
1   y   GRMZM5G888620_P01   y
2   y   GRMZM5G886789_P02   y
#you can save this with:
my_df.to_csv("result.txt", sep="\t")

Cheers!
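The `isin` filter above keeps rows from only one frame, while the asker's expected output mixes columns from both files. An inner `merge` on the ID columns is one way to get that — a minimal sketch using the same kind of toy data as the answer (the column names `a`..`f` are placeholders, and `io.StringIO` is the Python 3 location of `StringIO`):

```python
import pandas as pd
from io import StringIO  # Python 3; the answer above uses the Python 2 StringIO

first_file = """contig17 GRMZM2G052619_P03 x
contig98 GRMZM5G888620_P01 x"""
second_file = """y GRMZM2G052619_P03 GO:0043531
y GRMZM5G886789_P02 GO:0055516"""

f_df = pd.read_table(StringIO(first_file), header=None, sep=" ",
                     names=['a', 'b', 'c'])
s_df = pd.read_table(StringIO(second_file), header=None, sep=" ",
                     names=['d', 'e', 'f'])

# an inner merge keeps only IDs present in both frames,
# and the result carries the columns of both sides
merged = f_df.merge(s_df, left_on='b', right_on='e', how='inner')
print(merged[['a', 'b', 'f']])
```

With the real files, selecting the wanted columns before `to_csv` would reproduce the expected output directly.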


Note: pandas is worth installing. It opens huge amounts of data easily and with a high degree of control, in the blink of an eye. - tbrittoborges

1
This is almost the same, but inside a function.
#Creates a function to do the reading for each file
def read_store(file_, dictio_):
    """Given a file name and a dictionary, store each line of the
    file in the dictionary, keyed by its value in the second column."""
    import re
    with open(file_, 'r') as file_0:
        lines_file_0 = file_0.readlines()  # not fileA: that name is outside the function
    for line in lines_file_0:
        # match the second whitespace-separated field: whatever run of
        # letters, numbers or underscores follows the first field
        match = re.match(r"^\S+\s+(\w+)", line)
        if match:
            dictio_[match.group(1)] = line  # re.findall would return an unhashable list

Usage:

file1 = {}
read_store("file1.txt", file1)

Then compare as usual — though I would split on \s rather than \t. It also splits between words, but they are easy to reassemble with " ".join(DictA[1:5]).
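For completeness, the comparison step this answer describes can also be sketched without regular expressions, splitting on whitespace — a minimal illustration with toy lines standing in for the real files (the helper name `store_by_col2` and the sample rows are mine):

```python
def store_by_col2(lines, store):
    """Key each line's split fields by the second whitespace-separated column."""
    for line in lines:
        fields = line.split()
        if len(fields) > 1:
            store[fields[1]] = fields

# toy rows standing in for file1.txt and file2.txt
file1_lines = ["contig17 GRMZM2G052619_P03 98 109 2 0",
               "contig33 AT2G41790.1 98 420 2 0"]
file2_lines = ["GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP"]

dictA, dictB = {}, {}
store_by_col2(file1_lines, dictA)
store_by_col2(file2_lines, dictB)

# set intersection of the two key views, then columns from both sides
for key in dictA.keys() & dictB.keys():
    print("\t".join(dictA[key][:2] + dictB[key][4:6]))
```

Because `dict.keys()` behaves as a set, the `&` intersection avoids the explicit `if key in dictA` membership test of the original code.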
