Here is an example of a complex tab-delimited file that I am trying to parse:
ENTRY map0010\tNAME Glycolysis\tDESCRIPTION Glycolysis is the process of converting glucose into pyruvate\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance\tH00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094
ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism
I am trying to get output that contains only certain fields, for example:
ENTRY map0010\tNAME Glycolysis\tCLASS Metabolism\tDISEASE H00071 Hereditary fructose intolerance H00072 Pyruvate dehydrogenase complex deficiency\tDBLINKS GO: 0006096 0006094\tNA
ENTRY map00020\tNAME Citrate cycle (TCA cycle)\tCLASS Metabolism; Carbohydrate Metabolism\tDISEASE H00073 Pyruvate carboxylase deficiency\tDBLINKS GO: 0006099\tREL_PATHWAY map00010 Glycolysis / Gluconeogenesis\tmap00053 Ascorbate and aldarate metabolism
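The main problem is that not all lines contain the same number of fields, so I need to remove, for example, the field containing the string "DESCRIPTION", and add an empty field in lines where a field such as "CLASS" is missing.
In addition, for some fields the data is split across several tab-separated parts (for example, in line 1 the field right after DISEASE also contains disease data!), and I need to merge them.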
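The requirements above (drop unwanted fields, merge continuation fields, fill in missing ones) could be sketched roughly as follows. This is only a sketch under assumptions that are not stated in the question: a field starts a new key when its first word consists solely of capital letters and underscores, continuation fields are joined to the preceding key with a single space, and missing keys are written as NA. The names select_fields, WANTED and KEY_RE are hypothetical.

```python
import re

# Fields to keep, in output order (taken from the desired output above).
WANTED = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]

# Assumption: a key is a first word made only of capital letters/underscores.
KEY_RE = re.compile(r"[A-Z_]+")

def select_fields(line, wanted=WANTED, missing="NA"):
    """Return one output row containing only the wanted fields."""
    merged = {}
    current = None
    for field in line.rstrip("\n").split("\t"):
        first_word = field.partition(" ")[0]
        if KEY_RE.fullmatch(first_word):
            # A new key starts here (this also catches DESCRIPTION,
            # so its text is not glued onto the previous field).
            current = first_word
            merged[current] = field
        elif current is not None:
            # Continuation of the previous key, e.g. a second disease.
            merged[current] += " " + field
    # Keep only the wanted keys; fill gaps with the placeholder.
    return "\t".join(merged.get(key, missing) for key in wanted)
```

Applied to the first sample line, this should yield the first desired output row, with NA in place of the absent REL_PATHWAY field. A continuation field that happens to begin with an all-capitals word would be mistaken for a key, so an explicit list of every key in the file may be safer on real data.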
I tried the following:
input = open('file', 'r')
dict = ["ENTRY", "NAME", "CLASS", "DISEASE", "DBLINKS", "REL_PATHWAY"]
split_tab = []
output = []
for line in input:
    split_tab.append(line.split('\t'))
for item in dict:
    for element in split_tab:
        if item in element:
            output.append(element)
        else:
            output.append('\tNA\t')
But it keeps everything, not just the fields named in the dict list. Can anyone help?
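Judging only from the code as posted, two things seem to work against the filter. First, `item in element` on a list tests whether some field is exactly equal to the key, not whether a field starts with it. Second, `output.append(element)` appends the entire split row rather than the one matching field. The small illustration below uses a made-up three-field row to show the difference between the exact membership test and a comparison on the first word of each field.

```python
row = "ENTRY map0010\tNAME Glycolysis\tCLASS Metabolism".split("\t")

# Membership on a list compares whole fields, so a bare key is not found.
exact = "NAME" in row

# Comparing the first word of each field selects just the matching field.
by_prefix = [field for field in row if field.split(" ", 1)[0] == "NAME"]

print(exact)      # False
print(by_prefix)  # ['NAME Glycolysis']
```

If the real file separates key and value with a tab, the exact test would succeed instead, but the whole row would still be appended, which would match the "keeps everything" symptom.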