Printing the contents of a specific line in a file with Python

Question

Background:

                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    1   2664     304    0.822 0.00714        0.808        0.836
    2   2360     104    0.786 0.00766        0.771        0.801
    3   2256      48    0.769 0.00787        0.754        0.784
    4   2208      40    0.755 0.00803        0.739        0.771
    5   2256      48    0.769 0.00787        0.754        0.784
    6   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    1   2664     304    0.822 0.00714        0.808        0.836
    2   2360     104    0.786 0.00766        0.771        0.801
    3   2256      48    0.769 0.00787        0.754        0.784
    4   1000      40    0.744 0.00803        0.739        0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.

I would like to convert the file above into the following output, which looks simple enough:

Gene1  0.755
Gene2  0.744

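That is, each gene together with the last number in the survival column of its section.

I have tried several approaches, including regular expressions and reading the file into a list and using ".next()". Here is one piece of code I tried: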

fileopen = open(sys.argv[1]).readlines()  # Read in the file as a list.
for index,line in enumerate(fileopen):   # Enumerate items in list
    if "Table" in line:  # Find the items with "Table" (This will have my gene name)
            line2 = line.split("=")[1]  # Parse line to get my gene name
            if "\n" in fileopen[index+1]: # This is the problem section.
                print fileopen[index]
            else:
                fileopen[index+1]

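As you can see in the problem section, what I was trying to express in this attempt is:

If the next item in the list is a newline, print that item; otherwise, the next line becomes the current line (and I can then split that line to pull out the specific number I want).

If anyone could correct the code so that I can see what I did wrong, I would really appreciate it.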

- user1288515

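It is not very clear what you are trying to achieve with the last line, fileopen[index+1]. Could you explain your intent? - miindlek

Yes, sorry. What I meant was: "First, find the line that contains Table. Second, iterate over every line in the file. If the line after the current one (i.e. fileopen[index+1]) is "\n", print the current line (fileopen[index])." That would give me the line with "Gene" and the line just before the newline, which contains the score I want. (I know how to parse the score line so that I extract only the score I need.) - user1288515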
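
The logic described in the comment above (remember the gene name from the most recent "Table" line, then report the row that sits immediately before a blank line) can be sketched roughly as follows. This is a minimal illustration rather than the asker's code: the function name `last_survival_per_gene` and the shortened inline sample are assumptions, and the sketch also treats the end of the file as the end of a section, so a missing trailing blank line does not drop the last gene.

```python
def last_survival_per_gene(lines):
    """Return (gene, survival) pairs, one per table section.

    Hypothetical helper: assumes each section starts with a line containing
    "Table" and "=", and that survival is the 4th whitespace-separated column.
    """
    results = []
    gene = None
    for index, line in enumerate(lines):
        if "Table" in line:
            # Remember the gene name and keep scanning the following lines.
            gene = line.split("=")[1].strip()
            continue
        # Is this the very last line, or is the next line blank?
        at_end = index + 1 == len(lines)
        next_is_blank = not at_end and lines[index + 1].strip() == ""
        if gene is not None and line.strip() and (at_end or next_is_blank):
            results.append((gene, line.split()[3]))
            gene = None  # wait for the next "Table" line
    return results


# Shortened sample in the same layout as the question's data (assumption).
sample = [
    "Table$Gene=Gene1\n",
    " time n.risk n.event survival std.err lower 95% CI upper 95% CI\n",
    "    3   2256      48    0.769 0.00787        0.754        0.784\n",
    "    4   2208      40    0.755 0.00803        0.739        0.771\n",
    "\n",
    "Table$Gene=Gene2\n",
    " time n.risk n.event survival std.err lower 95% CI upper 95% CI\n",
    "    4   1000      40    0.744 0.00803        0.739        0.774\n",
]
for gene, survival in last_survival_per_gene(sample):
    print(gene, survival)
```

The key difference from the attempt in the question is that the look-ahead at `index + 1` happens on every data line, not only on the "Table" line itself.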

5 Answers

0

I tried this, and it works:

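import sys
with open(sys.argv[1]) as f:  # same input file as in the question
    filelines = f.readlines()
gene = 1  # assumes the genes are numbered consecutively from 1
for i in range(len(filelines)):
    # a blank line ends a table, so the row before it is the last data row
    if filelines[i].strip() == "":
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1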

- laurencevs

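Thank you very much. These are all great approaches and they really helped me a lot. I tried to click the check mark next to each of them to show that all the methods are good, but it seems I can only mark one as the "correct" answer. Since they are all great, I did not pick any. I can't upvote yet because I'm new here. But thank you very much, I appreciate your help. - user1288515

I have upvoted your question, so you should now be able to upvote answers (you now have 15+ reputation). - laurencevs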

0

You could try something like this (I copied your data into foo.dat):

In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:

Using with ensures that the file is closed once it has been read.

In [3]: lines = [ln.strip() for ln in lines]

This strips the extra whitespace.

In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]

In [6]: startgenes
Out[6]: [0, 10]

In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]

In [8]: emptylines
Out[8]: [9, 17]

Using emptylines relies on the records being separated by lines that contain only whitespace.

In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]

In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # survival is the 4th column
   ....:     print(gene, num)
   ....:     
Gene1 0.755
Gene2 0.744
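
Because this approach depends on blank lines separating the records, another way to sketch the same idea is to split the whole text on blank lines and handle each block on its own. The snippet below is an illustration under assumptions: `parse_blocks` is a hypothetical name, the sample text is a shortened copy of the question's data, and it reads the survival column (index 3), which is the value the question asks for.

```python
import re


def parse_blocks(text):
    """Map each gene name to the survival value in the last row of its block.

    Hypothetical helper: assumes blocks are separated by one or more lines
    that are empty or contain only spaces/tabs.
    """
    result = {}
    # Split on a newline followed by one or more whitespace-only lines.
    for block in re.split(r"\n(?:[ \t]*\n)+", text.strip()):
        rows = [ln for ln in block.splitlines() if ln.strip()]
        if not rows or "=" not in rows[0]:
            continue  # not a table block
        gene = rows[0].split("=")[1].strip()
        result[gene] = rows[-1].split()[3]
    return result


# Shortened sample in the same layout as the question's data (assumption).
text = """\
                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    5   2256      48    0.769 0.00787        0.754        0.784
    6   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    3   2256      48    0.769 0.00787        0.754        0.784
    4   1000      40    0.744 0.00803        0.739        0.774
"""
for gene, survival in parse_blocks(text).items():
    print(gene, survival)
```

Splitting into blocks first means the last table is handled even when the file does not end with a blank line.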

- Roland Smith

0

Here is my solution:

>>> with open('t.txt','r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1].strip()  # gene name without the line ending
...         elif l.strip():
...             surv = l.split()[3]  # survival column of the current row
...         else:
...             print(gene, surv)  # blank line: the section has ended
...
Gene1 0.755
Gene2 0.744

- fredtantini

0

Don't check for newlines; just print once the file has been read.

with open("testgenes.txt") as f:
    lines = f.readlines()
table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "":  # print previous survival
            print(table, finalsurvival)
        table = line.strip().split('=')[1]
    else:
        try:
            # split on any whitespace; survival is the 4th column
            finalsurvival = line.split()[3]
        except IndexError:  # blank line
            continue
print(table, finalsurvival)

- Rhand

- Joop · Accepted Answer

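Using an existing package such as pandas to read the csv file, instead of hand-writing a parser for every data item, may be somewhat overkill, but you only need to write a small amount of code to specify the relevant rows in the file. Unoptimized code (it reads the file twice):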

import pandas as pd
def genetable(gene):
    with open('gene.txt') as f:
        l = f.readlines()
    l.append("\n")  # add newline to end of file in case last line is not newline
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene"+str(gene) in line:
            skiprows = i+1
        if skiprows>=0 and line=="\n":
            skipfooter = lines - i - 1
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            #  assuming tab separated data given your inputs. change as needed
            # assert df.columns.....
            return df
    return "Not Found"

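This reads a DataFrame containing all the relevant data from the file.

You can then do:

genetable(2).survival  # series with all survival rates
genetable(2).survival.iloc[-1]  # last item in survival

The advantage of this is that you have access to every item, and any formatting errors in the file are more likely to be detected, which prevents incorrect values from being used. If this were my code, I would add an assertion on the column names before returning the pandas DataFrame. It is best to catch any parsing error as early as possible so that it does not propagate.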
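
The column-name check mentioned above could look roughly like the following. It is a pandas-free sketch of the same idea, applied to the raw header line before any values are read; `EXPECTED_COLUMNS` and `check_header` are names assumed for illustration, and the tab separator follows the answer's own assumption about the file.

```python
# Column names taken from the header row shown in the question.
EXPECTED_COLUMNS = [
    "time", "n.risk", "n.event", "survival",
    "std.err", "lower 95% CI", "upper 95% CI",
]


def check_header(header_line, sep="\t"):
    """Raise ValueError if the header does not list the expected columns.

    Hypothetical helper: assumes a tab-separated header, as in the answer.
    """
    columns = [name.strip() for name in header_line.strip().split(sep)]
    if columns != EXPECTED_COLUMNS:
        raise ValueError("unexpected columns: %r" % (columns,))
    return columns


# Illustrative tab-separated header (assumption about the real file).
header = "time\tn.risk\tn.event\tsurvival\tstd.err\tlower 95% CI\tupper 95% CI\n"
print(check_header(header).index("survival"))
```

With a DataFrame, the equivalent check would compare `list(df.columns)` against the same list before returning.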