How can I speed up a Python script?

Question

6

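I'm a novice Python programmer and wrote a (probably very ugly) script that randomly selects a subset of sequences from a fastq file. A fastq file stores its information in chunks of four lines each, and the first line of each chunk starts with the character "@". The fastq file I use as input is 36 GB and contains about 14,000,000 lines.

I tried to rewrite an existing script that used far too much memory, and I managed to reduce the memory usage. But the script takes a very long time to run, and I can't see why.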

import argparse
import random
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("infile", type = str, help = "The name of the fastq input file.", default = sys.stdin)
parser.add_argument("outputfile", type = str, help = "Name of the output file.")
parser.add_argument("-n", help="Number of sequences to sample", default=1)
args = parser.parse_args()


def sample():
    linesamples = []
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0,int(seqs)), int(args.n))
    # make a list of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.append(int(4*i+0))
        linesamples.append(int(4*i+1))
        linesamples.append(int(4*i+2))
        linesamples.append(int(4*i+3))
    # fetch lines from input file and write them to output file.
    for i, line in enumerate(infile):
        if i in linesamples:
            outputfile.write(line)

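The grep step takes practically no time at all, but after more than 500 minutes the script still hasn't started writing to the output file. So I suppose it is one of the steps between grep and the final for loop that takes so long. But I don't understand which step exactly, or what I can do to speed it up.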

- Sandra

7

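You should consider using a profiler to check which steps of the program are stalling. Also, try running the code on a smaller file to see whether it runs to completion. Another optimization step is to consider using threads and multiprocessing to split up the task. - Jerome Anthony

Don't call int repeatedly inside the loop. Also, use a with statement. - Veedrac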

4 Answers

1

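You said grep runs very quickly, so in that case use grep not only to count the occurrences of the @ character, but also to output the byte offset at which each @ character is seen (using grep's -b option). Then use random.sample to pick whichever chunks you want. Once you have chosen the byte offsets, use infile.seek to go to each byte offset and print out 4 lines from there.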
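
The offset-and-seek idea can be sketched roughly as follows. This is only an illustration, written for Python 3: instead of parsing the output of grep -b, it collects the byte offsets with a plain Python scan (a swapped-in stand-in, not the grep call itself), and it assumes every record is exactly four lines. The names record_offsets and sample_by_offset are hypothetical.

```python
import random


def record_offsets(path):
    """Return the byte offset at which each 4-line FASTQ record starts.

    Pure-Python stand-in for the offsets that grep -b would report;
    assumes every record is exactly four lines long.
    """
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                offsets.append(position)
            position += len(line)
    return offsets


def sample_by_offset(in_path, out_path, n, seed=None):
    """Copy n randomly chosen records by seeking straight to each one."""
    rng = random.Random(seed)
    offsets = record_offsets(in_path)
    # Sorting the chosen offsets keeps the seeks moving forward in the file.
    chosen = sorted(rng.sample(offsets, n))
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for offset in chosen:
            src.seek(offset)
            for _ in range(4):
                dst.write(src.readline())
    return chosen
```

Opening the file in binary mode keeps the offsets in bytes, which is what seek expects.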

- randomusername

0

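You can use the reservoir sampling algorithm. With this algorithm you read through the data only once (there is no need to count the lines of the file in advance), so you can pipe the data into your script. The Wikipedia page has example Python code.

There is also a C implementation for fastq sampling in Heng Li's seqtk.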
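
For illustration, a minimal reservoir sampling sketch (the classic "Algorithm R") might look like the code below. It is not the Wikipedia code or seqtk's implementation; it is written for Python 3, and the helper names read_records and reservoir_sample are hypothetical. It assumes well-formed input with four lines per record.

```python
import itertools
import random


def read_records(handle):
    """Yield FASTQ records as tuples of four consecutive lines."""
    while True:
        record = tuple(itertools.islice(handle, 4))
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        yield record


def reservoir_sample(records, k, rng=random):
    """Return up to k records chosen uniformly at random in a single pass."""
    reservoir = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace an existing entry with probability k / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir
```

A script built on this could call reservoir_sample(read_records(sys.stdin), k) and write each returned record with sys.stdout.writelines, so the input can be piped in. Only k records are held in memory at any time.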

- A.P.

0

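Try to parallelize your code. What I mean is this: you have 14,000,000 lines of input.

First grep and filter your input, and write the result to filteredInput.txt
Split the filtered file into files of 10,000-100,000 lines each, such as filteredInput001.txt, filteredInput002.txt
Run your code on these split files, writing the output to separate files such as output001.txt, output002.txt
Finally, merge the results.

Since your code is not working at all, you may need to run it on these filtered inputs. Your code would check whether the filteredInput files exist, work out which step it had reached, and resume from that step.

After the first step, you can also run multiple python processes this way, using the shell or python threads.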

- Atilla Ozgur

1

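Suggesting parallelization before optimizing the algorithm is probably not a good idea. With the right algorithm, IO will be the bottleneck, not the CPU. - cel

@cel His code does not even work right now, but splitting up the problem and parallelizing it is not a good idea. - Atilla Ozgur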

Content provided by Stack Overflow.

- vikramls · Accepted Answer

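Depending on the size of linesamples, if i in linesamples will take a long time, because you are searching through the entire list on every iteration over infile. You can convert it to a set to speed up the lookups. Also, enumerate is not very efficient - I've replaced it with a line_num construct that we increment on each iteration.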

def sample():
    linesamples = set()
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0,int(seqs)), int(args.n))
    # build the set of line numbers that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.add(4*i+0)
        linesamples.add(4*i+1)
        linesamples.add(4*i+2)
        linesamples.add(4*i+3)
    # fetch lines from input file and write them to output file.
    line_num = 0
    for line in infile:
        if line_num in linesamples:
            outputfile.write(line)
        line_num += 1
    infile.close()
    outputfile.close()
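
For readers on Python 3, the same set-based idea could be sketched as below. This is an assumption-laden variant rather than the code from the answer: it uses with statements as suggested in the comments, and it counts records by dividing the line count by four instead of calling grep -c "@", since "@" can also occur in fastq quality strings. It keeps enumerate for brevity; a manual counter as in the answer works equally well. The function names are hypothetical.

```python
import random


def count_records(path):
    """Count FASTQ records, assuming exactly four lines per record."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle) // 4


def sample(in_path, out_path, n, seed=None):
    """Write n randomly chosen FASTQ records from in_path to out_path."""
    rng = random.Random(seed)
    # A set of record indices gives constant-time membership tests on average.
    chosen = set(rng.sample(range(count_records(in_path)), n))
    with open(in_path, "rb") as infile, open(out_path, "wb") as outfile:
        for line_num, line in enumerate(infile):
            # line_num // 4 is the index of the record this line belongs to.
            if line_num // 4 in chosen:
                outfile.write(line)
    return chosen
```

Storing record indices rather than individual line numbers makes the set four times smaller than linesamples while selecting the same lines.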