Improving the speed of a Python script

Question

python

5

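I have an input file containing a list of strings.

I iterate over every fourth line, starting from the second line.

From each of those lines I build a new string from the first six and the last six characters, and I write it to the output file only if that new string is unique.

The code I wrote does this, but I am working with very large deep-sequencing files. It has been running for a day and has made little progress, so I am looking for any suggestions to speed it up, if possible. Thanks.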

def method():
    target = open(output_file, 'w')

    with open(input_file, 'r') as f:
        lineCharsList = []

        for line in f:
            #Make string from first and last 6 characters of a line
            lineChars = line[0:6]+line[145:151] 

            if not (lineChars in lineCharsList):
                lineCharsList.append(lineChars)

                target.write(lineChars + '\n') #If string is unique, write to output file

            for skip in range(3): #Used to step through four lines at a time
                try:
                    check = line    #Check for additional lines in file
                    next(f)
                except StopIteration:
                    break
    target.close()

- The Nightman

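My guess is that the script gets very slow once lineCharsList grows large. I don't have a suggestion, but that is probably the problem. - Loocid

That's what I was thinking too. Since I'm running on a compute cluster, RAM shouldn't be an issue, but I'm not sure whether there is a better approach than storing everything in a list. - The Nightman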

1

Also, you can include the output file in the with statement: with open(input_file, 'r') as f, open(output_file, 'w') as target:. - wwii

Which version of Python are you using? - Veedrac

4 Answers

5

You can use itertools.islice (https://docs.python.org/2/library/itertools.html#itertools.islice):

import itertools

def method():
    with open(input_file, 'r') as inf, open(output_file, 'w') as ouf:
        seen = set()
        for line in itertools.islice(inf, None, None, 4):
            # Strip the trailing newline first; otherwise line[-6:] would include '\n'.
            s = line[:6] + line.rstrip('\n')[-6:]
            if s not in seen:
                seen.add(s)
                ouf.write("{}\n".format(s))

- dting

2

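In addition to using a set as Oscar suggested, you can use islice to skip lines instead of using a for loop.

As noted in this post, islice advances the iterator in C, so it should be considerably faster than stepping through the lines with a plain Python for loop.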
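As a rough illustration of the start argument, the sketch below selects every fourth line beginning at the second line, which is what the question describes. It assumes Python 3; the helper name every_fourth_line and the two-record sample are made up for the example and are not part of the original thread.

```python
import io
import itertools

def every_fourth_line(lines, start=1):
    """Return an iterator over every fourth line, beginning at zero-based index `start`."""
    # islice(iterable, start, stop, step): start=1 means "begin at the second line".
    return itertools.islice(lines, start, None, 4)

# Hypothetical two-record sample standing in for the real input file.
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
picked = [line.rstrip("\n") for line in every_fourth_line(sample)]
print(picked)  # ['ACGT', 'TTGA']
```

Passing start=0 instead would reproduce the behaviour of the code in the question, which begins at the first line.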

- lightalchemist

1

Try replacing the string concatenation line[0:6]+line[145:151] with the following:

lineChars = ''.join([line[0:6], line[145:151]])

This may be more efficient, depending on the situation.
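Whether join actually helps is easiest to settle by measuring. The sketch below times both forms with timeit on a made-up 151-character line; the sample line and the function names are assumptions for illustration, and the timings will vary by machine and Python version.

```python
import timeit

# Hypothetical 151-character read followed by a newline.
line = "ACGTAC" + "N" * 139 + "TTGACA" + "\n"

def with_plus():
    return line[0:6] + line[145:151]

def with_join():
    return ''.join([line[0:6], line[145:151]])

# Both forms must build the same string before their speed is compared.
assert with_plus() == with_join()

for fn in (with_plus, with_join):
    print(fn.__name__, timeit.timeit(fn, number=200_000))
```

With only two pieces to combine, any difference is likely to be small next to the cost of the membership test on a large list.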

- Doug

- Óscar López · Accepted Answer

Try defining lineCharsList as a set instead of a list:

lineCharsList = set()
...
lineCharsList.add(lineChars)

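This will improve the performance of the in operator, because a membership test on a set takes roughly constant time on average, whereas a list has to be scanned element by element. Also, if memory really is not a concern, you may want to accumulate all the output in a list and write it once at the end, rather than performing many write() calls.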
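Both suggestions can be combined in one function. This is a minimal sketch, assuming Python 3 and the fixed slice positions from the question; the name dedupe_keys and the in-memory sample are hypothetical and only there to make the example runnable.

```python
import io

def dedupe_keys(inf, ouf):
    """Write each unique key once, in first-seen order, and return how many were written."""
    seen = set()   # fast membership test
    out = []       # output accumulated in memory and written once at the end
    for i, line in enumerate(inf):
        if i % 4:  # keep lines 0, 4, 8, ... as the code in the question does
            continue
        key = line[0:6] + line[145:151]
        if key not in seen:
            seen.add(key)
            out.append(key + "\n")
    ouf.writelines(out)
    return len(out)

# Hypothetical sample: three four-line records, the third repeating the first read.
read_a = "AAAAAA" + "N" * 139 + "CCCCCC"
read_b = "GGGGGG" + "N" * 139 + "TTTTTT"
records = [read_a, "x", "y", "z", read_b, "x", "y", "z", read_a, "x", "y", "z"]
src = io.StringIO("".join(r + "\n" for r in records))
dst = io.StringIO()
count = dedupe_keys(src, dst)
print(count)  # 2
```

With real files, the same function could be called inside with open(input_file) as inf, open(output_file, 'w') as ouf:. Keeping the out list alongside the set roughly doubles the memory used for keys, so it only makes sense when memory is plentiful.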