Fastest way to delete a line from a large file in Python

Question

python optimization

26

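I'm working with a very large text file (about 11 GB) on a Linux system. I'm running it through a program that checks the file for errors. Once an error is found, I need to either fix the line or delete it entirely. Then I repeat the process...

Eventually, once I'm comfortable with the process, I'll automate it entirely. For now, though, let's assume I'm doing this by hand.

What is the fastest way, in terms of execution time, to remove a specific line from this large file? I was thinking of doing it in Python, but I'm open to other examples. The line to delete could be anywhere in the file.

If you use Python, assume the following interface: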

def removeLine(filename, lineno):

Thanks,

-aj

- AJ.

3

Using grep -v would probably be faster than using Python. - dangerstat

1

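Is a scripted solution absolutely necessary? Large Text File Viewer (http://www.swiftgear.com/ltfviewer/features.html) should be able to handle the file, and you can use a regular expression to search for the right line. - Dawson Goodell

@dangerstat - Thanks, but I'm not deciding which line to delete by matching a pattern. I already know the exact line number to delete. - AJ.

AJ: sed does exactly what you need. Take a look at the d command. - Mark Byers

Instead of repeating the process, could you do everything in one pass? That would be much more efficient. - John La Rooy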

9 Answers

8

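Modify the file in place, overwriting the offending line with spaces so that the rest of the file doesn't need to be shuffled around on disk. You can also "fix" the line in place, as long as the fix is no longer than the line being replaced.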

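import os
from mmap import mmap

def removeLine(filename, lineno):
    # lineno is 1-based; the line is overwritten with spaces, not removed
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    p = 0
    for i in range(lineno - 1):
        p = m.find(b'\n', p) + 1    # start of the next line
    q = m.find(b'\n', p)            # end of the target line
    if q == -1:                     # last line without a trailing newline
        q = len(m)
    m[p:q] = b' ' * (q - p)         # mmap works on bytes in Python 3
    m.close()
    os.close(f)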

If the other program can be changed to output file offsets instead of line numbers, you can assign that offset to p directly and skip the for loop.
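A minimal sketch of that offset-based variant, assuming the checking program reports the byte offset at which the bad line starts. The function name blank_line_at_offset is made up for illustration and is not part of the original answer.

```python
import mmap

def blank_line_at_offset(filename, offset):
    """Overwrite the line starting at byte `offset` with spaces, in place.

    Assumption: `offset` is the byte position of the first character of
    the bad line, as reported by some other program.
    """
    with open(filename, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            end = m.find(b"\n", offset)   # newline that ends the line
            if end == -1:                 # last line has no trailing newline
                end = len(m)
            # Same-length replacement, so nothing else in the file moves.
            m[offset:end] = b" " * (end - offset)
            m.flush()
```

The newline itself is left alone, so line numbers reported for the rest of the file stay valid on the next run of the checker.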

- John La Rooy

3

The limitation here is that this won't work with a 32-bit build of Python, because mmap runs out of address space at 4 GB. See https://dev59.com/HnI-5IYBdhLWcg3w0MDx for details. - Scott Griffiths

1

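As far as I know, you can't just open a text file in Python and delete a line. You have to create a new file and copy everything except that line into it. If you know the specific line number, you can do it like this: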

f = open('in.txt')
fo = open('out.txt','w')

ind = 1
for line in f:
    if ind != linenumtoremove:
        fo.write(line)
    ind += 1

f.close()
fo.close()

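Of course, you could check the content of the line instead to decide whether to keep it. I'd also recommend that, if you have a whole list of lines to delete or change, you make all of those changes in a single pass over the file.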
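A rough sketch of the single-pass idea, assuming the line numbers to drop are 1-based and already known. The name remove_lines and the ".new" suffix for the temporary file are illustrative choices, not something this answer specifies.

```python
import os

def remove_lines(filename, linenos):
    """Rewrite `filename` without the 1-based line numbers in `linenos`."""
    skip = set(linenos)               # set lookup keeps the loop cheap
    tmpname = filename + ".new"       # hypothetical temporary file name
    with open(filename, "rb") as src, open(tmpname, "wb") as dst:
        for lineno, line in enumerate(src, 1):
            if lineno not in skip:
                dst.write(line)
    # Swap the new file in only after it has been written completely.
    os.replace(tmpname, filename)
```

For example, remove_lines("in.txt", [3, 17, 42]) copies the file once, however many lines are being dropped.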

- Justin Peel

6

Just a small note: it's usually more convenient to use enumerate() to count iterations in a for loop, e.g. for ind, line in enumerate(f):. - catchmeifyoutry

1

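If the lines are of variable length, then I don't believe there is a better algorithm than reading the file line by line and writing out every line except the ones you don't want.

You can identify those lines by checking some criterion or by keeping a running count of the lines read, and suppress the writing of the lines you don't want.

If the lines are of fixed length and you want to delete a specific line number, you may be able to use seek to move the file pointer... though I doubt you're that lucky.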
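For the fixed-length case, a sketch might look like the following. It assumes every line, including the last, is exactly record_len bytes with the newline counted; the function name and the chunk_size default are illustrative.

```python
import os

def remove_fixed_line(filename, lineno, record_len, chunk_size=1 << 20):
    """Delete 1-based line `lineno` from a file of fixed-length lines.

    Assumption: every line is exactly `record_len` bytes, newline included.
    """
    write_pos = (lineno - 1) * record_len   # start of the line to delete
    read_pos = write_pos + record_len       # start of the line after it
    with open(filename, "r+b") as f:
        f.seek(0, os.SEEK_END)
        if lineno < 1 or read_pos > f.tell():
            raise ValueError("line number out of range")
        # No scanning for newlines: jump straight to the line and shift
        # everything after it back by one record, a chunk at a time.
        while True:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        f.truncate(write_pos)
```

seek avoids reading everything before the target line, but the data after it still has to be moved, so deleting a line near the start of the file remains expensive.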

- Dancrumb

@Dancrumb - Thanks for the suggestion. Unfortunately, the lines/records are variable length. - AJ.

1

Update: a sed solution, as requested by a commenter.

To delete the second line of a file:

sed '2d' input.txt

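Use the -i switch to edit in place. Warning: this is a destructive operation. Read the command's help for information on how to make a backup automatically.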

- Mark Byers

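When Mark says "destructive", he means it really does delete the second line (2d means: line 2, delete). You can combine grep, to find the line number, with sed, to delete it. For example, say you want to delete the line containing the text "danger will danger". You can use dangerline=$(grep -n 'danger will danger' <file> | cut -d: -f1) to get the line number, then add dangerline=$((dangerline + 0)) before the sed call to convert dangerline to an integer. Then use sed -i "${dangerline}d" <file>. - user3622356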

0

import os

def removeLine(filename, lineno):
    infile = open(filename)
    outfile = open(filename + ".new", "w")
    for i, l in enumerate(infile, 1):    # lineno is 1-based
        if i != lineno:
            outfile.write(l)
    infile.close()
    outfile.close()
    os.rename(filename + ".new", filename)

- Matt Joiner

0

If you can use awk, for example assuming the line number is 10:

$ awk 'NR!=10' file > newfile

- ghostdog74

0

I'll offer two options, depending on the lookup criterion (line number or search string):

Line number

def removeLine2(filename, lineNumber):
    with open(filename, 'r+') as outputFile:
        with open(filename, 'r') as inputFile:

            currentLineNumber = 0 
            while currentLineNumber < lineNumber:
                inputFile.readline()
                currentLineNumber += 1

            seekPosition = inputFile.tell()
            outputFile.seek(seekPosition, 0)

            inputFile.readline()

            currentLine = inputFile.readline()
            while currentLine:
                outputFile.write(currentLine)
                currentLine = inputFile.readline()

        outputFile.truncate()

String

def removeLine(filename, key):
    with open(filename, 'r+') as outputFile:
        with open(filename, 'r') as inputFile:
            seekPosition = 0 
            currentLine = inputFile.readline()
            # stop at end of file too, so a missing key cannot loop forever
            while currentLine and not currentLine.strip().startswith('"%s"' % key):
                seekPosition = inputFile.tell()
                currentLine = inputFile.readline()

            outputFile.seek(seekPosition, 0)

            currentLine = inputFile.readline()
            while currentLine:
                outputFile.write(currentLine)
                currentLine = inputFile.readline()

        outputFile.truncate()

- László Papp

0

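I think there was a similar, if not identical, question here before. Reading (and writing) line by line is slow, but you can read a much bigger chunk into memory at once, skip the lines you don't want, and then write the result to a new file as a single block. Repeat until done. Finally, replace the original file with the new one.

Note that when you read a chunk, you need to handle the last line, which may be a partial line, and prepend it to the next chunk.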
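One possible sketch of this chunked approach, including the carry-over of a partial last line. The function name, the ".new" temporary file name, and the 1 MiB default chunk size are assumptions made for illustration; line numbers are taken to be 1-based and lines to end in "\n".

```python
import os

def remove_lines_chunked(filename, linenos, chunk_size=1 << 20):
    """Copy `filename` in large chunks, dropping the 1-based `linenos`."""
    skip = set(linenos)
    tmpname = filename + ".new"   # hypothetical temporary file name
    lineno = 1                    # number of the first line in the next block
    tail = b""                    # partial line carried over between chunks
    with open(filename, "rb") as src, open(tmpname, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            parts = (tail + chunk).split(b"\n")
            # The last piece is incomplete unless the chunk ended on a
            # newline, in which case it is simply empty.
            tail = parts.pop()
            kept = [p + b"\n" for n, p in enumerate(parts, lineno)
                    if n not in skip]
            dst.write(b"".join(kept))     # one write per chunk
            lineno += len(parts)
        if tail and lineno not in skip:
            dst.write(tail)               # final line without a newline
    os.replace(tmpname, filename)
```

Whether this actually beats plain line-by-line copying depends on the file and the Python version, since file objects are already buffered, so it is worth timing both.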

- Heikki Toivonen

- K. Brafford · Accepted Answer

You can have two file objects for the same file open at once (one for reading, one for writing):

def removeLine(filename, lineno):
    fro = open(filename, "rb")

    current_line = 0
    while current_line < lineno:
        fro.readline()
        current_line += 1

    seekpoint = fro.tell()
    frw = open(filename, "r+b")
    frw.seek(seekpoint, 0)

    # read the line we want to discard
    fro.readline()

    # now move the rest of the lines in the file 
    # one line back 
    chars = fro.readline()
    while chars:
        frw.write(chars)
        chars = fro.readline()

    fro.close()
    frw.truncate()
    frw.close()