In Python, is there a concise way to check whether two text files have the same contents?

Question

python file compare

77

I don't care what the differences are; I just want to know whether the contents differ.

- Corey Trager

10 Answers

36

If you're going for even basic efficiency, you probably want to check the file sizes first:

import os

if os.path.getsize(filename1) == os.path.getsize(filename2):
    # Use the filename variables, not the string literals 'filename1'/'filename2'.
    if open(filename1, 'r').read() == open(filename2, 'r').read():
        pass  # Files are the same.

This saves you from reading every line of two files that differ in size and therefore cannot be identical.

(Going even further, you could call out to a fast MD5sum of each file and compare those, but that's not "in Python", so I'll stop here.)

- Rich

6

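With only two files, the md5sum approach will be slower (you still have to read each file to compute the checksum). It only pays off when you're looking for duplicates among several files. - Brian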

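@Brian: you're assuming that md5sum's file reading is no faster than Python's, and that there is no overhead from reading the entire file into the Python environment as a string! Try it with 2 GB files... - Rich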

3

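There's no reason to expect md5sum's file reading to be faster than Python's; IO is largely independent of the language. The large-file problem is a reason to iterate in chunks (or to use filecmp), not a reason to use md5, where you pay an extra CPU penalty. - Brian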

6

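This is especially true when the files are not identical: comparing by blocks can bail out early, whereas md5sum must carry on reading the entire file. - Brian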
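Brian's point about bailing out early can be sketched roughly as follows. This is only an illustration and not code from the thread: the function name `first_difference_offset` and the 4096-byte chunk size are assumptions made for the example.

```python
def first_difference_offset(path1, path2, chunk_size=4096):
    """Return the offset of the first differing chunk, or None if the files match.

    Hypothetical helper for illustration; reading stops at the first mismatch.
    """
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        offset = 0
        while True:
            b1 = f1.read(chunk_size)
            b2 = f2.read(chunk_size)
            if b1 != b2:
                return offset  # early exit: the rest of both files is never read
            if not b1:
                return None  # both files ended without a mismatch
            offset += len(b1)
```

If two large files differ near the start, only the first chunk of each is read, whereas a checksum of either file would have to consume it completely.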

14

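Here is a functional-style file comparison function. It returns False immediately if the files have different sizes; otherwise, it reads the files in 4 KiB blocks and returns False at the first difference: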

from __future__ import with_statement
import os
import itertools, functools, operator
try:
    izip= itertools.izip  # Python 2
except AttributeError:
    izip= zip  # Python 3

def filecmp(filename1, filename2):
    "Do the two files have exactly the same contents?"
    with open(filename1, "rb") as fp1, open(filename2, "rb") as fp2:
        if os.fstat(fp1.fileno()).st_size != os.fstat(fp2.fileno()).st_size:
            return False # different sizes ∴ not equal

        # set up one 4k-reader for each file
        fp1_reader= functools.partial(fp1.read, 4096)
        fp2_reader= functools.partial(fp2.read, 4096)

        # pair each 4k-chunk from the two readers while they do not return b'' (EOF)
        cmp_pairs= izip(iter(fp1_reader, b''), iter(fp2_reader, b''))

        # return True for all pairs that are not equal
        inequalities= itertools.starmap(operator.ne, cmp_pairs)

        # voilà; any() stops at first True value
        return not any(inequalities)

if __name__ == "__main__":
    import sys
    print(filecmp(sys.argv[1], sys.argv[2]))

Just a different take :)

- ΤΖΩΤΖΙΟΥ

Very clever, using all the shortcuts, itertools and partial. Kudos, this is the best solution! - Todor Minakov

1

I had to make a slight change for Python 3, otherwise the function never returned: cmp_pairs = izip(iter(fp1_reader, b''), iter(fp2_reader, b'')) - Ted Striker
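Why the sentinel matters can be shown with a small sketch, assuming Python 3. Here `io.BytesIO` stands in for a file opened with "rb", and `itertools.islice` is used only so that the faulty variant terminates; none of this is code from the answer itself.

```python
import functools
import io
import itertools

data = io.BytesIO(b"abcdefgh")  # stands in for a file opened with "rb"
reader = functools.partial(data.read, 4)

# Correct sentinel for binary reads: iteration stops at the first b"".
chunks = list(iter(reader, b""))

# Wrong sentinel: b"" != "" in Python 3, so the iterator never stops by itself.
# islice caps it at 5 items so that this demonstration terminates.
data.seek(0)
runaway = list(itertools.islice(iter(reader, ""), 5))

print(chunks)   # [b'abcd', b'efgh']
print(runaway)  # [b'abcd', b'efgh', b'', b'', b'']
```

With the text sentinel, read() keeps returning b"" at EOF and the loop keeps accepting it, which matches the "never returns" behavior described in the comment.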

@TedStriker, you're right! Thanks for helping improve this answer :) - tzot

6

Since I can't comment on other people's answers, I'll write my own.

If you use md5, you definitely must not just call md5.update(f.read()), since that would use too much memory.

import hashlib

# f is a file object opened in binary mode ("rb").
def get_file_md5(f, chunk_size=8192):
    h = hashlib.md5()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()

- user32141

1

I believe any hashing operation is overkill for this question; a direct piece-by-piece comparison is faster and more straightforward. - tzot

I was just clearing up the actual hashing part that someone suggested. - user32141

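+1 I like your version better. Also, I don't think using a hash is overkill. If all you want to know is whether they differ, there's really no good reason not to. - Jeremy Cantrell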

3

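Computing a hash is worthwhile only when you need to cache/store the result, or to compare against something already cached/stored. Otherwise, just compare the strings. Whatever the hardware, str1 != str2 is faster than md5.new(str1).digest() != md5.new(str2).digest(). Hashes can also collide (unlikely, but not impossible). - tzot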
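The case where hashing does pay off, looking for duplicates among many files, might look roughly like this. It is a sketch under stated assumptions: `file_digest` and `group_duplicates` are hypothetical names, and SHA-256 is swapped in for the MD5 discussed in the thread.

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=8192):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    h = hashlib.sha256()  # assumption: SHA-256 instead of the thread's MD5
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_duplicates(paths):
    """Map each digest to the paths sharing it; keep only groups of 2 or more."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)
    return {digest: ps for digest, ps in groups.items() if len(ps) > 1}
```

Each file is read once and its digest can be stored, so N files need N reads rather than a pairwise comparison of every combination. If collisions are a concern, the files within a group can still be compared byte by byte.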

4

I would hash the file contents with MD5.

import hashlib

def checksum(f):
    md5 = hashlib.md5()
    # Binary mode: hashlib needs bytes, and the with block closes the file.
    with open(f, 'rb') as fh:
        md5.update(fh.read())
    return md5.hexdigest()

def is_contents_same(f1, f2):
    return checksum(f1) == checksum(f2)

if not is_contents_same('foo.txt', 'bar.txt'):
    print('The contents are not the same!')

- Jeremy Cantrell

2


f = open(filename1, "r").read()
f2 = open(filename2, "r").read()
print(f == f2)

This code opens both files and reads each one as a string. It then checks whether the two strings are equal and prints the result.

- mmattax

8

"I have an 8 GiB file and a 32 GiB file that I want to compare..." - tzot

3

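This is not a good way to do it. One big issue is that the files are never closed after being opened. Less importantly, there is no optimization, such as a file size comparison, before the files are opened and read. - kchawla-pi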

1

For larger files, you could compute an MD5 or SHA hash of the files.

- Nigel Campbell

4

So what about two 32 GiB files that differ only in the first byte? Why spend CPU time and wait a long time for the answer? - tzot

See my solution; for larger files it's better to do buffered reads. - Angel

1

from __future__ import with_statement

filename1 = "G:\\test1.TXT"
filename2 = "G:\\test2.TXT"

with open(filename1) as f1:
    with open(filename2) as f2:
        # Read both files fully and split them into lists of lines.
        file1list = f1.read().splitlines()
        file2list = f2.read().splitlines()
        list1length = len(file1list)
        list2length = len(file2list)
        if list1length == list2length:
            # Same number of lines: report each pair of lines.
            for index in range(len(file1list)):
                if file1list[index] == file2list[index]:
                    print(file1list[index] + "==" + file2list[index])
                else:
                    print(file1list[index] + "!=" + file2list[index] + " Not-Equal")
        else:
            print("difference in the size of the file and number of lines")

- Prashanth Babu

0

A simple and efficient solution:

import os


def is_file_content_equal(
    file_path_1: str, file_path_2: str, buffer_size: int = 1024 * 8
) -> bool:
    """Checks if two files content is equal
    Arguments:
        file_path_1 (str): Path to the first file
        file_path_2 (str): Path to the second file
        buffer_size (int): Size of the buffer to read the file
    Returns:
        bool that indicates if the file contents are equal
    Example:
        >>> is_file_content_equal("filecomp.py", "filecomp copy.py")
            True
        >>> is_file_content_equal("filecomp.py", "diagram.dio")
            False
    """
    # First check sizes
    s1, s2 = os.path.getsize(file_path_1), os.path.getsize(file_path_2)
    if s1 != s2:
        return False
    # If the sizes are the same check the content
    with open(file_path_1, "rb") as fp1, open(file_path_2, "rb") as fp2:
        while True:
            b1 = fp1.read(buffer_size)
            b2 = fp2.read(buffer_size)
            if b1 != b2:
                return False
            # if the content is the same and they are both empty bytes
            # the file is the same
            if not b1:
                return True

- Angel

0

filecmp is great for simply comparing files, but it cannot print line numbers or the differences between the files.

import filecmp

def compare_files(filename1, filename2):
    return filecmp.cmp(filename1, filename2, shallow=False)

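Here is a simple and efficient solution that is a bit more flexible: it prints the comparison status, the line number, and the contents of the lines that differ.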

def compare_with_line_diff(filename1, filename2):
    with open(filename1, "r") as file1, open(filename2, "r") as file2:

        # Loop for all lines in first file (keep only 2 lines in memory)
        for line_num, f1_line in enumerate(file1, start=1):

            # Only print status for range of lines
            if (line_num == 1 or line_num % 1000 == 0):
                print(f"comparing lines {line_num} to {line_num + 1000}")

            # Compare with next line of file2
            f2_line = file2.readline()
            if (f1_line != f2_line):
                print(f"Difference on line: {line_num}")
                print(f"f1_line: '{f1_line}'")
                print(f"f2_line: '{f2_line}'")
                return False

        # Check if file2 has more lines than file1
        for extra_line in file2:
            print(f"Difference on file2: {extra_line}")
            return False

    # Files are equal
    return True

- Jake Weilhammer

- Federico Ramponi · Accepted Answer

95

The low-level way:

from __future__ import with_statement
with open(filename1) as f1:
   with open(filename2) as f2:
      if f1.read() == f2.read():
         ...

The high-level way:

import filecmp
if filecmp.cmp(filename1, filename2, shallow=False):
   ...

- Federico Ramponi

14

I've modified your filecmp.cmp call, because without a non-true shallow argument it doesn't do what the question asks for. - tzot
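The effect of the shallow argument can be seen in a small experiment. This is a sketch and not code from the thread: the temporary file names and the timestamp are made up for the demonstration, and `os.utime` is used only to force both files to share a modification time.

```python
import filecmp
import os
import tempfile

# Two throwaway files with the same size but different contents.
tmpdir = tempfile.mkdtemp()
path1 = os.path.join(tmpdir, "one.txt")
path2 = os.path.join(tmpdir, "two.txt")
with open(path1, "wb") as f:
    f.write(b"aaaa")
with open(path2, "wb") as f:
    f.write(b"bbbb")

# Force identical access/modification times so the os.stat() signatures match.
os.utime(path1, (1000000000, 1000000000))
os.utime(path2, (1000000000, 1000000000))

shallow_result = filecmp.cmp(path1, path2)              # shallow=True is the default
deep_result = filecmp.cmp(path1, path2, shallow=False)  # compares the actual bytes
print(shallow_result, deep_result)  # True False
```

With the default shallow=True, files whose type, size and modification time all match are reported as equal without their contents being read, which is why the answer passes shallow=False.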

2

You're right. http://www.python.org/doc/2.5.2/lib/module-filecmp.html . Thank you very much. - Federico A. Ramponi

1

By the way, to be on the safe side you should open the files in binary mode, since the files can differ in their line separators. - newtover
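A quick way to see the difference newtover describes, assuming Python 3 and its default universal-newlines handling in text mode. The helper `same_contents` and the file names are hypothetical, and reading whole files like this is only reasonable for small inputs.

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
unix_path = os.path.join(tmpdir, "unix.txt")
dos_path = os.path.join(tmpdir, "dos.txt")
with open(unix_path, "wb") as f:
    f.write(b"line one\nline two\n")
with open(dos_path, "wb") as f:
    f.write(b"line one\r\nline two\r\n")


def same_contents(path1, path2, mode):
    """Read both files whole in the given mode and compare (small files only)."""
    with open(path1, mode) as f1, open(path2, mode) as f2:
        return f1.read() == f2.read()


text_equal = same_contents(unix_path, dos_path, "r")     # "\r\n" is translated to "\n"
binary_equal = same_contents(unix_path, dos_path, "rb")  # raw bytes are compared
print(text_equal, binary_equal)  # True False
```

Which answer is "right" depends on whether line endings count as content for your purposes; binary mode is the stricter, byte-for-byte check.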

9

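This can be a problem if the files are large. You can save the computer some work by comparing the file sizes first. If the sizes differ, the files are obviously different. You only need to read the files when the sizes are the same. - Bryan Oakley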

6

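I just found out that filecmp.cmp() also compares the inode number, ctime and other stat information in addition to the file contents. That was undesirable in my application. If you only want to compare the contents and not the file stats, f1.read() == f2.read() is probably the better way. - Ray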
