文档差异算法是如何工作的？

Question

文档差异算法是如何工作的？

algorithmdiff

36

我想实现Word文档差异性比较，需要用到哪些算法？

- user183173

2

你会用它来展示差异，还是以最优方式存储差异？ - Lasse V. Karlsen

5个回答

15

差异比较本质上只是解决最长公共子序列问题的一种方法。

最优解需要了解动态规划，因此这是一个相当复杂的问题。

然而，也可以通过构建后缀树来完成。两种算法在这里概述。

- Ben S

1

这通常是当您假设文档为字符或字节流时的情况。然而，这里的问题是关于Word文档的。在实施此类算法之前，您需要问自己一个问题：蓝色8pt Verdana中的“Hello World”与红色10pt Arial中的“Hello World”是否相同等。 - quosoo

3

是的，显然基本算法需要额外的逻辑来解析这种差异，但算法的核心仍然相同。 - Ben S

6

请参阅C#中的O(ND)差异算法。

- Galwegian

3

正如Ben S所指出的那样，差分问题通常可以通过解决最长公共子序列问题来解决。更具体地说，Hunt-McIlroy算法是应用于该问题的经典算法之一（例如在Unix的diff工具的实现中）。

- Brandon E Taylor

2

LCS问题的最优解是O(ND) Myer's算法，我用以下算法实现了对比Office 2007文件。算法论文链接

- SunnyShah

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- CraigTP · Accepted Answer

通常情况下，最长公共子序列问题通常用于解决diff。此外，请参阅维基百科关于Diff的"算法"部分：

The operation of diff is based on solving the longest common subsequence problem.

In this problem, you have two sequences of items:
   a b c d f g h j q z

   a b c d e f g i j k r x y z
and you want to find the longest sequence of items that is present in both original sequences in the same order. That is, you want to find a new sequence which can be obtained from the first sequence by deleting some items, and from the second sequence by deleting other items. You also want this sequence to be as long as possible. In this case it is
   a b c d f g j z
From the longest common subsequence it's only a small step to get diff-like output:
   e   h i   q   k r x y 
   +   - +   -   + + + +

这就意味着，所有基于文本的文档都能正常工作。因为Word文档实际上是二进制格式，并包含大量格式信息和数据，所以这将更加复杂。理想情况下，您可以研究自动化Word本身，因为它具有“diff”两个文档之间的能力，详见此处：Microsoft Word提示：如何比较两个文档的差异。