我想实现Word文档差异性比较,需要用到哪些算法?
通常情况下,最长公共子序列问题通常用于解决diff
。此外,请参阅维基百科关于Diff的"算法"部分:
这就意味着,所有基于文本的文档都能正常工作。因为Word文档实际上是二进制格式,并包含大量格式信息和数据,所以这将更加复杂。理想情况下,您可以研究自动化Word本身,因为它具有“diff”两个文档之间的能力,详见此处:Microsoft Word提示:如何比较两个文档的差异。The operation of diff is based on solving the longest common subsequence problem.
In this problem, you have two sequences of items:
a b c d f g h j q z a b c d e f g i j k r x y z
and you want to find the longest sequence of items that is present in both original sequences in the same order. That is, you want to find a new sequence which can be obtained from the first sequence by deleting some items, and from the second sequence by deleting other items. You also want this sequence to be as long as possible. In this case it is
a b c d f g j z
From the longest common subsequence it's only a small step to get diff-like output:
e h i q k r x y + - + - + + + +
正如Ben S所指出的那样,差分问题通常可以通过解决最长公共子序列问题来解决。更具体地说,Hunt-McIlroy算法是应用于该问题的经典算法之一(例如在Unix的diff工具的实现中)。