两个列表之间序列差异的算法评估

3

我正在寻找一种比较两个序列的算法。

序列A - 将是一个按最优顺序排列的整数ID列表。

序列B - 将是相同ID的列表,但顺序可能不同。

我希望能够检测出这两个列表之间的差异。

因此,我正在寻找一种解决此问题的算法。我想知道是否有已经解决过这个常见问题的算法。


4
你是否看过DiffLib (声明:我编写了这个库)?我不喜欢“点击此链接查找答案”的回答方式,所以我不会将其发布为答案。 - Lasse V. Karlsen
你可以使用动态规划,它非常适合识别两个序列中元素的添加、删除或修改。 - Nolonar
我还有这些在线文章,其中包括DiffLib的基本版本:http://devdirective.com/post/91/creating-a-reusable-though-simple-diff-implementation-in-csharp-part-1 - 再次强调,这不是Stack Overflow上的好答案。 - Lasse V. Karlsen
2个回答

4

正如Julián Urbano所建议的那样,Kendall Tau相关性是一个很好的测量指标。我决定使用Linq在.Net中实现它。这是我的代码,它实现了Tau-A(用于没有关系的数据)和Tau-B(允许存在关系)。该代码假定您的数据尚未排序,因此它会根据Measure1对其进行一次排序以获取第一组排名值,然后按Measure2对其进行排序以获取第二组排名。相关的是排名而不是原始数据。(如果测量lambda函数返回原始对象未更改,则可以将其应用于您已经拥有的排名。)

using System;
using System.Collections.Generic;
using System.Linq;
using static System.Math;

namespace Statistics
{
    /// <summary>
    /// Compute the Kendall Tau Correlation of two orderings of values, a non-parametric correlation that compares the ranking
    /// of value, not the values themselves.
    /// 
    /// The correlation of the measures is one if all values are in the same order when sorted by two different measures.
    /// The correlation is minus one if the second ordering is the reverse of the first.
    /// The correlation is zero if the values are completely uncorrelated.
    /// 
    /// Two algorithms are provided: TauA and TauB. TauB accounts properly for duplicate values (ties), unlike TauA.
    /// </summary>
    public class KendallTauCorrelation<T, C> where C : IComparable<C>
    {
        private Func<T, C> Measure1 { get; }
        private Func<T, C> Measure2 { get; }

        public KendallTauCorrelation(Func<T, C> measure1, Func<T, C> measure2)
        {
            Measure1 = measure1;
            Measure2 = measure2;
        }

        /// <summary>
        /// Compute the Tau-a rank correlation, which is suitable if there are no ties in rank.
        /// </summary>
        /// <returns>A value between -1 and 1. 
        /// If the measures are ranked the same by both measures, returns 1.
        /// If the measures are ranked in exactly opposite order, return -1.
        /// The more items that are out of sequence, the lower the score.
        /// If the measures are completely uncorrelated, returns zero.
        /// </returns>
        /// <param name="data">Data to be ranked according to two measures and then correlated.</param>
        public double TauA(IList<T> data)
        {
            var ranked = data
                     .OrderBy(Measure1)
                     .Select((item, index) => new { Data = item, Rank1 = index + 1})
                     .OrderBy(pair => Measure2(pair.Data))
                     .Select((pair, index) => new { pair.Rank1, Rank2 = index + 1 })
                     .ToList();
            var numerator = 0;

            var n = ranked.Count;
            var denominator = n * (n - 1) / 2.0;
            for (var i = 1; i < n; i++)
                for (var j = 0; j < i; j++)
                {
                    numerator += Sign(ranked[i].Rank1 - ranked[j].Rank1) 
                               * Sign(ranked[i].Rank2 - ranked[j].Rank2);
                }
            return numerator / denominator;
        }

        /// <summary>
        /// Compute the Tau-b correlation, which accounts for ties.
        /// 
        ///             n  - n
        ///              c    d
        ///  τ  = -----------------------
        ///   b    _____________________
        ///       / (n  -  n )(n  -  n )
        ///      √    0     1   0     2
        /// 
        /// where:
        ///        n0 = n(n-1)/2
        ///               
        ///        n1 =  Σ  t (t - 1)/2
        ///              i   i  i
        /// 
        ///        n2 =  Σ  t (t - 1)/2
        ///              j   j  j
        /// 
        ///      t[i] = # of ties for the ith group according to measure 1.
        ///      t[j] = # of ties for the jth group according to measure 2.
        ///        nc = # of concordant pairs
        ///        nd = # of discordant pairs
        /// </summary>
        /// <returns>A correlation value between -1 (perfect reverse correlation)
        ///  and +1 (perfect correlation). 
        /// Zero means uncorrelated. </returns>
        /// <param name="data">Data.</param>
        public double TauB(IEnumerable<T> data)
        {
            // Compute two Ranks by sorting first by Measure1 and then by Measure2.
            // Group by like values of each in order to handle ties.
            var ranked = data.Select(item => new { M1 = Measure1(item), M2 = Measure2(item) })
                .GroupBy(measures => new { measures.M1 })
                .OrderBy(@group => @group.First().M1)
                .ThenBy(@group => @group.First().M2)
                .AsEnumerable()
                .Select((@group, groupIndex) => new
                {
                    Measure1Ranked = @group.Select((measure, index) => new { measure.M1, measure.M2 }),
                    Rank = ++groupIndex
                })
                .SelectMany(v => v.Measure1Ranked, (s, i) => new
                {
                    i.M1,
                    i.M2,
                    DenseRank1 = s.Rank
                })
                .GroupBy(measures => new { measures.M2 })
                .OrderBy(@group => @group.First().M2)
                .ThenBy(@group => @group.First().M1)
                .AsEnumerable()
                .Select((@group, groupIndex) => new
                 {
                     Measure2Ranked = @group.Select((measure, index) => new { measure.M1, measure.M2, measure.DenseRank1 }),
                     Rank = ++groupIndex
                 })
                .SelectMany(v => v.Measure2Ranked, (s, i) => new { i.M1, i.M2, i.DenseRank1, DenseRank2 = s.Rank })                          
                .ToArray();
            if (ranked.Length <= 1)
                return 0; // No data or one data point. Impossible to establish correlation.

            // Now that we have ranked the data, compute the correlation.
            var n = ranked.Count();
            var n0 = n * (n - 1) / 2;
            var n1 = 0;
            var n2 = 0;
            var numerator = 0; // Stores nc - nd as a single value, rather than computing them separately.
            for (var i = 1; i < n; i++)
                for (var j = 0; j < i; j++)
                {
                    var iRanked = ranked[i];
                    var jRanked = ranked[j];
                    numerator += Sign(iRanked.DenseRank1 - jRanked.DenseRank1)
                               * Sign(iRanked.DenseRank2 - jRanked.DenseRank2);
                    // Keep track of ties. Because we are running the indices in a triangle,
                    // we automatically get this for n1 and n2: ties * (ties - 1) / 2
                    if (iRanked.M1.CompareTo(jRanked.M1) == 0)
                        n1++;
                    if (iRanked.M2.CompareTo(jRanked.M2) == 0)
                        n2++;
                }
            if (n0 == n1 || n0 == n2)
                return 0; // All ties, so everything as the same rank.
            // Observe that if n1 = n2 = 0, that this formula is identical to Tau-a.
            return numerator / Sqrt((double)(n0 - n1)*(n0 - n2));
        }
    }
}

这里是 NUnit 中的单元测试:
using System;
using NUnit.Framework;
using static System.Math; // New C# 6.0 feature that allows one to import static methods and call them without their class name.

namespace Statistics
{

    [TestFixture]
    public class KendallTauCorrelationTests
    {
        public static int[] OneToTen = new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };

        #region Tau-a

        [Test]
        public void TauA_SameOrder()
        {
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => value * 10
            );
            Assert.AreEqual(
                1.0,
                kendall.TauA(OneToTen),
                "Numbers that sort in the same order should be perfectly correlated."
            );
        }

        [Test]
        public void TauA_ReverseOrder()
        {
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => value * -10
            );
            Assert.AreEqual(
                -1.0,
                kendall.TauA(OneToTen),
                "Numbers that sort in reverse order should be perfectly anti-correlated."
            );
        }

        [Test]
        public void TauA_OneSwap()
        {
            var reordered = new[] { 1, 2, 3, 5, 4, 6, 7, 8, 9, 10 };
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => reordered[value - 1]
            );
            Assert.AreEqual(
                43.0 / 45.0,
                kendall.TauA(OneToTen),
                0.00001,
                "If a single number is out of place the sequences should be almost perfectly correlated."
            );
        }

        #endregion

        #region Tau-b

        [Test]
        public void TauB_SameOrder()
        {
            var kendall = new KendallTauCorrelation<int,int>(
                (int value) => value, 
                (int value) => value * 10
            );
            Assert.AreEqual(
                1.0, 
                kendall.TauB(OneToTen), 
                "Numbers that sort in the same order should be perfectly correlated."
            );
        }

        [Test]
        public void TauB_ReverseOrder()
        {
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => value * -10
            );
            Assert.AreEqual(
                -1.0,
                kendall.TauB(OneToTen),
                "Numbers that sort in reverse order should be perfectly anti-correlated."
            );
        }

        [Test]
        public void TauB_OneSwap_NoTies()
        {
            var reordered = new[] { 1,2,3,5,4,6,7,8,9,10 };
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => reordered[value-1]
            );
            Assert.AreEqual(
                43.0/45.0,
                kendall.TauB(OneToTen),
                0.00001,
                "If a single number is out of place the sequences should be almost perfectly correlated."
            );
        }

        [Test]
        public void TauB_Ties()
        {
            var reordered = new[] { 1, 1, 1, 4, 5, 6, 7, 8, 9, 10 };
            var kendall = new KendallTauCorrelation<int, int>(
                (int value) => value,
                (int value) => reordered[value - 1]
            );
            Assert.AreEqual(
                42.0 / Sqrt(42.0*45.0),
                kendall.TauB(OneToTen),
                0.00001,
                "Adding a few ties should be almost perfectly correlated."
            );
        }

        #endregion
    }
}

注意:这里使用的是耗时O(N^2)的算法。有一种更高效的方法,使用改进后的归并排序,时间复杂度为N Log N,我听过它的提到,但我没有看到它是如何实现的。

注意:这个通用类假定两个度量返回相同的数据类型。可以轻松更改类具有两个通用度量类型。它们只需要是可比较的IComparables。它们不需要相互比较。


2
如果您只想测量它们有多不同,但不关心差异出现在哪里,您可以使用肯德尔相关系数。它会给您一个得分从-1(列表顺序相反)到+1(列表顺序相同)。
它基本上计算两个列表中按相同顺序排列的元素对的数量,并将其除以总对数:
int[] a = { 1, 2, 3, 4, 5, 6, 7, 8 };
int[] b = { 3, 4, 1, 8, 6, 7, 2, 5 };

double numer = 0;
for (int i = 0; i < (a.Length - 1); i++)
  for (int j = i + 1; j < a.Length; j++)
    numer += Math.Sign(a[i] - a[j]) * Math.Sign(b[i] - b[j]);

double tau = numer / (a.Length * (a.Length - 1) / 2);

这似乎是 Kendall Tau-a。如果您的数据在排名上存在并列(因为项目不唯一),则应使用 Kendall tau-b,它考虑了并列但更加复杂。请参阅 https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient。 - Paul Chernoch

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接