在C#中高效地合并两个有序序列

6

假设这两个有序序列:

var outer = new char[] { 'a', 'b', 'b', 'c', 'd', 'd', 'e' };
var inner = new char[] { 'a', 'b', 'c', 'c', 'd', 'd' };

既然两个序列的元素都是有序的,那么如何比使用Enumerable.Join更高效地将它们进行内连接以产生以下元组序列?

{ 'a', 'a' }
{ 'b', 'b' }
{ 'b', 'b' }
{ 'c', 'c' }
{ 'c', 'c' }
{ 'd', 'd' }
{ 'd', 'd' }
{ 'd', 'd' }
{ 'd', 'd' }

请注意,与仅从两个序列中产生不同元素的Enumerable.Intersect不同,此处的输出序列返回表示来自一对一、一对多或多对多关系的每个元素组合的元组。
语义与SQL Server中的INNER JOIN相似。但更具体地说,我正在寻找一个C#实现,其性能特征类似于merge join算法(INNER MERGE JOIN),可以返回具有延迟执行的IEnumerable
所需的方法签名可能如下所示:
IEnumerable<TResult> MergeJoin<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer, 
    IEnumerable<TInner> inner, 
    Func<TOuter, TKey> outerKeySelector, 
    Func<TInner, TKey> innerKeySelector, 
    Func<TOuter, TInner, TResult> resultSelector)

3
有什么阻止你自己实现这个吗?你甚至链接了一些伪代码。这里的问题是“帮我写代码”吗? - Jeroen Mostert
1
我很快会发布我的答案。 - Biscuits
1
你能否将 outer, inner 类型更改为 IList,例如? - Alexander Petrov
2
不,你必须假设对于任一序列,没有关于元素数量的先验知识,也不能按其序数位置或稀疏方式访问任意元素。 - Biscuits
2个回答

2
如果两个序列都已排序,则来自MoreLinq库的MoreEnumerable.OrderedMerge可以完成此工作。

https://github.com/morelinq/MoreLINQ

using MoreLinq;

IEnumerable<char> result = outer.OrderedMerge(innner);

效率很好,与内连接相比。 当N和M是每个序列的长度时, 内连接会产生笛卡尔积,因此时间与NxM成正比。 有序合并遍历每个集合一次,因此时间与N+M成正比。
如果序列未排序,则标准的Linq Enumerable.OrderBy可以完成任务。
还有一些重载函数:
// to provide a custom comparison criteria
public static IEnumerable<T> OrderedMerge<T>(this IEnumerable<T> first, IEnumerable<T> second, IComparer<T> comparer);

// to provide the key for comparisons
IEnumerable<T> OrderedMerge<T, TKey>(this IEnumerable<T> first, IEnumerable<T> second, Func<T, TKey> keySelector);

// + other overloads to select element to be merged when first element is less than second, 
// when second element is less than first 
// when first and second element are equal

2

我不知道是否存在任何现有的Enumerable扩展方法可以实现这一点,而且没有人应该花费比我更多的时间来想出这个。

public static IEnumerable<TResult> MergeJoin<TOuter, TInner, TKey, TResult>(this IEnumerable<TOuter> outer, IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector, Func<TOuter, TInner, TResult> resultSelector)
{
    if (outer == null)
        throw new ArgumentNullException(nameof(outer));

    if (inner == null)
        throw new ArgumentNullException(nameof(inner));

    if (outerKeySelector == null)
        throw new ArgumentNullException(nameof(outerKeySelector));

    if (innerKeySelector == null)
        throw new ArgumentNullException(nameof(innerKeySelector));

    if (resultSelector == null)
        throw new ArgumentNullException(nameof(resultSelector));

    return MergeJoinIterator(outer, inner, outerKeySelector, innerKeySelector, resultSelector, Comparer<TKey>.Default);
}

public static IEnumerable<TResult> MergeJoin<TOuter, TInner, TKey, TResult>(this IEnumerable<TOuter> outer, IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector, Func<TOuter, TInner, TResult> resultSelector, IComparer<TKey> comparer)
{
    if (outer == null)
        throw new ArgumentNullException(nameof(outer));

    if (inner == null)
        throw new ArgumentNullException(nameof(inner));

    if (outerKeySelector == null)
        throw new ArgumentNullException(nameof(outerKeySelector));

    if (innerKeySelector == null)
        throw new ArgumentNullException(nameof(innerKeySelector));

    if (resultSelector == null)
        throw new ArgumentNullException(nameof(resultSelector));

    if (comparer == null)
        throw new ArgumentNullException(nameof(comparer));

    return MergeJoinIterator(outer, inner, outerKeySelector, innerKeySelector, resultSelector, comparer);
}

private static IEnumerable<TResult> MergeJoinIterator<TOuter, TInner, TKey, TResult>(IEnumerable<TOuter> outer, IEnumerable<TInner> inner, Func<TOuter, TKey> outerKeySelector, Func<TInner, TKey> innerKeySelector, Func<TOuter, TInner, TResult> resultSelector, IComparer<TKey> comparer)
{
    IEnumerator<TOuter> outerEnumerator = outer.GetEnumerator();

    if (!outerEnumerator.MoveNext())
        yield break;

    IEnumerator<TInner> innerEnumerator = inner.GetEnumerator();

    if (!innerEnumerator.MoveNext())
        yield break;

    TOuter outerElement = outerEnumerator.Current;
    TKey outerKey = outerKeySelector(outerElement);

    TInner innerElement = innerEnumerator.Current;
    TKey innerKey = innerKeySelector(innerElement);

    int comp = comparer.Compare(innerKey, outerKey);

    while (true)
    {
        if (comp < 0)
        {
            if (!innerEnumerator.MoveNext())
                break;

            innerElement = innerEnumerator.Current;
            innerKey = innerKeySelector(innerElement);
            comp = comparer.Compare(innerKey, outerKey);
            continue;
        }

        if (comp > 0)
        {
            if (!outerEnumerator.MoveNext())
                break;

            outerElement = outerEnumerator.Current;
            outerKey = outerKeySelector(outerElement);
            comp = comparer.Compare(innerKey, outerKey);
            continue;
        }

        yield return resultSelector(outerElement, innerElement);

        if (!outerEnumerator.MoveNext())
        {
            while (true)
            {
                if (!innerEnumerator.MoveNext())
                    break;

                innerElement = innerEnumerator.Current;
                innerKey = innerKeySelector(innerElement);
                comp = comparer.Compare(innerKey, outerKey);

                if (comp != 0)
                    break;

                yield return resultSelector(outerElement, innerElement);
            }

            break;
        }

        if (!innerEnumerator.MoveNext())
        {
            while (true)
            {
                outerElement = outerEnumerator.Current;
                outerKey = outerKeySelector(outerElement);
                comp = comparer.Compare(innerKey, outerKey);

                if (comp != 0)
                    break;

                yield return resultSelector(outerElement, innerElement);

                if (!outerEnumerator.MoveNext())
                    break;
            }

            break;
        }

        TOuter outerElementNext = outerEnumerator.Current;
        TKey outerKeyNext = outerKeySelector(outerElementNext);

        TInner innerElementNext = innerEnumerator.Current;
        TKey innerKeyNext = innerKeySelector(innerElementNext);

        comp = comparer.Compare(outerKeyNext, outerKey);
        bool stop = false;

        if (comp != 0)
        {
            comp = comparer.Compare(innerKeyNext, innerKey);

            if (comp == 0)
            {
                yield return resultSelector(outerElement, innerElementNext);

                while (true)
                {
                    if (!innerEnumerator.MoveNext())
                    {
                        stop = true;
                        break;
                    }

                    innerElementNext = innerEnumerator.Current;
                    innerKeyNext = innerKeySelector(innerElementNext);
                    comp = comparer.Compare(innerKeyNext, outerKey);

                    if (comp != 0)
                        break;

                    yield return resultSelector(outerElement, innerElementNext);
                }

                if (stop)
                    break;
            }

            outerElement = outerElementNext;
            outerKey = outerKeyNext;
            innerElement = innerElementNext;
            innerKey = innerKeyNext;
            comp = comparer.Compare(innerKey, outerKey);
            continue;
        }

        comp = comparer.Compare(innerKeyNext, innerKey);

        if (comp != 0)
        {
            yield return resultSelector(outerElementNext, innerElement);

            while (true)
            {
                if (!outerEnumerator.MoveNext())
                {
                    stop = true;
                    break;
                }

                outerElementNext = outerEnumerator.Current;
                outerKeyNext = outerKeySelector(outerElementNext);
                comp = comparer.Compare(innerKey, outerKeyNext);

                if (comp != 0)
                    break;

                yield return resultSelector(outerElementNext, innerElement);
            }

            if (stop)
                break;

            outerElement = outerElementNext;
            outerKey = outerKeyNext;
            innerElement = innerElementNext;
            innerKey = innerKeyNext;
            comp = comparer.Compare(innerKey, outerKey);
            continue;
        }

        yield return resultSelector(outerElement, innerElementNext);
        var innerRest = new List<TInner>();

        TInner innerElementFollowing = default(TInner);
        TKey innerKeyFollowing = default(TKey);

        while (true)
        {
            if (!innerEnumerator.MoveNext())
            {
                stop = true;
                break;
            }

            innerElementFollowing = innerEnumerator.Current;
            innerKeyFollowing = innerKeySelector(innerElementFollowing);
            comp = comparer.Compare(innerKeyFollowing, outerKey);

            if (comp != 0)
                break;

            yield return resultSelector(outerElement, innerElementFollowing);
            innerRest.Add(innerElementFollowing);
        }

        yield return resultSelector(outerElementNext, innerElement);
        yield return resultSelector(outerElementNext, innerElementNext);

        for (int i = 0; i < innerRest.Count; i++)
            yield return resultSelector(outerElementNext, innerRest[i]);

        while (true)
        {
            if (!outerEnumerator.MoveNext())
            {
                stop = true;
                break;
            }

            outerElementNext = outerEnumerator.Current;
            outerKeyNext = outerKeySelector(outerElementNext);
            comp = comparer.Compare(innerKey, outerKeyNext);

            if (comp != 0)
                break;

            yield return resultSelector(outerElementNext, innerElement);
            yield return resultSelector(outerElementNext, innerElementNext);

            for (int i = 0; i < innerRest.Count; i++)
                yield return resultSelector(outerElementNext, innerRest[i]);
        }

        if (stop)
            break;

        outerElement = outerElementNext;
        outerKey = outerKeyNext;
        innerElement = innerElementFollowing;
        innerKey = innerKeyFollowing;
        comp = comparer.Compare(innerKey, outerKey);
    }
}

1
你可以透露一下你的测试吗?我只是尝试了以下代码:var r = new Random(100);var outer = Enumerable.Range(0, 100000).Select(x => r.Next(500)).OrderBy(x => x).ToList();var inner = Enumerable.Range(0, 100000).Select(x => r.Next(500)).OrderBy(x => x).ToList();/*Stopwatch...;*/var res = MergeJoin(outer, inner, o => o, i => i, (o, i) => new { o, i }).ToList(); 而且 Join 比你和我写的都快,所以我很想知道在什么情况下你的实现会表现得更好。 - grek40
1
在我的端上运行你的测试,Enumerable.Join 确实表现更佳。有趣的是,当将随机数的最大值增加到10000(每个序列平均重复10次)时,MergeJoin 开始表现更好。我猜这种技术在序列没有过多重复项或者需要提前结束时会表现得更好。 - Biscuits
1
在最理想的情况下,使用一对一关系,MergeJoin 可以比连接具有 100 万个唯一元素序列的情况快约 3.5 倍。 - Biscuits
1
此外,当内部序列中的唯一元素数量进一步增加时,Enumerable.Join 真正开始发挥作用。 - Biscuits
1
似乎你的方法在更多独特元素方面表现出色(令人印象深刻),但对于更多重复元素,你的性能与连接相当,但并没有真正更好(至少根据我运行的一些测试)。 - grek40
1
我的发现与你的类似。对于足够大、完全重复的序列,两者都很擅长因 OutOfMemoryException 而崩溃 :) - Biscuits

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接