这个算法如何计算所有唯一整数？

Question

这个算法如何计算所有唯一整数？

4

下面是一个算法，用于计算有序数组中所有唯一整数的数量，但我不确定它是如何使用二分查找完成的？有人能解释一下吗？谢谢。

int unique(int[] a) {
    int i = 0;
    int count = 0;
    while (i < a.length) {
      i = nextIndex(a, i, a[i]);
      count++;
    }
    return count;
  }

  int nextIndex(int[] a, int l, int target) {
    int r = a.length - 1;
    while (l <= r) {
      int mid = l + (r - l) / 2;
      if (a[mid] == target) l = mid + 1;
      else r = mid - 1;
    }
    return r + 1;
  }

- mewowo

3

你尝试过调试代码吗？例如使用调试器逐行查看代码并监视变量及其值来进行调试？ - Some programmer dude

3

不，你下面有实现算法的代码。我怀疑问题在于你并不知道它正在实现的算法是什么。 - John

1

为什么不使用简单的循环呢？ - ProgrammingLlama

1

那个算法只对大部分元素都是重复的数组有用，例如一个包含100万个元素但只有3个唯一值的数组。 - user3386109

2个回答

6

nextIndex 方法本质上是一种二分查找方法，用于查找目标值的最后一个出现位置，然后返回此位置加一的索引。因此，外部循环将迭代并增加计数，其次数等于唯一值的数量。

请注意，只有在性能分析显示需要时，才会使用这种复杂的方法。标准方法应该是 myValues.Distinct().Count()，它适用于类型不同于 int 数组且无序列表的情况，并且应该更快。

连续的二分搜索应具有复杂度为O(m log n)，其中n是总项数，m是唯一值的数量。这表明如果平均少于log n个重复值，则线性搜索应该更快。这样的线性搜索可能如下所示:

int count = 1;
for(int i = 1; i < a.Count; i++){
    count += a[i-1] != a[i] ? 1 : 0;
}

请记住，有些不断变化的因素可能会影响结果。例如，随机访问内存比线性访问更加昂贵，因为它使得缓存更加困难。因此，在某些情况下，与硬件更好匹配的简单算法优于复杂的算法，并且一些实现会根据数据集切换策略。

- JonasH

即使是简单的1、2、3……有序数组，也会被这个花哨的算法处理得更慢。 - Alexei Levenkov

仅在重复计数高的情况下，才可以使用二分搜索方法。 - user1196549

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Matthew Watson · Accepted Answer

JonasH已经发布了关于这个问题的实际答案。

但我对比使用Distinct()或者线性搜索，想知道二分查找法是否更快（如果更快的话）。

这是我用来进行基准测试的代码:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

namespace ConsoleApp1;

public class UnderTest
{
    public UnderTest()
    {
        const int NUM_VALUES = 1_000_000;
        const double PROBABLILITY_OF_CHANGE = 0.00001;

        _data = new int[NUM_VALUES];

        var rng = new Random(98891); // Fixed seed.

        for (int value = 0, i = 0; i < NUM_VALUES; ++i)
        {
            _data[i] = value;

            if (rng.NextDouble() <= PROBABLILITY_OF_CHANGE)
                ++value;
        }

        // Print out to prove they all return the same value.
        Console.WriteLine(usingBinarySearch(_data));
        Console.WriteLine(usingDistinct    (_data));
        Console.WriteLine(usingLinearSearch(_data));
    }

    [Benchmark]
    public void UsingBinarySearch()
    {
        usingBinarySearch(_data);
    }

    [Benchmark]
    public void UsingDistinct()
    {
        usingDistinct(_data);
    }

    [Benchmark]
    public void UsingLinearSearch()
    {
        usingLinearSearch(_data);
    }

    static int usingBinarySearch(int[] a)
    {
        int i     = 0;
        int count = 0;

        while (i < a.Length)
        {
            i = nextIndex(a, i, a[i]);
            count++;
        }

        return count;
    }

    static int nextIndex(int[] a, int l, int target)
    {
        int r = a.Length - 1;

        while (l <= r)
        {
            int mid = l + (r - l) / 2;
            if (a[mid] == target) l = mid + 1;
            else r = mid - 1;
        }
        return r + 1;
    }

    static int usingDistinct(int[] a)
    {
        return a.Distinct().Count();
    }

    static int usingLinearSearch(int[] a)
    {
        int count = 1;

        for (int i = 1; i < a.Length; i++)
        {
            count += a[i - 1] != a[i] ? 1 : 0;
        }

        return count;
    }

    readonly int[] _data;
}

在第一次测试中，我提供了一些数据，期望二分查找的速度明显更快：一个包含100万个整数的数组，每个元素值大于前一个元素的概率为0.00001（PROBABLILITY_OF_CHANGE = 0.00001）。

使用一个固定种子为98891的随机数生成器，结果数组中只有7个不同的值，得出以下计时结果（使用.NET 6.0）：

|            Method |           Mean |         Error |        StdDev |
|------------------ |---------------:|--------------:|--------------:|
| UsingBinarySearch |       251.0 ns |       4.93 ns |      12.64 ns |
|     UsingDistinct | 9,341,067.8 ns | 185,888.76 ns | 294,839.27 ns |
| UsingLinearSearch | 1,607,222.2 ns |  51,565.50 ns | 146,282.75 ns |

在这种情况下，二分搜索的速度比线性搜索快得多。值得注意的是，与线性搜索相比，Distinct()非常慢。

但需要注意的是，这并不是一个公平的比较，因为Distinct()可以用于未排序的数据，而其他两个算法需要排序后才能使用。如果您有未排序的数据，则为其他算法排序的开销会使Distinct()成为更好的选择。这样的比较留给读者作为谚语式的练习...

现在让我们尝试使用PROBABLILITY_OF_CHANGE = 0.001（结果为919个不同的元素）：

|            Method |        Mean |      Error |     StdDev |
|------------------ |------------:|-----------:|-----------:|
| UsingBinarySearch |    93.08 us |   1.787 us |   4.831 us |
|     UsingDistinct | 9,944.12 us | 197.825 us | 356.718 us |
| UsingLinearSearch | 1,503.85 us |  28.239 us |  63.740 us |

二分搜索仍然显著更快。

现在，PROBABLILITY_OF_CHANGE = 0.1（结果有100,058个不同的元素）：

|            Method |      Mean |     Error |    StdDev |
|------------------ |----------:|----------:|----------:|
| UsingBinarySearch |  5.541 ms | 0.1096 ms | 0.1347 ms |
|     UsingDistinct | 14.331 ms | 0.5516 ms | 1.6091 ms |
| UsingLinearSearch |  2.319 ms | 0.0422 ms | 0.0782 ms |

现在线性搜索比二分搜索更快。

这表明，Distinct() 不是解决此问题的好方法 - 线性搜索始终更好（主要是因为数据已经排序 - 如果没有排序，我们就必须对其进行排序，这将显着改变其他算法的时间）。

只有在不同值的数量相对于容器大小较低时，使用二分搜索才值得。

（请注意，这些运行由 Benchmark.Net 输出的不同单位 - ns、us 和 ms。我意识到后来我忘记了指定单位，所以它们被自动缩放到运行中最快的基准测试...）