Why isn't SortedSet<T>.GetViewBetween O(log N)?

Question

74

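In .NET 4.0+, the class SortedSet<T> has a method called GetViewBetween(l, r), which returns a view of the part of the tree containing all the values between the two specified bounds. Given that SortedSet<T> is implemented as a red-black tree, I naturally expected it to run in O(log N) time. The similar methods in C++ are std::set::lower_bound/upper_bound and in Java TreeSet.headSet/tailSet, and they are all logarithmic.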

However, that is not the case. The following code runs in 32 seconds, whereas an equivalent O(log N) version of GetViewBetween would make it run in 1-2 seconds.

var s = new SortedSet<int>();
int n = 100000;
var rand = new Random(1000000007);
int sum = 0;
for (int i = 0; i < n; ++i) {
    s.Add(rand.Next());
    if (rand.Next() % 2 == 0) {
        int l = rand.Next(int.MaxValue / 2 - 10);
        int r = l + rand.Next(int.MaxValue / 2 - 10);
        var t = s.GetViewBetween(l, r);
        sum += t.Min;
    }
}
Console.WriteLine(sum);

I decompiled System.dll with dotPeek, and here is what I got:

public TreeSubSet(SortedSet<T> Underlying, T Min, T Max, bool lowerBoundActive, bool upperBoundActive)
    : base(Underlying.Comparer)
{
    this.underlying = Underlying;
    this.min = Min;
    this.max = Max;
    this.lBoundActive = lowerBoundActive;
    this.uBoundActive = upperBoundActive;
    this.root = this.underlying.FindRange(this.min, this.max, this.lBoundActive, this.uBoundActive);
    this.count = 0;
    this.version = -1;
    this.VersionCheckImpl();
}

internal SortedSet<T>.Node FindRange(T from, T to, bool lowerBoundActive, bool upperBoundActive)
{
  SortedSet<T>.Node node = this.root;
  while (node != null)
  {
    if (lowerBoundActive && this.comparer.Compare(from, node.Item) > 0)
    {
      node = node.Right;
    }
    else
    {
      if (!upperBoundActive || this.comparer.Compare(to, node.Item) >= 0)
        return node;
      node = node.Left;
    }
  }
  return (SortedSet<T>.Node) null;
}

private void VersionCheckImpl()
{
    if (this.version == this.underlying.version)
      return;
    this.root = this.underlying.FindRange(this.min, this.max, this.lBoundActive, this.uBoundActive);
    this.version = this.underlying.version;
    this.count = 0;
    base.InOrderTreeWalk((TreeWalkPredicate<T>) (n =>
    {
      SortedSet<T>.TreeSubSet temp_31 = this;
      int temp_34 = temp_31.count + 1;
      temp_31.count = temp_34;
      return true;
    }));
}

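Clearly, FindRange is O(log N), but after that we call VersionCheckImpl... which does a linear-time traversal of the found subtree only to recount its nodes!

Why would you need to do that traversal every time?
Why doesn't .NET have an O(log N) method for splitting a tree by key, like C++ or Java? It is really helpful in a lot of situations.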

- Skiminok

16

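Yes, it is VersionCheckImpl() that kills the performance. That check is very important in the .NET collection classes, and I can't think of a better way to do it. You get O(log n) for as long as you keep using the subset and it keeps passing the check, but creating it is O(n). You can post to connect.microsoft.com to point this out and get feedback from the insiders. Odds are, however, that they will close it as "By design". - Hans Passant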

3

That is a really nonsensical mistake by the BCL (the .NET Base Class Library). GetRange is supposed to be more efficient than a linear filter! - usr

25

This is a sad consequence of the requirement that SortedSet<T>.Count be O(1). - Raymond Chen

4

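I'd like to point out that it is not as bad as a linear filter. If I understand the code correctly, only the range is traversed linearly, not the whole set. - hwiechers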

8

Greetings from 2017: the dotnet core implementation of SortedSet<T>.GetViewBetween(...) is O(log(n)). Plus the elements you need to extract, of course. - Kristoffer la Cour

2 Answers

13

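If anyone is coming back to this question 10 years after it was asked, as I am, here is a link to the TreeSubSet implementation: https://github.com/dotnet/runtime/blob/fae7ee8e8e3aa7f86836318a10ed676641e813ad/src/libraries/System.Collections/src/System/Collections/Generic/SortedSet.TreeSubSet.cs#L38. The call to VersionCheckImpl() now seems to have been removed.

So I guess you can now do this: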

SortedSet<int> ss = new();
ss.Add(1);
ss.Add(2);
//ss.Add(3);
ss.Add(4);
ss.Add(5);
ss.Add(6);
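// Use the Max property here; the LINQ extension ss.Max() would scan the whole set in O(n).
var four = ss.GetViewBetween(3, ss.Max).First(); // First() requires using System.Linq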

in O(log n) time.

- nicolattu

- llj098 · Accepted Answer

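About the `version` field

Update 1:

As far as I remember, many (perhaps all) of the collections in the BCL have a version field.

First, about `foreach`:

According to this MSDN link

The foreach statement repeats a group of embedded statements for each element in an array or an object collection. The foreach statement is used to iterate through the collection to get the desired information, but should not be used to change the contents of the collection to avoid unpredictable side effects.

In many other collections, the version field is what ensures that the data is not modified during a foreach.

For example, Hashtable's MoveNext():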

public virtual bool MoveNext()
{
    if (this.version != this.hashtable.version)
    {
        throw new InvalidOperationException(Environment.GetResourceString("InvalidOperation_EnumFailedVersion"));
    }
    //..........
}

But in SortedSet<T>'s MoveNext() method:

public bool MoveNext()
{
    this.tree.VersionCheck();
    if (this.version != this.tree.version)
    {
        ThrowHelper.ThrowInvalidOperationException(ExceptionResource.InvalidOperation_EnumFailedVersion);
    }       
    //....
}
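To see the version check from the caller's side, here is a minimal sketch. The class and method names (VersionCheckDemo, MutationDuringForeachThrows) are made up for illustration; only SortedSet<T> and its public members come from the framework. It relies on the behavior shown above: a successful Add changes the set's version, so the next MoveNext() throws.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo type, not part of the BCL.
public static class VersionCheckDemo
{
    // Returns true if mutating the set inside foreach trips the version check.
    public static bool MutationDuringForeachThrows()
    {
        var set = new SortedSet<int> { 1, 2, 3 };
        try
        {
            foreach (int x in set)
            {
                set.Add(x + 100); // a successful Add changes the set's internal version
            }
        }
        catch (InvalidOperationException)
        {
            return true; // the enumerator's MoveNext() noticed the stale version
        }
        return false;
    }
}
```

The exception is raised by the enumerator on the iteration after the Add, not by Add itself.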

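Update 2:

But the O(N) loop may be there not only for version but also for the Count property.

Because the MSDN documentation for GetViewBetween says:

This method returns a view of the range of elements that fall between lowerValue and upperValue, as defined by the comparer .... You can make changes in both the view and in the underlying SortedSet(Of T).

So on every update the count field has to be synchronized (the keys and values are already the same) to make sure Count is correct.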
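The two-way behavior described in that documentation can be observed with a short sketch. ViewSyncDemo and Run are hypothetical names used only for this example; the SortedSet<T> calls are the public API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo type, not part of the BCL.
public static class ViewSyncDemo
{
    // Returns the view's count before and after a change made through the
    // underlying set, and whether a removal made through the view reached the set.
    public static (int Before, int AfterSetAdd, bool SetStillHasFour) Run()
    {
        var set = new SortedSet<int> { 1, 2, 4, 5, 6 };
        SortedSet<int> view = set.GetViewBetween(2, 5); // bounds are inclusive

        int before = view.Count;      // the view holds 2, 4, 5

        set.Add(3);                   // change made through the underlying set
        int afterSetAdd = view.Count; // the view now also sees 3

        view.Remove(4);               // change made through the view
        bool setStillHasFour = set.Contains(4);

        return (before, afterSetAdd, setStillHasFour);
    }
}
```

Because the view has to report a correct Count after any such change, something has to recount the range at some point; the two strategies below differ only in when that happens.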

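There are two strategies for achieving this:

Microsoft's
Mono's

First, Microsoft's. In their code, they sacrifice the performance of GetViewBetween() to gain performance for the Count property.

VersionCheckImpl() is one way of keeping the Count property in sync.

Second, Mono's. In Mono's code, GetViewBetween() is faster, but look at their GetCount() method: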

internal override int GetCount ()
{
    int count = 0;
    using (var e = set.tree.GetSuffixEnumerator (lower)) {
        while (e.MoveNext () && set.helper.Compare (upper, e.Current) >= 0)
            ++count;
    }
    return count;
}

That is always an O(N) operation!
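The trade-off between the two strategies comes down to when the range gets recounted. The following toy sketch shows a version-stamped cached count that recounts only when Count is read after a change. VersionedSet, CountingView and Recounts are hypothetical names invented for this illustration; this is neither the Microsoft nor the Mono code, and unlike the decompiled constructor above it does not count at creation time.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy types for illustration only.
public sealed class VersionedSet
{
    private readonly SortedSet<int> _items = new SortedSet<int>();
    public int Version { get; private set; }

    public bool Add(int value)
    {
        bool added = _items.Add(value);
        if (added) Version++; // every successful mutation invalidates cached counts
        return added;
    }

    // Linear in the size of the range, like the recount in VersionCheckImpl().
    public int CountBetween(int lower, int upper)
    {
        int n = 0;
        foreach (int unused in _items.GetViewBetween(lower, upper)) n++;
        return n;
    }
}

public sealed class CountingView
{
    private readonly VersionedSet _source;
    private readonly int _lower, _upper;
    private int _cachedCount;
    private int _cachedVersion = -1; // forces a recount on first use
    public int Recounts { get; private set; }

    public CountingView(VersionedSet source, int lower, int upper)
    {
        _source = source; _lower = lower; _upper = upper; // no counting here
    }

    public int Count
    {
        get
        {
            if (_cachedVersion != _source.Version) // stale: pay the linear cost now
            {
                _cachedCount = _source.CountBetween(_lower, _upper);
                _cachedVersion = _source.Version;
                Recounts++;
            }
            return _cachedCount; // otherwise O(1)
        }
    }
}
```

With this arrangement, creating a view costs nothing extra, repeated reads of Count on an unchanged set are O(1), and the linear walk is paid once per change, and only if Count is actually read.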
