List<T>.Contains() 很慢？

Question

List<T>.Contains() 很慢？

.netarraysgenericslist

101

请问有人能解释一下为什么泛型 List.Contains() 函数会这么慢吗？

我有一个包含大约一百万个数字的 List<long>，代码在不断检查这些数字中是否有特定数字。

我尝试使用 Dictionary<long, byte> 和 Dictionary.ContainsKey() 函数来完成同样的事情，速度比 List 快10-20倍。

当然，我不想在这种情况下使用 Dictionary，因为它并不是为这种用途而设计的。

所以，真正的问题是，是否有任何替代方案来替代 List<T>.Contains()，但不像 Dictionary<K,V>.ContainsKey() 那样奇怪？

- DSent

3

字典有什么问题吗？它的使用就是为了像你这样的情况。 - Kamarey

4

@Kamarey：使用 HashSet 可能是更好的选择。 - Brian Rasmussen

HashSet 就是我在寻找的。 - DSent

8个回答

33

List.Contains是一个O(n)的操作。

Dictionary.ContainsKey是一个O(1)的操作，因为它使用对象的哈希码作为键，从而提供了更快的搜索能力。

我认为，在包含一百万条目的列表中扫描以查找几个条目并不是一个好主意。

难道不可能将这些百万实体保存到例如关系型数据库，并在该数据库上执行查询吗？

如果不可能，那么在进行键查找时仍然可以使用Dictionary。

- Frederik Gheysels

14

我认为拥有一百万项的列表本身并没有什么不妥，只是你可能不想在其中进行线性搜索。 - Will Dean

同意，列表或数组中有那么多条目并没有什么问题。只是不要扫描值。 - Michael Krauklis

9

我想我有答案！是的，对于列表（数组）上的Contains（），其时间复杂度为O(n), 但如果数组很短并且您正在使用值类型，则仍然应该相当快。但是，使用CLR Profiler [从Microsoft免费下载]，我发现Contains()正在装箱比较值，这需要堆分配，这非常昂贵（慢）。[注意：这是.Net 2.0;其他.Net版本未经过测试。]

以下是完整的故事和解决方案。我们有一个名为“VI”的枚举，并创建了一个名为“ValueIdList”的类，它是VI对象列表（数组）的抽象类型。原始实现是古老的.Net 1.1版本，并且使用了封装的ArrayList。最近我们在http://blogs.msdn.com/b/joshwil/archive/2004/04/13/112598.aspx中发现泛型列表（List < VI >）在值类型（如我们的枚举VI）上表现比ArrayList好得多，因为不必要求值必须装箱。这确实是真的，而且也奏效了...几乎。

CLR Profiler揭示了一个惊喜。以下是分配图的一部分:

ValueIdList::Contains bool(VI) 5.5MB (34.81%)
Generic.List::Contains bool(<UNKNOWN>) 5.5MB (34.81%)
Generic.ObjectEqualityComparer<T>::Equals bool (<UNKNOWN> <UNKNOWN>) 5.5MB (34.88%)
Values.VI 7.7MB (49.03%)

正如您所看到的，Contains() 函数出人意料地调用了 Generic.ObjectEqualityComparer.Equals() 函数，这显然需要对 VI 值进行装箱，从而需要昂贵的堆分配。奇怪的是，微软消除了列表上的装箱，却在像这样的简单操作中再次要求它。

我们的解决方案是重新编写 Contains() 实现，在我们的情况下很容易做到，因为我们已经封装了通用列表对象(_items)。以下是简单的代码:

public bool Contains(VI id) 
{
  return IndexOf(id) >= 0;
}

public int IndexOf(VI id) 
{ 
  int i, count;

  count = _items.Count;
  for (i = 0; i < count; i++)
    if (_items[i] == id)
      return i;
  return -1;
}

public bool Remove(VI id) 
{
  int i;

  i = IndexOf(id);
  if (i < 0)
    return false;
  _items.RemoveAt(i);

  return true;
}

我们现在使用自己的IndexOf()版本进行VI值比较，无需装箱，速度非常快。我们的特定程序在这个简单的重写后加速了20%。O(n)…没问题！只需避免浪费内存使用！

- Kevin North

谢谢你的建议，我自己也被糟糕的装箱性能所困扰。对于我的使用情况来说，自定义的Contains实现速度要快得多。 - Lea Hayes

5

字典并不差，因为字典中的键被设计成可快速查找。要在列表中查找一个数字需要遍历整个列表。

当然，字典只适用于您的数字是唯一且无序的情况。

我认为在.NET 3.5中也有一个HashSet<T>类，它也只允许唯一元素。

- Stefan Steinegger

一个字典类型<类别, 整数>可以有效地存储非唯一对象 - 使用整数来计算重复的数量。例如，您可以将列表 {a，b，a} 存储为 {a=2，b=1}。当然，它会失去顺序。 - MSalters

4

这并不是对你问题的直接回答，但我有一个类可以提高集合Contains（）方法的性能。我建立了一个Queue的子类，并添加了一个字典，将哈希码映射到对象列表。而Dictionary.Contains（）函数的复杂度为O（1），而List.Contains（）、Queue.Contains（）和Stack.Contains（）的复杂度为O（n）。

字典的值类型是一个包含具有相同哈希码的对象的队列。调用者可以提供实现IEqualityComparer的自定义类对象。您可以使用此模式来处理栈或列表。只需进行一些少量更改即可完成代码。

/// <summary>
/// This is a class that mimics a queue, except the Contains() operation is O(1) rather     than O(n) thanks to an internal dictionary.
/// The dictionary remembers the hashcodes of the items that have been enqueued and dequeued.
/// Hashcode collisions are stored in a queue to maintain FIFO order.
/// </summary>
/// <typeparam name="T"></typeparam>
private class HashQueue<T> : Queue<T>
{
    private readonly IEqualityComparer<T> _comp;
    public readonly Dictionary<int, Queue<T>> _hashes; //_hashes.Count doesn't always equal base.Count (due to collisions)

    public HashQueue(IEqualityComparer<T> comp = null) : base()
    {
        this._comp = comp;
        this._hashes = new Dictionary<int, Queue<T>>();
    }

    public HashQueue(int capacity, IEqualityComparer<T> comp = null) : base(capacity)
    {
        this._comp = comp;
        this._hashes = new Dictionary<int, Queue<T>>(capacity);
    }

    public HashQueue(IEnumerable<T> collection, IEqualityComparer<T> comp = null) :     base(collection)
    {
        this._comp = comp;

        this._hashes = new Dictionary<int, Queue<T>>(base.Count);
        foreach (var item in collection)
        {
            this.EnqueueDictionary(item);
        }
    }

    public new void Enqueue(T item)
    {
        base.Enqueue(item); //add to queue
        this.EnqueueDictionary(item);
    }

    private void EnqueueDictionary(T item)
    {
        int hash = this._comp == null ? item.GetHashCode() :     this._comp.GetHashCode(item);
        Queue<T> temp;
        if (!this._hashes.TryGetValue(hash, out temp))
        {
            temp = new Queue<T>();
            this._hashes.Add(hash, temp);
        }
        temp.Enqueue(item);
    }

    public new T Dequeue()
    {
        T result = base.Dequeue(); //remove from queue

        int hash = this._comp == null ? result.GetHashCode() : this._comp.GetHashCode(result);
        Queue<T> temp;
        if (this._hashes.TryGetValue(hash, out temp))
        {
            temp.Dequeue();
            if (temp.Count == 0)
                this._hashes.Remove(hash);
        }

        return result;
    }

    public new bool Contains(T item)
    { //This is O(1), whereas Queue.Contains is (n)
        int hash = this._comp == null ? item.GetHashCode() : this._comp.GetHashCode(item);
        return this._hashes.ContainsKey(hash);
    }

    public new void Clear()
    {
        foreach (var item in this._hashes.Values)
            item.Clear(); //clear collision lists

        this._hashes.Clear(); //clear dictionary

        base.Clear(); //clear queue
    }
}

我的简单测试表明，我的 HashQueue.Contains() 比 Queue.Contains() 运行速度更快。使用计数设置为 10,000 运行测试代码，HashQueue 版本需要 0.00045 秒，而 Queue 版本需要 0.37 秒。当计数为 100,000 时，HashQueue 版本需要 0.0031 秒，而 Queue 版本需要 36.38 秒！

以下是我的测试代码：

static void Main(string[] args)
{
    int count = 10000;

    { //HashQueue
        var q = new HashQueue<int>(count);

        for (int i = 0; i < count; i++) //load queue (not timed)
            q.Enqueue(i);

        System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            bool contains = q.Contains(i);
        }
        sw.Stop();
        Console.WriteLine(string.Format("HashQueue, {0}", sw.Elapsed));
    }

    { //Queue
        var q = new Queue<int>(count);

        for (int i = 0; i < count; i++) //load queue (not timed)
            q.Enqueue(i);

        System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            bool contains = q.Contains(i);
        }
        sw.Stop();
        Console.WriteLine(string.Format("Queue,     {0}", sw.Elapsed));
    }

    Console.ReadLine();
}

- user2023861

我刚刚为HashSet<T>添加了第三个测试用例，似乎比你的解决方案获得了更好的结果：

`HashQueue, 00:00:00.0004029`
`Queue,     00:00:00.3901439`
`HashSet,   00:00:00.0001716`

- psulek

这太棒了，比队列运行速度快得多。 - fizgig

2

SortedList会更快地进行搜索（但插入项目的速度较慢）。

- Mitch Wheat

1

为什么字典不适用？

如果要查看列表中是否存在特定值，您需要遍历整个列表。使用字典（或其他基于哈希的容器）可以更快地缩小需要与之比较的对象数量。键（在您的情况下是数字）被散列，这给字典提供了相对子集的对象来进行比较。

- Andrew

0

我正在使用紧凑框架，在那里不支持HashSet，我选择了一个字典，其中两个字符串都是我要查找的值。

这意味着我可以获得列表<>功能和字典性能。这有点hacky，但它有效。

- Mark McGookin

1

如果你使用字典代替哈希集合，那么最好将值设置为""而不是与键相同的字符串。这样可以节省内存。或者你甚至可以使用Dictionary<string,bool>并将它们全部设置为true（或false）。我不知道哪种方法会使用更少的内存，是空字符串还是布尔值。我的猜测是布尔值。 - TTT

在字典中，一个 string 引用和一个 bool 值在32位或64位系统中的大小差别为3或7个字节。但需要注意的是，每个条目的大小都分别向上舍入为4或8的倍数。因此，选择使用 string 还是 bool 可能根本不会影响其大小。空字符串 "" 总是作为静态属性 string.Empty 存在于内存中，所以无论你是否在字典中使用它都没有任何区别（而且它在其他地方也被使用）。 - Wormbo

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Marc Gravell · Accepted Answer

如果您只是想检查存在性，.NET 3.5 中的 HashSet<T> 是您最好的选择 - 类似于字典的性能，但没有键值对，只有值：

    HashSet<int> data = new HashSet<int>();
    for (int i = 0; i < 1000000; i++)
    {
        data.Add(rand.Next(50000000));
    }
    bool contains = data.Contains(1234567); // etc