HashSet中的Union和UnionWith有什么区别？

Question

HashSet中的Union和UnionWith有什么区别？

27

HashSet.Union与HashSet.Unionwith在合并两个哈希集时有什么区别。

我尝试这样组合：

HashSet<EngineType> enginesSupportAll = _filePolicyEvaluation.EnginesSupportAll;
        enginesSupportAll = enginesSupportAll != null ? new HashSet<EngineType>(engines.Union(enginesSupportAll)) : enginesSupportAll;

什么是最佳方法以及为什么？

- Max.Futerman

4

UnionWith 会改变原始序列，Union 返回一个包含两个序列的并集的新序列。 - Lee

1

UnionWith方法会修改你原始的HashSet（在这个例子中是enginesSupportAll），它会将第二个“other”集合中的所有元素添加到原始集合中。而Union方法则会创建一个新的HashSet，其中包含与上述集合相同的元素，但它还有一个额外的可选参数，允许你指定一个IEqualityComparer，可以改变结果中产生的元素。 - trapsuutjies

2个回答

26

这里用的不是HashSet.Union而是Enumerable.Union，所以你正在使用一个适用于任何类型的IEnumerable<>的LINQ扩展方法，而HashSet.UnionWith则是一个真正的HashSet方法，可以修改当前实例。

Union返回一个IEnumerable<TSource>
UnionWith是void，它修改了当前的HashSet实例
也许UnionWith会稍微更有效率，因为它可以进行优化

如果您的方法不想支持任何类型的序列，只想使用HashSet并且可以修改它，请使用这个，否则请使用LINQ扩展。如果您仅出于此目的创建HashSet实例，则并不重要，我更喜欢使用LINQ来获得更大的灵活性并能够链接我的查询。

- Tim Schmelter

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Kasper van den Berg · Accepted Answer

给定 HashSet<T> A 和 HashSet<T> B，有四种方法可以得到 A ∪ B：

new HashSet<t>(A.Union(B))
_{(参见HashSet<T&>(IEnumerable<T>)和
Enumerable.Union<T>(IEnumerable<T>, IEnumerable<T>))}
A.UnionWith(B)
HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
new HashSet<t>(A.Concat(B))
_{(参见Enumerable.Concat<T>(IEnumerable<T>, IEnumerable<T>))}

每种方法都有其优缺点：

1和4是表达式，它们的结果是一个HashSet，而2和3是语句或语句块。
与2或3相比，表达式1和4可以在更多的地方使用。例如，在linq查询语法表达式中使用2或3很麻烦：
from x in setofSetsA as IEnumerable<HashSet<T>> from y in setOfSetsB as IEnumerable<HashSet<T>> select x.UnionWith(y)将不起作用，因为UnionWith返回void。
1、3和4保留A和B，并返回一个新的集合，而2修改了A。
有些情况下，修改原始集合之一是不好的，而有些情况下，至少可以修改一个原始集合而不会产生负面影响。
计算成本：

A.UnionWith(B)
(≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|B|))

≤

HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
(≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|A| + |B|))

≤

HashSet<T>(A.Concat(B))
(≈ O(log(|A∪B|) * |A∪B|) + O(|A| + |B|))

≤

HashSet<T>(A.Union(B))
(≈ 2 * O(log(|A∪B|) * |A∪B|) + O(|A| + |B| + |A∪B|))

下一节将深入研究参考源，以了解这些性能估计的基础。

性能

`HashSet<T>`

在选项1、3和4中，使用构造函数HashSet<T>(IEnumerable<T>, IEqualityComparer<T>)从一个IEnumerable<T>创建一个HashSet<T>。如果传递的IEnumerable<T>具有Count属性，即它是一个ICollection<T>，则此属性用于设置新HashSet的大小：

int suggestedCapacity = 0;
ICollection<T> coll = collection as ICollection<T>;
if (coll != null) {
    suggestedCapacity = coll.Count;
}
Initialize(suggestedCapacity);

-- HashSet.cs line 136–141

[Count()][10]方法从未被调用。因此，如果可以轻松检索到IEnumerable的计数，则使用它来保留容量；否则，当添加新元素时，HashSet会增长和重新分配。
在选项1中，A.Union(B)和选项4中的A.Concat(B)都不是ICollection<T>，因此创建的HashSet将会增长和重新分配几次（≈ log(|A∪B|)）。选项3可以使用A的Count。
构造函数调用UnionWith来填充新的空HashSet：

this.UnionWith(collection);
-- HashSet.cs line 143

UnionWith(IEnumerable<T>)会遍历作为参数传递的IEnumerable<T>中的元素，并对每个元素调用AddIfNotPresent(T)。

AddIfNotPresent(T)插入元素并确保不会将重复元素插入到集合中。
HashSet<T>被实现为一个包含槽位的数组，m_slots，以及一个包含桶的数组，m_buckets。每个桶只包含一个指向m_slots数组的整数索引。对于每个桶，m_slots中的Slot形成一个链接列表，其索引指向m_slots中的下一个Slot。

AddIfNotPresent(T)跳转到正确的桶，然后遍历其链接列表以检查元素是否已经存在：

for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
    if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, value)) {
        return false;
    }
}

-- HashSet.cs line 968–975

下一步，会找到一个空闲索引并预留一个插槽。首先检查空闲插槽列表 m_freelist。当空闲插槽列表中没有插槽时，将使用 m_slots 数组中的下一个空插槽。如果空闲插槽列表和数组中都没有可用插槽，则会通过 IncreaseCapacity() 方法增加容量。请注意，HTML 标签已被保留，但不包括解释。

int index;
if (m_freeList >= 0) {
    index = m_freeList;
    m_freeList = m_slots[index].next;
}
else {
    if (m_lastIndex == m_slots.Length) {
        IncreaseCapacity();
        // this will change during resize
        bucket = hashCode % m_buckets.Length;
    }
    index = m_lastIndex;
    m_lastIndex++;
}

-- HashSet.cs line 977–990

AddIfNotPresent(T)有三个需要一些计算的操作：调用object.GetHashCode()，在发生冲突时调用object.Equals(object)，以及IncreaseCapacity()。实际添加元素只需设置一些指针和整数的成本。

当HashSet<T>需要IncreaseCapacity()时，容量至少加倍。因此我们可以得出结论，平均情况下一个HashSet<T>填充75%。如果哈希值均匀分布，则哈希冲突的期望也为75%。

SetCapacity(int, bool)是最昂贵的，它由IncreaseCapacity()调用，它分配新数组，将旧槽数组复制到新数组，并重新计算桶列表:

Slot[] newSlots = new Slot[newSize];
if (m_slots != null) {
    Array.Copy(m_slots, 0, newSlots, 0, m_lastIndex);
}

...

int[] newBuckets = new int[newSize];
for (int i = 0; i < m_lastIndex; i++) {
    int bucket = newSlots[i].hashCode % newSize;
    newSlots[i].next = newBuckets[bucket] - 1;
    newBuckets[bucket] = i + 1;
}
m_slots = newSlots;
m_buckets = newBuckets;

-- HashSet.cs line 929–949

选项1和4（new HashSet<T>(A.Union(B))）会导致更多的IncreaseCapacity()调用。在不考虑A.Union(B)或A.Concat(B)的成本的情况下，成本大约为O(log(|A∪B|) * |A∪B|)。
而当使用选项2（A.UnionWith(B)）或选项3（HashSet<T> C = new HashSet<T>(A); C.UnionWith(B)）时，我们在成本上获得了log（|A|）的“折扣”：O((log(|A∪B|) - log(|A|)) * |A∪B|)。将另一个集合合并到最大的集合中作为目标时，这样做是划算的。

`Enumerable<T>.Union(IEnumerable<T>)`

Enumerable<T>.Union(IEnumerable<T>)通过UnionIterator<T>(IEnumerable<T>, IEnumerable<T>, IEqualityComparer<T>)实现。
UnionIterator使用Set<T>——位于Enumerable.cs中的内部类，与HashSet<T>非常相似。 UnionIterator将A和B的项惰性地添加到此Set<T>中，并在可以添加时yield元素。工作是在Find(T, bool)中完成，该方法类似于HashSet<T>.AddIfNotPresent(T)。检查元素是否已经存在：

int hashCode = InternalGetHashCode(value);
for (int i = buckets[hashCode % buckets.Length] - 1; i >= 0; i = slots[i].next) {
    if (slots[i].hashCode == hashCode && comparer.Equals(slots[i].value, value)) return true;
}

-- Enumerable.cs line 2423–2426

查找一个空闲的索引并保留一个位置：

int index;
if (freeList >= 0) {
    index = freeList;
    freeList = slots[index].next;
}
else {
    if (count == slots.Length) Resize();
    index = count;
    count++;
}
int bucket = hashCode % buckets.Length;
slots[index].hashCode = hashCode;
slots[index].value = value;
slots[index].next = buckets[bucket] - 1;
buckets[bucket] = index + 1;

-- Enumerable.cs line 2428–2442

Resize()和IncreaseCapacity()相似，两者之间最大的区别在于Resize()不使用质数作为桶的数量，在使用糟糕的GetHashCode()时有稍微更高的碰撞几率。以下是Resize()的代码：

int newSize = checked(count * 2 + 1);
int[] newBuckets = new int[newSize];
Slot[] newSlots = new Slot[newSize];
Array.Copy(slots, 0, newSlots, 0, count);
for (int i = 0; i < count; i++) {
    int bucket = newSlots[i].hashCode % newSize;
    newSlots[i].next = newBuckets[bucket] - 1;
    newBuckets[bucket] = i + 1;
}
buckets = newBuckets;
slots = newSlots;

-- Enumerable.cs line 2448–2458

< p > A.Union(B) 的性能成本与 HashSet<T> C = new HashSet<T>(); C.UnionWith(A); C.UnionWith(B); 相比没有明显差异。在选项1中 (new HashSet<T>(A.Union(B)))，同一个 HashSet 被创建了两次，导致非常昂贵的 2 * O(log(|A∪B|) * (|A∪B|))。选项4源于了解 HashSet<T>(IEnumerable<T>) 和 Enumerable.Union(IEnumerable<T>, IEnumerable<T>) 的实现方式。它避免了冗余的 A.Union(B)，导致成本为 O(log(|A∪B|) * |A∪B|)。