快速计算两个字符串数组的交集数量

Question

快速计算两个字符串数组的交集数量

6

我需要快速计算两个大字符串数组相交的元素数量。

我正在使用以下代码：

arr1[i].Intersect(arr2[j]).Count()

针对CPU时间，VS分析器指出：

System.Linq.Enumerable.Count()占用了85.1%的时间
System.Linq.Enumerable.Intersect()占用了0.3%的时间

不幸的是，这样的工作可能需要数小时。

如何更快地完成？

- LeMoussel

3

您从分析器得到的数字可能不是“正确”的。因为在使用.Intersect()时，交集并没有被执行，而是在使用.Count()时整个查询才被执行。这就是LINQ的本质。我怀疑在求交集时需要比计数更多的工作量。如果您真的需要优化性能，请尝试不使用LINQ来完成它。 - cpt. jazz

如果数据量足够大，就将其存储在数据库中，或者创建一组计算机/线程集群，也许可以进行一些MapReduce操作。 - Kieren Johnstone

你是在交叉比较arr1和arr2中的字符串，还是在比较arr1中每个字符串的每个字符与arr2中每个字符串的每个字符？ - Alex Filipovici

我认为这很难成为你的瓶颈。在我的测试中，对于两个包含500万个字符串（平均长度为60个字符）的数组进行Intersect + Count操作大约需要3.5秒... 你的数组有多大？ - digEmAll

5个回答

1

我怀疑分析器显示时间被消耗在Count的原因是这里实际上枚举了集合（Intersect是惰性评估的，在需要结果之前不会运行）。

我相信Intersect应该有一些内部优化，使其变得相当快，但您可以尝试使用HashSet<string>，以便确保可以进行交集而无需搜索每个元素的内部数组：

HashSet<string> set = new HashSet<string>(arr1);
set.IntersectWith(arr2);
int count = set.Count;

- driis

很奇怪，在我的测试中，这个方法比原始版本和我改正过的"Some1"答案的速度都慢，尽管我本来以为它会更好。 - Rawling

1

嗯，交集可能是N^2。

为了加快速度，可以对两个数组进行快速排序，然后遍历两个数组。计算交集。

懒得测试它的速度，但应该是O（nlogn + n）。

    using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            const int arrsize = 1000000;
            Random rnd = new Random(42);
            string[] arr1 = new string[arrsize];
            string[] arr2 = new string[arrsize];
            for (int i = 0; i < arrsize; i++)
            {
                arr1[i] = rnd.Next().ToString();
                arr2[i] = rnd.Next().ToString();
            }
            {
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                arr1.Intersect(arr2).Count();
                Console.WriteLine("array" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

        {

            HashSet<string> set = new HashSet<string>(arr1);
            var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
            set.IntersectWith(arr2);
            int count = set.Count;
            Console.WriteLine("HashSet" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
        }
            {
               var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                HashSet<string> set = new HashSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("HashSet + new" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

            {
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                SortedSet<string> set = new SortedSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet +new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

            {

                SortedSet<string> set = new SortedSet<string>(arr1);
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet without new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }
        }
    }
}

结果

数组 914,637

哈希集合 816,119

新的哈希集合 1,150,978

新的排序集合 16,173,836

不使用新的排序集合 7,946,709

因此，最好的方法是保持一个准备好的哈希集合。

- Nahum

Linq扩展中的Intersect实现为O(n log n)。 - user287107

在 HashSet 中的交集也是以 O(n log n) 实现的，我刚才查看了反汇编代码。 - user287107

@user287107：http://c-sharp-snippets.blogspot.it/2010/03/runtime-complexity-of-net-generic.html - digEmAll

我知道它可能会更快，但是请看一下IntersectIterator类的实现。迭代器执行以下操作：1）创建新集合，2）遍历source1，将项目添加到集合中，3）遍历source2，如果删除成功，则yield返回元素。添加和删除的时间复杂度为O(log n)，通过循环，时间复杂度为O(n log n)。 - user287107

@user287107：Linq.Enumerable.Set 的添加和删除操作是 O(1) 操作（与 HashSet<> 相同），并且通过循环的时间复杂度是 O(n*1) --> O(n)。 - digEmAll

显示剩余3条评论

0

当你使用集合进行工作时，你的复杂度将会是O((n log n)*(m log m))或者类似的。

我认为这里应该更快，但我不确定它现在是否是O((n log n)+(m log m))。

possible would be 
var Set1 = arr1[i].Distinct().ToArray(); // if necessary, if arr1 or arr2 could be not distinct
var Set2 = arr2[j].Distinct().ToArray();  

nCount = Set1.Count() + Set2.Count() - Set1.Append(Set2).Distinct().Count();

- user287107

0

使用较小的数组构建一个 HashSet，然后循环遍历更大的数组，如果元素存在于 HashSet 中，则增加计数器。

- Eren Ersönmez

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Anirudha · Accepted Answer

您可以使用HashSet与arr2。

HashSet<string> arr2Set = new HashSet<string>(arr2);
arr1.Where(x=>arr2Set.Contains(x)).Count();
              ------------------
                      |
                      |->HashSet's contains method executes quickly using hash-based lookup..

不考虑从arr2转换为arr2Set，这应该是O(n)