Java 8: what is the most efficient way to return the duplicates in a list (rather than remove them)?

Question

java list optimization duplicates java-stream

6

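I have an ArrayList of strings, and I want to find and return every value that occurs more than once in the list. Most people are looking for the opposite (removing duplicates, e.g. with distinct()), so example code is hard to find.

I was able to come up with the following code: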

public synchronized List<String> listMatching(List<String> allStrings) {

    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> Collections.frequency(allStrings, string) > 1)
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for Collections.frequency(): "+ elapsedTime);

    return duplicates;
}

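But this uses Collections.frequency, which loops over the entire list for every element and counts each occurrence. On my current list of about 4,000 strings it takes roughly 150 ms. That is a bit slow for me, and it will only get slower as the list grows. I rewrote the frequency method so that it returns as soon as it finds a second occurrence: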

protected boolean moreThanOne(Collection<?> c, Object o) {
    boolean found = false;
    if (o != null) {
        for (Object e : c) {
            if (o.equals(e)) {
                if (found) {
                    return found;
                } else {
                    found = true;
                }
            }
        }
    }
    return found;
}

And I changed my method to use it:

public synchronized List<String> listMatching(List<String> allStrings)   {
    long startTime = System.currentTimeMillis();

    List<String> duplicates = allStrings.stream().filter(string -> moreThanOne(allStrings, string))
            .collect(Collectors.toList());

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    LOG.info("Time for moreThanOne(): "+ elapsedTime);

    return duplicates;
}

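This seems to work as expected, but it did not speed things up as much as I had hoped, coming in at about 120 ms. That is probably because it still has to loop through the list for every item, but I'm not sure how to avoid that and still get the job done.

I know this may look like premature optimization, but my list could easily reach 1 million+ entries, and this method is a critical part of my application that affects the timing of everything else.

Do you see any way I could optimize this code further? Perhaps with some kind of fancy Predicate? Or an entirely different approach?

EDIT: Thanks to all your suggestions, I was able to come up with something significantly faster: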

public synchronized Set<String> listMatching(List<String> allStrings) {

    Set<String> allItems = new HashSet<>();
    Set<String> duplicates = allStrings.stream()
            .filter(string -> !allItems.add(string))
            .collect(Collectors.toSet());

    return duplicates;
}

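Running under the same conditions, this gets through my list in under 5 ms. All of the HashMap suggestions would be great if I needed to know the counts. I'm not sure why Collections.frequency() doesn't use that technique.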

- Jonathon Hoaglin

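You could use a map with your objects as keys and a counter as the value. At the end you just take the keys whose value is greater than 1. But that may be just as slow as your solution, or even slower. - Lino

"Do you see any way I could optimize this code further?" You could use a HashMap<String, Integer>, where the String is what you want to store and the Integer is its frequency. That way your moreThanOne check would go from the current O(n) to O(1). I think that would cut the time considerably. - Anand Undavia

If you are able to remove them, you could store them in a new list; that would be the best solution. - user177800

3 Answers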

2

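Building a Map that holds a count for each string is a good way to make this truly scalable. To build the Map, you look up each string in the list. If the string is not yet in the Map, put the string and a count of one into the Map. If the string is found in the Map, increment the count.

You probably want to use some type that lets you increment the count in place, rather than having to put() the updated count back every time. For example, you could use an int[] with a single element.

Another advantage of not re-putting the counts is that it is easy to run in parallel, because you can synchronize on the object that holds the count whenever you want to read or write it.

The non-parallel code might look like this: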

Map<String, int[]> map = new HashMap<>(listOfStrings.size());
for (String s: listOfStrings) {
    int[] curCount = map.get(s);
    if (curCount == null) {
        curCount = new int[1];
        curCount[0] = 1;
        map.put(s, curCount);
    } else {
        curCount[0]++;
    }
}

You can then iterate over the map entries and do the right thing based on each string's count.
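As a rough sketch of that last step, assuming the goal is to list each repeated string once: the class name CountingDuplicates and the method findDuplicates are made up for illustration, and the counting loop is the one shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original answer.
class CountingDuplicates {

    // Returns each string that occurs more than once, listed once.
    static List<String> findDuplicates(List<String> listOfStrings) {
        // First pass: count occurrences, incrementing the int[] cell in place.
        Map<String, int[]> map = new HashMap<>();
        for (String s : listOfStrings) {
            int[] curCount = map.get(s);
            if (curCount == null) {
                map.put(s, new int[] {1});
            } else {
                curCount[0]++;
            }
        }

        // Second pass: keep the keys whose count is greater than one.
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : map.entrySet()) {
            if (entry.getValue()[0] > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("luna", "mirana", "mirana", "mirana");
        System.out.println(findDuplicates(input)); // prints [mirana]
    }
}
```

Both passes are linear in the size of the list. Note that the order of the result follows HashMap iteration order, so it is not guaranteed when there are several duplicates.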

- Rob

This is exactly what I had in mind. - Bharat

The answer at https://dev59.com/PWAf5IYBdhLWcg3wOQq2#24872936 is a nice word-count example that uses computeIfPresent to increment the count in place. - tanyehzheng
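For comparison, here is a minimal sketch of counting with the Java 8 Map API. It uses Map.merge instead of the computeIfPresent mentioned in the comment above, since merge handles both the first and the later occurrences in one call. The names MergeCounting, countOccurrences, and duplicatesOf are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; names are not from the thread.
class MergeCounting {

    // Counts how many times each string occurs in the list.
    static Map<String, Integer> countOccurrences(List<String> strings) {
        Map<String, Integer> counts = new HashMap<>();
        for (String s : strings) {
            // Stores 1 for a new key, otherwise adds 1 to the stored count.
            counts.merge(s, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the strings whose count is greater than one.
    static Set<String> duplicatesOf(List<String> strings) {
        return countOccurrences(strings).entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("luna", "mirana", "mirana", "mirana");
        System.out.println(countOccurrences(input).get("mirana")); // prints 3
        System.out.println(duplicatesOf(input));                   // prints [mirana]
    }
}
```

Unlike the int[] approach, this boxes a new Integer on each update, so it trades some allocation for shorter code.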

1

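The best data structure would be a Set<String>.

Add all the elements from the list to a set.

Iterate over the list and remove each element from the set as you go.

If an element is not found in the set, it is a duplicate in the list (because it has already been removed).

This takes O(n) + O(n) time.

Code: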

    List<String> list = new ArrayList<>();
    List<String> duplicates = new ArrayList<>();

    list.add("luna");
    list.add("mirana");
    list.add("mirana");
    list.add("mirana");

    Set<String> set = new HashSet<>();
    set.addAll(list);
    for(String a:list){
        if(set.contains(a)){
            set.remove(a);
        }else{
            duplicates.add(a);
        }
    }
    System.out.println(duplicates);

Output:

[mirana, mirana]

- nagendra547

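With a List as the collection of duplicates, you end up with multiple copies if a string appears more than twice in the original list. Making duplicates a Set fixes that, but you then need to convert the Set to a List (if you really need one). - Rob

Here we don't want to change the original list, so I make a copy. If we are allowed to change the original list, then don't store a copy; loop over the set and remove from the original list instead. Hope that's clear. - nagendra547

If you know duplicates are rare, I prefer the Set approach (over the Map in the other answer), because you don't end up with a collection containing entries for all the non-duplicates. - Rob

My comment was based on the assumption that the OP wants your example to produce [mirana] rather than listing it twice. I re-read the question, and it isn't clear to me what output is wanted in that case. Either way, that detail is easy to code correctly. - Rob

If the user only wants the unique duplicates, they can use a duplicateSet, which contains unique elements, in place of the duplicateList structure. - nagendra547

Why the downvote? - nagendra547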


- dspano · Accepted Answer

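A simple way to find duplicates is to iterate over the list and add each item to a separate temporary set using the add() method. If the item is already in the set, add() returns false.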

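public synchronized List<String> listMatching(List<String> allStrings) {
   // Every string seen so far.
   Set<String> tempSet = new HashSet<>();
   // Strings seen more than once; a Set keeps each duplicate only once.
   Set<String> duplicates = new HashSet<>();

   allStrings.forEach( item -> {
       // add() returns false when the item is already in tempSet.
       if (!tempSet.add(item)) duplicates.add(item);
   });

   // Copy into a List to match the declared return type.
   return new ArrayList<>(duplicates);
}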
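
A self-contained sketch of how this approach might be exercised. The class name DuplicateFinder is made up for the example, the method returns the Set directly, and LinkedHashSet is swapped in for HashSet (a substitution, not what the answer uses) so that duplicates come out in the order they were first repeated.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical wrapper class for illustration.
class DuplicateFinder {

    // Returns every string that occurs more than once, each listed once.
    static Set<String> listMatching(List<String> allStrings) {
        Set<String> seen = new HashSet<>();
        // LinkedHashSet keeps insertion order; a plain HashSet also works.
        Set<String> duplicates = new LinkedHashSet<>();
        for (String item : allStrings) {
            // add() returns false when the item was already seen.
            if (!seen.add(item)) {
                duplicates.add(item);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("b", "a", "b", "a", "c", "b");
        System.out.println(listMatching(input)); // prints [b, a]
    }
}
```

The list is traversed once, and each add() is expected constant time, which is consistent with the large speed-up reported in the question's edit.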