How to efficiently remove the elements of one vector from another vector?

Question

7

What is the best way to remove from one vector all the elements that appear in another vector?

I have come up with the following code:

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

void remove_elements(vector<int>& vDestination, const vector<int>& vSource) 
{
    if(!vDestination.empty() && !vSource.empty())
    {
        for(auto i: vSource) {
            vDestination.erase(std::remove(vDestination.begin(), vDestination.end(), i), vDestination.end());
        }
    }
}

int main() 
{
    vector<int> v1={1,2,3};
    vector<int> v2={4,5,6};
    vector<int> v3={1,2,3,4,5,6,7,8,9};
    remove_elements(v3,v1);
    remove_elements(v3,v2);
    for(auto i:v3)
        cout << i << endl;
    return 0;
}

The output is:

7
8
9

- user3898160

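Assuming the vectors are large enough to make optimization worthwhile, I would first convert vDestination to a std::list (or a list of smart pointers?) to avoid expensive erasures from the std::vector (since it has to stay contiguous), and convert it back to a std::vector at the end. - slawekwin

@slawekwin: I strongly doubt this would be faster for vectors of any practical size. Lists need an extra level of indirection and are not cache-friendly. (I know that erasing an element is O(1) for a list and O(N) for a vector.) - Frank Puffer

@Frank I am not sure either, but I believe measuring it would be an interesting experiment :) - slawekwin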

1

Are all the vectors sorted? - Bob__

2

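It might be worth trying whether first building an unordered_set (from vSource) helps. In theory it should change the worst case from N * M to N + M (with N the size of vDestination and M the size of vSource: each vDestination element needs only one lookup instead of up to M lookups, plus the cost of building the set), but this depends on the bucket configuration. Also, you have not said what should happen with duplicates: should they be removed once, or all of them? - Tomasz Lewowski

@Bob__: The vectors are not sorted in general, although they happen to be sorted in this example. - user3898160

4 Answers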

3

If your vectors are always sorted, you can use set_difference:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>

void remove_elements(std::vector<int>& vDestination, const std::vector<int>& vSource) 
{
    std::vector<int> result;
    std::set_difference(vDestination.begin(), vDestination.end(), vSource.begin(), vSource.end(), std::back_inserter(result));
    vDestination.swap(result);
}

int main() 
{
    std::vector<int> v1={1,2,3};
    std::vector<int> v2={4,5,6};
    std::vector<int> v3={1,2,3,4,5,6,7,8,9};
    remove_elements(v3,v1);
    remove_elements(v3,v2);
    for(auto i:v3)
        std::cout << i << '\n';
}

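Were it not for the requirement that the output range must not overlap either input range, we could even avoid the additional vector. You could potentially write your own version of set_difference that is allowed to write to the range starting at vDestination.begin(), but that is beyond the scope of this answer.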
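For illustration only, here is a minimal sketch of what such a hand-rolled, in-place variant might look like. The name remove_sorted_in_place is hypothetical and does not come from the answer; the sketch assumes both vectors are sorted in ascending order and follows the same one-for-one duplicate semantics as std::set_difference.

```cpp
#include <vector>

// Hypothetical helper (not a standard algorithm): removes from the sorted
// vector `dest` one occurrence for each matching element of the sorted vector
// `src`, like std::set_difference, but writes the result into `dest` itself.
void remove_sorted_in_place(std::vector<int>& dest, const std::vector<int>& src)
{
    auto out = dest.begin();   // next position to write a kept element
    auto in = dest.begin();    // next element of dest to examine
    auto s = src.begin();      // next element of src to match against

    while (in != dest.end() && s != src.end()) {
        if (*in < *s) {
            *out++ = *in++;    // *in cannot occur in the rest of src: keep it
        } else if (*s < *in) {
            ++s;               // *s does not occur in dest: skip it
        } else {
            ++in;              // match: drop one element of dest
            ++s;
        }
    }
    // Whatever is left in dest is larger than every element of src: keep it.
    while (in != dest.end()) *out++ = *in++;
    dest.erase(out, dest.end());
}
```

Because the write position never overtakes the read position, no element is overwritten before it has been examined, which is why a separate result vector is not needed here.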

- Revolver_Ocelot

3

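I assume that by "best" you mean the fastest algorithm that works. Since this is a question about efficiency, I ran a simple benchmark comparing several algorithms. Note that they differ somewhat, because the question is a bit underspecified. The open questions (and the assumptions I made for the benchmark) are:

Is vDestination guaranteed to contain all the elements of vSource? (assumption: no)
Are duplicates allowed in vDestination or vSource? (assumption: yes, in both)
Does the order of the elements in the resulting vector matter? (algorithms for both cases were tested)
Should every element of vDestination that equals any element of vSource be removed, or only one-for-one? (assumption: yes, in both vectors)
Are the sizes of vDestination and vSource bounded? Is one of them always larger, or much larger? (several cases were tested)
The comments already explain that the vectors need not be sorted, but I include this point because it is not obvious from the question (assumption: neither vector is sorted)

As you can see, the algorithms differ on several points, so, as you might guess, the best algorithm depends on your use case. The algorithms compared are:

The original algorithm (from the question), used as the baseline
The algorithm proposed in @dkg's answer
The algorithm proposed in @Revolver_Ocelot's answer, plus the additional sorting the algorithm requires and reserving space for the result vector
The algorithm proposed in @Jarod42's answer
A set-based algorithm (presented below; mostly an optimization of @Jarod42's algorithm)
A counting algorithm (presented below)

Set-based algorithm: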

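// `source` and `destination` are the two vectors.
// The order of the remaining elements is not preserved.
std::unordered_set<int> elems(source.begin(), source.end());
auto i = destination.begin();
auto target = destination.end();
while(i != target) {   // stop once the unchecked range [i, target) is empty
    if(elems.count(*i) > 0)
        std::swap(*i, *(--target));   // move the match to the tail, then re-check position i
    else
        i++;
}
destination.erase(target, destination.end());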
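The swap-based loop above gives up the original ordering. If the order has to be kept, one possible variant (a sketch, not one of the benchmarked algorithms) combines the same hash-set lookup with the erase-remove idiom. The function name remove_elements_stable is hypothetical.

```cpp
#include <algorithm>
#include <unordered_set>
#include <vector>

// Hypothetical name. Removes every element of `destination` that occurs in
// `source`; the surviving elements keep their original relative order.
void remove_elements_stable(std::vector<int>& destination,
                            const std::vector<int>& source)
{
    // Build the lookup set once, so each destination element costs one lookup.
    const std::unordered_set<int> elems(source.begin(), source.end());
    destination.erase(
        std::remove_if(destination.begin(), destination.end(),
                       [&elems](int e) { return elems.count(e) > 0; }),
        destination.end());
}
```

For example, removing {1, 2} from {5, 1, 4, 1, 3, 2} leaves {5, 4, 3}. Whether this is faster or slower than the swap-based loop on a given input would need to be measured.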

Counting algorithm:

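std::unordered_map<int, int> counts;
counts.max_load_factor(0.3);
counts.reserve(destination.size());

// Count how many times each value occurs in destination...
for(auto v: destination) {
    counts[v]++;
}

// ...and cancel one occurrence for each element of source.
for(auto v: source) {
    counts[v]--;
}

// Rewrite destination with the surviving occurrences (order is not preserved).
auto i = destination.begin();
for(const auto& k: counts) {
    if(k.second < 1) continue;
    i = std::fill_n(i, k.second, k.first);
}
destination.resize(std::distance(destination.begin(), i));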

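The benchmark was run with the Celero library and proceeded as follows:

Generate n pseudo-random ints (n taken from {10, 100, 1000, 10000, 20000, 200000}) and put them in a vector.
Copy a fraction m of those ints into a second vector (fractions taken from {0.01, 0.1, 0.2, 0.4, 0.6, 0.8}, at least one element).
Start the timer.
Run the removal procedure.
Stop the timer.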
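The steps above can be approximated without Celero using only the standard library. The sketch below is not the author's benchmark code: make_input and time_removal are hypothetical helpers, the source vector is simply a copy of the first elements of the destination (the original may have selected them differently), and a single timed run is far less rigorous than a real benchmarking framework.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: builds a destination vector of n pseudo-random ints and
// a source vector holding a copy of about n * fraction of them (at least one).
std::pair<std::vector<int>, std::vector<int>>
make_input(std::size_t n, double fraction, unsigned seed)
{
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist;   // defaults to [0, INT_MAX]
    std::vector<int> destination(n);
    for (auto& x : destination) x = dist(gen);

    // Assumption: take the first m elements as the ones to remove.
    const std::size_t m =
        std::max<std::size_t>(1, static_cast<std::size_t>(n * fraction));
    const auto count = static_cast<std::ptrdiff_t>(std::min(m, n));
    std::vector<int> source(destination.begin(), destination.begin() + count);
    return {std::move(destination), std::move(source)};
}

// Times one call of removal(destination, source); returns microseconds.
// destination is taken by value so every run starts from fresh data.
template <class Removal>
long long time_removal(Removal removal, std::vector<int> destination,
                       const std::vector<int>& source)
{
    const auto start = std::chrono::steady_clock::now();   // start the timer
    removal(destination, source);                          // run the removal
    const auto stop = std::chrono::steady_clock::now();    // stop the timer
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
        .count();
}
```

A call such as time_removal(remove_elements, input.first, input.second) would time one run of any of the remove_elements functions in this thread; repeating the call and averaging gives steadier numbers.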

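For data sets with more than 10,000 elements I ran only algorithms 3, 5 and 6, because the others took too long. Feel free to test them yourself.

In short: if your vectors contain fewer than 1,000 elements, pick whichever method you like. If they are longer, decide based on the size of vSource. If it is less than 50% of vDestination, choose the set-based algorithm; if it is more, sort both vectors and choose @Revolver_Ocelot's solution (the two tie at around 60%, and the set-based algorithm is about 2x faster when vSource is 1% of the size of vDestination). Either do not rely on the order, or provide vectors that are sorted from the start: requiring the order to be preserved slows things down dramatically. Benchmark on your own use case, compiler, flags and hardware. I have attached a link to my benchmarks so that you can reproduce them.

The complete results (file vector-benchmarks.csv) are available on GitHub, together with the benchmark code (file tests/benchmarks/vectorRemoval.cpp).

Keep in mind that these are the results I got on my computer, with my compiler and so on; yours will differ (especially the point at which one algorithm becomes better than another). I used GCC 6.1.1 with -O3 on Fedora 24.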

- Tomasz Lewowski

1

With the STL, it can be written as follows:

void remove_elements(vector<int>& vDestination, const vector<int>& vSource) 
{
    const auto isInSource = [&](int e) {
        return std::find(vSource.begin(), vSource.end(), e) != vSource.end();
    };
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(), isInSource),
        vDestination.end());
}

If vSource is sorted, you can replace std::find with std::binary_search.
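A minimal sketch of that substitution, assuming the second vector is sorted in ascending order (vDestination may be in any order). The function and parameter names are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical name. Assumes sortedSource is sorted in ascending order, so
// each membership test is a binary search instead of a linear scan.
void remove_elements_sorted_source(std::vector<int>& vDestination,
                                   const std::vector<int>& sortedSource)
{
    const auto isInSource = [&](int e) {
        return std::binary_search(sortedSource.begin(), sortedSource.end(), e);
    };
    vDestination.erase(
        std::remove_if(vDestination.begin(), vDestination.end(), isInSource),
        vDestination.end());
}
```

If the source is not sorted, std::binary_search can silently give wrong answers, so the caller has to guarantee the ordering (or sort a copy first).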

- Jarod42

This solution is also in the first comment on @dkg's answer. - Hayt

@Hayt: Indeed, I did not follow the link. - Jarod42

- dkg · Accepted Answer

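Here is my version. I call erase only once, after std::remove has moved all the elements that come from vSource to the end, and I keep track of the new logical end of vDestination so that I do not iterate over the already removed part needlessly.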

void remove_elements(vector<int>& vDestination, const vector<int>& vSource) 
{
    auto last = std::end(vDestination);
    std::for_each(std::begin(vSource), std::end(vSource), [&](const int & val) {
        last = std::remove(std::begin(vDestination), last, val);
    });
    vDestination.erase(last, std::end(vDestination));
}

See it on coliru: http://coliru.stacked-crooked.com/a/6e86893babb6759c

Update

Here is a templated version, so that you do not have to care about the container type:

template <class ContainerA, class ContainerB>
void remove_elements(ContainerA & vDestination, const ContainerB & vSource) 
{
    auto last = std::end(vDestination);
    std::for_each(std::begin(vSource), std::end(vSource), [&](typename ContainerB::const_reference val) {
        last = std::remove(std::begin(vDestination), last, val);
    });
    vDestination.erase(last, std::end(vDestination));
}

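Note

This version works for vectors without any constraints. If your vectors are sorted, you can take some shortcuts and avoid iterating over the vector again and again to remove each element.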