Find the indices of all duplicate elements in an array

Question


3

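Given an array of integers, find the indices of all elements that occur more than once.

For example, consider the array A = [4, 12, 9, 8, 9, 12, 7, 1]. Since 12 and 9 are duplicated, all of their indices are returned, i.e. d = [2, 3, 5, 6]. The length of A does not exceed 200, and its integer elements lie between 1 and 5000.

I am currently using the function below, but it is too slow for my requirements. Is there any way to improve its performance?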

function d = fincDuplicates(A)
    U = unique(A);
    [co,ce] = hist(A,U);
    an = ce(co>1);
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end

- impopularGuy

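The profiler shows that this function takes 60% of the total time (sorry, I can't share the whole code). The unique and hist functions are the main cause. I call hist here to get the frequency of each unique element in the array, and then look up the indices of the elements whose frequency is greater than 1. - impopularGuy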

2

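A very short but highly memory-inefficient solution, suitable for length(A) < 200: d = find(sum(A==A.')>1). - obchardon

@obchardon So far your solution is the best. It is the fastest when length(A) < 100. - impopularGuy

@obchardon, if I try your approach I keep getting the error "Error using == , Matrix dimensions must agree." Am I missing something? Does it rely on implicit expansion? - Hoki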

1

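@Hoki Yes, it relies on implicit expansion (MATLAB R2016b and later / Octave 3.6.0 and later). - obchardon

Right, I tried to mimic it with explicit expansion (I only have R2016a), but the performance was poor. I wonder whether the new implicit expansion is faster. - Hoki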
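
For releases without implicit expansion (before R2016b), the one-liner from the comments can also be written with bsxfun, which compares a row against a column in the same element-wise way. This is only a minimal sketch on the example array from the question. The variable name eqMat is introduced here for illustration, and its speed relative to the other approaches was not measured.

```matlab
A = [4, 12, 9, 8, 9, 12, 7, 1];

% eqMat(i,j) is true when A(i) == A(j); bsxfun expands the row A against
% the column A.' explicitly, which A == A.' does implicitly from R2016b on.
eqMat = bsxfun(@eq, A, A.');

% Each column sum counts how often that element occurs in A (itself included).
d = find(sum(eqMat, 1) > 1);   % d = [2 3 5 6]
```

The comparison matrix has numel(A)^2 entries, which is why the comment calls this approach memory-inefficient for long arrays.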

4 Answers

4

I'm late to the party, but this question calls for an accumarray-based solution :-)

d = find(ismember(A, find(accumarray(A(:), 1)>1)));

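This exploits the fact that A contains small positive integers, so they can be interpreted as indices. It works as follows:

                          accumarray(A(:), 1)       % count of occurrences of each value
                     find(                   >1)    % values occurring more than once
d = find(ismember(A,                            )); % their positions in A

As an alternative, sparse can be used instead of accumarray: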

d = find(ismember(A, find(sparse(A, 1, 1)>1)));

- Luis Mendo
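
To make the three nested steps of the accumarray one-liner easier to follow, it can be unrolled on the example array from the question. The intermediate names counts and dupValues are introduced here for illustration only.

```matlab
A = [4, 12, 9, 8, 9, 12, 7, 1];

counts = accumarray(A(:), 1);        % counts(v) = number of times value v occurs; length is max(A)
dupValues = find(counts > 1);        % values occurring more than once: [9; 12]
d = find(ismember(A, dupValues));    % positions of those values in A: [2 3 5 6]
```

Because the values of A are used directly as subscripts, this only works when they are positive integers, as they are in the question.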

3

Here is one solution (adapted from https://uk.mathworks.com/matlabcentral/answers/175086-finding-non-unique-values-in-an-array):

A = [4, 12, 9, 8, 9, 12, 7, 1];
[~,idxu,idxc] = unique(A);
[count, ~, idxcount] = histcounts(idxc,numel(idxu));
idxkeep = count(idxcount)>1;
idx_A = 1:length(A);
idx_dup = idx_A(idxkeep);

This gives:

>> idx_dup = idx_A(idxkeep)

idx_dup =

     2     3     5     6

I'm not sure whether it is any more efficient than your current solution. You may need to test it with your actual data.

- am304
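
If the values were not small positive integers (negative or non-integer values, say), the third output of unique could serve as the subscript for accumarray instead of A itself. This is a variation that combines ideas from the two answers above, not something either author posted, and it has not been benchmarked. The data below is made up.

```matlab
A = [0.5, -3, 7.25, -3, 0.5, 10];    % made-up data that cannot be used directly as indices

[~, ~, ic] = unique(A);              % ic(k) is the position of A(k) among the sorted unique values
counts = accumarray(ic(:), 1);       % number of occurrences of each unique value
d = reshape(find(counts(ic) > 1), 1, []);   % d = [1 2 4 5]
```

The reshape only forces a row vector, so the result has the same orientation as in the other answers.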

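My implementation took about 15 seconds, whereas yours took about 19 seconds. - impopularGuy

@impopularGuy, that's odd, I measured exactly the opposite, as you can see in my answer. - Ander Biguri

It must depend on the array size. - impopularGuy

@impopularGuy Of course it does. That's why I tested a wide range and provided the code, so that you can run the same tests. Yours is always the slowest, though. - Ander Biguri

I know mine is the slowest. That's why I posted this question. - impopularGuy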


2

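For completeness, here are the results of the other answers compared with yours, plus an attempt to speed yours up (which I was working on before better people came to your rescue). For the sizes in your question: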

for ii=1:100
    a=randi(5000,1,200);
    t1(ii)=timeit(@()yours(a));

    a=randi(5000,1,200);
    t2(ii)=timeit(@()faster(a));

    a=randi(5000,1,200);
    t3(ii)=timeit(@()hoki(a));

    a=randi(5000,1,200);
    t4(ii)=timeit(@()am304(a));
end
disp(['Faster: x', num2str(mean(t1)/mean(t2))])
disp(['hoki: x', num2str(mean(t1)/mean(t3))])
disp(['am304: x', num2str(mean(t1)/mean(t4))])

function d = yours(A)
    U = unique(A);
    [co,ce] = hist(A,U);
    an = ce(co>1);
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end

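function d = faster(A)
    [co] = histcounts(A,max(A));
    % Note: co>1 is a logical mask over the bins, not a list of values, so
    % the loop below runs max(A) times and compares A against 0 or 1.
    an = co>1;
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end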

function res=am304(A)
[~,idxu,idxc] = unique(A);
[count, ~, idxcount] = histcounts(idxc,numel(idxu));
idxkeep = count(idxcount)>1;
idx_A = 1:length(A);
res = idx_A(idxkeep);
end

function res=hoki(A)
[B,I] = sort(A) ;
dx = find(diff(B)==0) ;
res = I([dx dx+1]) ;
end

The results:

Faster: x0.0054505
hoki: x7.4142
am304: x1.0881

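My "faster" version performs terribly in this case.

As I understand it, Hoki's answer is slightly faster for large arrays but much faster for small ones, and depending on the size and range of a it is 2 to 30 times faster than am304's answer.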

- Ander Biguri

1

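Great. However, this measurement is not reliable. It would be better if we generated a new random array for each individual function call. - impopularGuy

@impopularGuy I didn't post any specific timings, so you can copy-paste the 3 lines of code yourself ;) The results are still the same. - Ander Biguri

1

@impopularGuy If you generate a new random array for each individual function call, you can't make a like-for-like comparison. To compare the performance of functions they need to operate on the same data, otherwise it is meaningless. - am304

1

@am304 It is now repeated 100 times, so hopefully the differences average out. - Ander Biguri

@am304 Of course they need to operate on the same data. I probably didn't explain myself well. - impopularGuy

@impopularGuy, am304 has a point: if you change the data, the comparison is no longer valid. He is right. - Ander Biguri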
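
The point made in these comments, that candidates should be timed on the same data while still averaging over several random inputs, could be sketched as follows. The handles fAccum and fExpand, the repetition count nRep and the choice of these two candidates are assumptions made for illustration. This is not the benchmark that produced the numbers above.

```matlab
% Two candidates written as anonymous functions (illustrative choice)
fAccum  = @(A) find(ismember(A, find(accumarray(A(:), 1) > 1)));
fExpand = @(A) find(sum(bsxfun(@eq, A, A.'), 1) > 1);

nRep = 10;                 % number of random inputs to average over
t = zeros(nRep, 2);
for ii = 1:nRep
    a = randi(5000, 1, 200);                  % one array per repetition, shared by both candidates
    assert(isequal(fAccum(a), fExpand(a)));   % both must agree before timing means anything
    t(ii, 1) = timeit(@() fAccum(a));
    t(ii, 2) = timeit(@() fExpand(a));
end
ratio = mean(t(:, 1)) / mean(t(:, 2));        % > 1 means fExpand was faster on average
disp(['accumarray / expansion time ratio: ', num2str(ratio)])
```

Checking that the outputs match inside the loop also guards against timing a function that returns the wrong answer quickly.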


- Hoki · Accepted Answer

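Edits:

1: Fixed the edge case raised in the comments, updated the benchmark.

2: Added the "expansion" solution to the benchmark (had to reduce the maximum N to 20000 elements).

3: Added the accumarray method to the benchmark (the winner for large N), as well as the sparse method.

Here is another way to get the result, without the functions unique or hist. It relies on the function sort instead.

In expanded form, if you want to follow the intermediate results: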

A = [4, 12, 9, 8, 9, 12, 7, 1] ;

[B,I] = sort(A) ;    % this will put duplicate elements side by side
df = diff(B) ;       % the duplicates will return '0' when subtracted
dx = find(df==0) ;   % find their indices

% Since each duplicate concerns 2 elements of the array, we add the next
% index for each "flagged" index, taking care not to duplicate the indices
% of successive duplicates.
if ~isempty(dx)
    dd = [diff(dx)~=1 , true] ;
    dx = [dx dx(dd)+1] ;
    d = I(dx)   % get the original position of the duplicates in original array
else
    d=[] ;
end
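
To illustrate why the dd mask is needed, consider a value that occurs three times. Simply appending dx+1 would then report one index twice. The array below is made up, and the names naive and fixed are introduced only for this comparison.

```matlab
A = [5, 5, 5, 1];                 % made-up array: the value 5 occurs three times

[B, I] = sort(A);                 % B = [1 5 5 5], I = [4 1 2 3]
dx = find(diff(B) == 0);          % dx = [2 3]

naive = I([dx, dx + 1]);          % [1 2 2 3]: index 2 is reported twice
dd = [diff(dx) ~= 1, true];       % true only at the last flag of each run of consecutive flags
fixed = I([dx, dx(dd) + 1]);      % [1 2 3]
```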

You can compress this to:

[B,I] = sort(A) ;
dx = find(diff(B)==0) ;
if ~isempty(dx)
    d = I([dx dx([diff(dx)~=1,true])+1]) ;
else
    d = [] ;
end

Which gives:

d =
     3     2     5     6

Personally I would also sort the returned indices, but if that isn't required and you care about performance, you can accept the unsorted result.
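
If sorted indices are wanted, a single call to sort on the result is enough. A minimal sketch on the example array from the question; the cost of the extra sort was not measured here.

```matlab
A = [4, 12, 9, 8, 9, 12, 7, 1];

[B, I] = sort(A);
dx = find(diff(B) == 0);
if ~isempty(dx)
    % Same expression as above, wrapped in sort to return ascending indices
    d = sort(I([dx, dx([diff(dx) ~= 1, true]) + 1]));   % d = [2 3 5 6]
else
    d = [];
end
```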

Here is another benchmark (number of elements tested from 10 to 20000):

Run on MATLAB R2016a.

The code:

function ExecTimes = benchmark_findDuplicates

nOrder = (1:9).' * 10.^(1:3) ; nOrder = [nOrder(:) ; 10000 ; 20000 ] ;
npt = numel(nOrder) ;

ExecTimes = zeros(npt,6) ;

for k = 1:npt
    % Sample data
    N = nOrder(k) ;
    A = randi(5000,[1,N]) ;

    % Benchmark
    f1 = @() findDuplicates_histMethod(A) ;
    f2 = @() findDuplicates_histcountMethod(A) ;
    f3 = @() findDuplicates_sortMethod(A) ;
    f4 = @() findDuplicates_expansionMethod(A) ;
    f5 = @() findDuplicates_accumarrayMethod(A) ;
    f6 = @() findDuplicates_sparseMethod(A) ;
    ExecTimes(k,1) = timeit( f1 ) ;
    ExecTimes(k,2) = timeit( f2 ) ;
    ExecTimes(k,3) = timeit( f3 ) ;
    ExecTimes(k,4) = timeit( f4 ) ;
    ExecTimes(k,5) = timeit( f5 ) ;
    ExecTimes(k,6) = timeit( f6 ) ;

    clear A
    disp(N)
end

function d = findDuplicates_histMethod(A)
    U = unique(A);
    [co,ce] = hist(A,U);
    an = ce(co>1);
    d=[];
    for i=1:numel(an)
        d=[d,find(A==an(i))];
    end
end

function d = findDuplicates_histcountMethod(A)
    [~,idxu,idxc] = unique(A);
    [count, ~, idxcount] = histcounts(idxc,numel(idxu));
    idxkeep = count(idxcount)>1;
    idx_A = 1:length(A);
    d = idx_A(idxkeep);
end

function d = findDuplicates_sortMethod(A)
    [B,I] = sort(A) ;
    dx = find(diff(B)==0) ;
    if ~isempty(dx)
        d = I([dx dx([diff(dx)~=1,true])+1]) ;
    else
        d=[];
    end
end

function d = findDuplicates_expansionMethod(A)
    Ae = ones(numel(A),1) * A ;
    d = find(sum(Ae==Ae.')>1) ;
end

function d = findDuplicates_accumarrayMethod(A)
    d = find(ismember(A, find(accumarray(A(:), 1)>1))) ;
end

function d = findDuplicates_sparseMethod(A)
    d = find(ismember(A, find(sparse(A, 1, 1)>1)));
end

end
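
One possible way to visualise the output of this benchmark is a log-log plot of execution time against N. The sketch below assumes the function above is saved as benchmark_findDuplicates.m on the path. The legend labels simply follow the order of f1 to f6, and nOrder is rebuilt because the function does not return it.

```matlab
% Rebuild the element counts used inside benchmark_findDuplicates
nOrder = (1:9).' * 10.^(1:3);
nOrder = [nOrder(:); 10000; 20000];

% May take a while; the expansion method needs several GB of memory at N = 20000
ExecTimes = benchmark_findDuplicates;

loglog(nOrder, ExecTimes, '-o');          % one line per column of ExecTimes
xlabel('Number of elements N');
ylabel('Execution time (s)');
legend({'hist', 'histcounts', 'sort', 'expansion', 'accumarray', 'sparse'}, ...
       'Location', 'northwest');
grid on;
```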