Removing duplicates from a Ruby array

Question

arrays ruby

3

I can easily remove duplicates from an array with .uniq, but how would I do it without using the .uniq method?

- Dylan Bailey

Could you add why you want to know? Just curious? - Schwern

1

Hi! I'm currently learning Ruby and web development, so I've been working through some challenges, and this one really stumped me. - Dylan Bailey

It doesn't remove unique values; it keeps them. - sawa

Why the rush to select an answer? - Cary Swoveland

5 Answers

4

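The source for most Ruby methods can be found in the ruby-doc.org API documentation. If you hover over a method's documentation, a "click to toggle source" button appears. The code is written in C, but it is easy to follow.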

if (RARRAY_LEN(ary) <= 1)
    return rb_ary_dup(ary);

if (rb_block_given_p()) {
    hash = ary_make_hash_by(ary);
    uniq = rb_hash_values(hash);
}
else {
    hash = ary_make_hash(ary);
    uniq = rb_hash_values(hash);
}

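If the array has at most one element, a copy of it is returned. Otherwise the elements are turned into hash keys and the hash is turned back into an array. Thanks to a documented quirk of Ruby hashes, "Hashes enumerate their values in the order that the corresponding keys were inserted," this technique preserves the original order of the elements in the array. That may not be true in other languages.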
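The same idea can be sketched in plain Ruby. This is only a rough illustration of the hash-key technique described above, not a translation of the C source; the method name uniq_via_hash is made up for this example.

```ruby
# Hypothetical helper (not part of Ruby): de-duplicate through hash keys.
def uniq_via_hash(array)
  seen = {}
  # Assigning to a key that already exists overwrites it, so each
  # distinct element ends up as exactly one key.
  array.each { |element| seen[element] = true }
  # Hash keys are enumerated in first-insertion order.
  seen.keys
end

p uniq_via_hash([2, 3, 4, 5, 1, 2, 4, 5])  # prints [2, 3, 4, 5, 1]
```

Note that hash keys are compared with eql?, so 1 and 1.0 count as different elements here.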

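Alternatively, use a Set. A set never contains duplicates. Loading set adds the method to_set to all Enumerable objects, including arrays. However, a Set is usually implemented as a hash, so you are doing the same thing. If you want a unique array and don't need the elements to be ordered, you should probably create a set and use that instead: unique = array.to_set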
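As a small sketch of keeping the Set around instead of converting back to an array (the helper name unique_set and the sample data are invented for illustration):

```ruby
require 'set'

# Hypothetical helper: build the set once and keep using it.
def unique_set(array)
  array.to_set
end

tags = unique_set(%w[ruby arrays ruby])
tags << 'arrays'           # adding an existing member changes nothing
p tags.size                # prints 2
p tags.include?('ruby')    # prints true
```

Membership checks on the set are hash lookups, so there is no need to go back to an array just to ask whether an element is present.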

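Alternatively, sort the array and loop over it, pushing each element onto a new array. If the last element of the new array matches the current element, discard it.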

array = [2, 3, 4, 5, 1, 2, 4, 5]
uniq = []

# This copies the whole array and the duplicates, wasting
# memory.  And sort is O(nlogn).
array.sort.each { |e|
  uniq.push(e) if e != uniq[-1]
}

puts uniq.inspect
# prints [1, 2, 3, 4, 5]

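This approach should be avoided because it is slower and uses more memory than the others. The sort is what makes it slow. Sorting is O(n log n), which means that as the array grows, the sorting time grows faster than the array does. Unless you sort in place with sort!, which alters the original data, it also requires copying the whole array along with its duplicates.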

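The other methods are O(n) in both time and memory, meaning they scale linearly as the array grows. They also don't have to copy the duplicates, which can save a lot of memory.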

- Schwern

1

Thanks for the detailed explanation! Very helpful :) - Dylan Bailey

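I meant to limit my comment to the approach that first sorts the array (which, by the way, is very fast). I believe using a set preserves order, for the same reason a hash does. For example, require 'set'; a = 0.upto(100_000).to_a; b = 50_000.times.map { a.sample }; b.to_set.to_a == b.uniq #=> true. - Cary Swoveland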

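@CarySwoveland I didn't realize Ruby hashes preserved order until your comment. As for set ordering, it should not be relied on. The Set#to_a documentation says, "The order of elements is uncertain." As for the speed of the sorting technique, it is O(n log n). It may be faster than the hash implementation for small arrays, but its performance degrades quickly as the array grows. At 10 elements, Array#uniq and sorting run at the same speed. At 100 elements, sorting is already 3 times slower. At 1000 elements, it is 6 times slower. Using a set is slow as well. http://pastebin.com/2QaLc6Jn - Schwern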

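@CarySwoveland That doesn't match the big-O analysis or my benchmarks. Could you post your benchmark? Are you including the sort, or assuming the array is already sorted? - Schwern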

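It's not a "documented quirk"; it's an intentional part of the Ruby Hash specification, in part because it lets you conveniently do things like this. If you use a Set instead of a Hash, it won't preserve order, or at least isn't guaranteed to, so don't use a Set. Hashes are guaranteed to preserve order; that's not a documented quirk, it's a documented guarantee. Sets are not, and if they happen to preserve order, that really is a "quirk" that shouldn't be relied on. - jrochkind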

3

You can use #to_set. Read more about it here.

- Mihail Petkov

Good suggestion, but why not post a complete answer? - Cary Swoveland

Something similar can be done with Set.new [1,2,3,3,3,4,4], which returns a Set containing the unique objects. - Raman

2

array.group_by(&:itself).keys

......................
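To see why this one-liner works, it may help to look at the intermediate hash. The sketch below assumes Ruby 2.2 or later (for Object#itself); the helper name groups_of is invented for illustration.

```ruby
# Hypothetical helper: group equal elements together.
# group_by builds a hash whose keys are the distinct elements and
# whose values are arrays holding every occurrence of that element.
def groups_of(array)
  array.group_by(&:itself)
end

groups = groups_of([1, 1, 2, 1])
p groups[1]     # prints [1, 1, 1]
p groups.keys   # prints [1, 2]
```

Because the keys are the distinct elements in first-seen order, taking .keys gives the de-duplicated array. The grouped values are built and then thrown away, which is extra work compared with a plain hash.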

- sawa

How would you do it without using a hash? - Damiano Stoffie

@DamianoStoffie I added a method to my answer that doesn't use a hash, but it is slower and uses more memory. - Schwern

1

You can also try this; see the example below.

a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]

b = []

a.each{ |aa| b << aa unless b.include?(aa) }

# when you check b you will get following result.

[1, 2, 4, 3, 5, 6]

Alternatively, you can try the following.

a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]

b = a & a

# OR

b = a | a

# both will return following result

[1, 2, 4, 3, 5, 6]

- Amit Sharma

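This is very slow, O(n^2), which means that if the array size doubles, the run time quadruples. The problem is b.include?(aa). Each time through the loop it potentially has to search all of b. A loop with an O(n) operation inside it is O(n^2). That's why it's better to make b a hash. Hash lookups are O(1), and hashes can't have duplicate keys. Rule of thumb: whenever you write array.include?, ask yourself "can I use a hash instead?" - Schwern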
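A minimal sketch of that rule of thumb applied to the loop in this answer, swapping the b.include?(aa) scan for a hash lookup. The method name uniq_with_seen_hash is hypothetical.

```ruby
# Hypothetical rewrite of the include?-based loop using a hash
# to remember which elements have already been emitted.
def uniq_with_seen_hash(array)
  seen = {}
  result = []
  array.each do |element|
    next if seen.key?(element)  # average O(1), unlike Array#include?
    seen[element] = true
    result << element
  end
  result
end

a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]
p uniq_with_seen_hash(a)  # prints [1, 2, 4, 3, 5, 6]
```

The loop body now does constant work per element on average, so the whole pass is O(n) rather than O(n^2).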

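Yes, calling include? repeatedly is indeed wasteful. You should ask yourself one of two questions. One is given by @Schwern. The other is: "Should I create a set and then convert it to an array?" - Cary Swoveland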

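@CarySwoveland Ruby's Set is an interface wrapped around a hash. Skipping the overhead of the Set interface and using a hash directly will probably be faster, but both are O(n) solutions. If you are just going to throw the set away, I'd use a hash. If you are going to keep the unique array around, I wouldn't bother with the array at all; I'd make a set and use that. - Schwern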

- Cary Swoveland · Accepted Answer

a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]

class Array
  def my_uniq
    self | []
  end
end

a.my_uniq
  #=> [1, 2, 4, 3, 5, 6]

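This uses the method Array#|: "Set Union - Returns a new array by joining ary with other_ary, excluding any duplicates and preserving the order from the original array."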
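For context, here is a short sketch of how Array#| behaves with two non-empty arrays and why a union with [] amounts to de-duplication. The helper name union_demo is invented for this example.

```ruby
# Hypothetical helper that just wraps Array#| so the calls are easy to compare.
def union_demo(left, right)
  left | right
end

p union_demo([1, 1, 2], [2, 3, 3])  # prints [1, 2, 3]
p union_demo([3, 1, 3], [])         # prints [3, 1]
```

In the second call the right-hand side contributes nothing, so the result is just the left array with its duplicates dropped, in first-seen order.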

Here is a benchmark of the various answers along with Array#uniq.

require 'fruity'
require 'set'

def doit(n, m)
  arr = n.times.to_a
  arr = m.times.map { arr.sample }
  compare do
    uniq     { arr.uniq } 
    Schwern  { uniq = []; arr.sort.each { |e| uniq.push(e) if e != uniq[-1] }; uniq }
    Sharma   {b = []; arr.each{ |aa| b << aa unless b.include?(aa) }; b }
    Mihael   { arr.to_set.to_a }
    sawa     { arr.group_by(&:itself).keys }
    Cary     { arr | [] }
  end
end

doit(1_000, 500)
# Schwern is faster than uniq by 19.999999999999996% ± 10.0% (results differ)
# uniq is similar to Cary
# Cary is faster than Mihael by 10.000000000000009% ± 10.0%
# Mihael is similar to sawa
# sawa is faster than Sharma by 5x ± 0.1

doit(100_000, 50_000)
# Schwern is faster than uniq by 50.0% ± 10.0%               (results differ)
# uniq is similar to Cary
# Cary is similar to Mihael
# Mihael is faster than sawa by 10.000000000000009% ± 10.0%
# sawa is faster than Sharma by 310x ± 10.0

"Schwern" and "uniq" return arrays containing the same elements in a different order (hence "results differ").

Here is the additional benchmark that @Schwern asked for.

def doit1(n)
  arr = n.times.map { rand(n/10) }
  compare do
    uniq     { arr.uniq } 
    Schwern  { uniq = []; arr.sort.each { |e| uniq.push(e) if e != uniq[-1] }; uniq }
    Sharma   {b = []; arr.each{ |aa| b << aa unless b.include?(aa) }; b }
    Mihael   { arr.to_set.to_a }
    sawa     { arr.group_by(&:itself).keys }
    Cary     { arr | [] }
  end
end

doit1(1_000)
# Cary is similar to uniq
# uniq is faster than sawa by 3x ± 1.0
# sawa is similar to Schwern                     (results differ)
# Schwern is similar to Mihael                   (results differ)
# Mihael is faster than Sharma by 2x ± 0.1

doit1(50_000)
# Cary is similar to uniq
# uniq is faster than Schwern by 2x ± 1.0        (results differ)
# Schwern is similar to Mihael                   (results differ)
# Mihael is similar to sawa
# sawa is faster than Sharma by 62x ± 10.0