我可以很容易地使用.uniq
在数组中删除重复项,但如果不使用.uniq
方法,该怎么做呢?
a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]
class Array
def my_uniq
self | []
end
end
a.my_uniq
#=> [1, 2, 4, 3, 5, 6]
Array#uniq
的基准测试。require 'fruity'
require 'set'
def doit(n, m)
arr = n.times.to_a
arr = m.times.map { arr.sample }
compare do
uniq { arr.uniq }
Schwern { uniq = []; arr.sort.each { |e| uniq.push(e) if e != uniq[-1]; uniq } }
Sharma {b = []; arr.each{ |aa| b << aa unless b.include?(aa) }; b }
Mihael { arr.to_set.to_a }
sawa { arr.group_by(&:itself).keys }
Cary { arr | [] }
end
end
doit(1_000, 500)
# Schwern is faster than uniq by 19.999999999999996% ± 10.0% (results differ)
# uniq is similar to Cary
# Cary is faster than Mihael by 10.000000000000009% ± 10.0%
# Mihael is similar to sawa
# sawa is faster than Sharma by 5x ± 0.1
doit(100_000, 50_000)
# Schwern is faster than uniq by 50.0% ± 10.0% (results differ)
# uniq is similar to Cary
# Cary is similar to Mihael
# Mihael is faster than sawa by 10.000000000000009% ± 10.0%
# sawa is faster than Sharma by 310x ± 10.0
"Schwern"和"uniq"返回包含相同元素但顺序不同的数组(因此"结果不同")。
这是@Schern要求的额外基准测试。
def doit1(n)
arr = n.times.map { rand(n/10) }
compare do
uniq { arr.uniq }
Schwern { uniq = []; arr.sort.each { |e| uniq.push(e) if e != uniq[-1]; uniq } }
Sharma {b = []; arr.each{ |aa| b << aa unless b.include?(aa) }; b }
Mihael { arr.to_set.to_a }
sawa { arr.group_by(&:itself).keys }
Cary { arr | [] }
end
end
doit1(1_000)
# Cary is similar to uniq
# uniq is faster than sawa by 3x ± 1.0
# sawa is similar to Schwern (results differ)
# Schwern is similar to Mihael (results differ)
# Mihael is faster than Sharma by 2x ± 0.1
doit1(50_000)
# Cary is similar to uniq
# uniq is faster than Schwern by 2x ± 1.0 (results differ)
# Schwern is similar to Mihael (results differ)
# Mihael is similar to sawa
# sawa is faster than Sharma by 62x ± 10.0
Array#uniq
相同。Array#|
将第一个数组放入哈希表中,循环遍历第二个数组的元素,如果它们不存在,则添加它们,然后返回哈希表的键。由于第二个数组为空,这只是将self
放入哈希表并返回键,就像Array#uniq
一样。 - Schwernarr = m.times.map { rand(m/10) }
。 - Schwernif (RARRAY_LEN(ary) <= 1)
return rb_ary_dup(ary);
if (rb_block_given_p()) {
hash = ary_make_hash_by(ary);
uniq = rb_hash_values(hash);
}
else {
hash = ary_make_hash(ary);
uniq = rb_hash_values(hash);
}
set
将把方法 to_set
添加到所有 Enumerable 对象中,包括数组。然而,Set 通常被实现为一个哈希表,因此你正在做同样的事情。如果你想要一个唯一的数组,并且不需要元素被排序,那么你应该创建一个集合并使用它。 unique = array.to_set
或者,对数组进行排序并循环遍历,将每个元素推入新的数组中。如果新数组上的最后一个元素与当前元素匹配,则将其丢弃。
array = [2, 3, 4, 5, 1, 2, 4, 5];
uniq = []
# This copies the whole array and the duplicates, wasting
# memory. And sort is O(nlogn).
array.sort.each { |e|
uniq.push(e) if e != uniq[-1]
}
[1, 2, 3, 4, 5]
puts uniq.inspect
sort!
就地排序以更改原始数据,否则它还要求您复制整个数组及其重复项。require 'set'; a = 0.upto(100_000).to_a; b = 50_000.times.map { a.sample }; b.to_set.to_a == b.uniq #=> true
. - Cary SwovelandSet#to_a
文档中写道:“元素的顺序是不确定的。”至于排序技术的速度,它的时间复杂度为O(nlogn)。虽然在处理小数组时可能比哈希表实现更快,但随着数组增长,其性能迅速下降。在10个元素时,Array#uniq
和排序的速度相同。在100个元素时,排序已经慢了3倍。在1000个元素时,它慢了6倍。使用集合也很慢。http://pastebin.com/2QaLc6Jn - SchwernSet
而不是 Hash
,它就不会保留顺序,或者至少不能保证,所以不要使用 Set。Hashes 保证保留顺序,这不是一个记录的怪癖,而是一个记录的保证。Sets 不是,如果它们碰巧这样做,那么它确实是一个“怪癖”,不应该依赖它。 - jrochkind#to_set
。在此处阅读更多相关内容:这里。Set.new [1,2,3,3,3,4,4]
,这会返回一个包含唯一对象的集合 Set
。 - Ramanarray.group_by(&:itself).keys
......................
a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]
b = []
a.each{ |aa| b << aa unless b.include?(aa) }
# when you check b you will get following result.
[1, 2, 4, 3, 5, 6]
a = [1, 1, 1, 2, 4, 3, 4, 3, 2, 5, 5, 6]
b = a & a
# OR
b = a | a
# both will return following result
[1, 2, 4, 3, 5, 6]
b.include?(aa)
。每次循环都必须潜在地搜索整个 b
。内部有 O(n) 操作的循环是 O(n^2)。这就是为什么最好将 b
设为哈希表。哈希查找是 O(1)
,并且它们不能有重复项。经验法则:每当编写 array.include?
时,请问自己“我能否使用哈希表?” - Schwerninclude?
确实是浪费的。你应该问自己其中一个问题。一个由 @Schwern 给出。另一个是:“我应该创建一个集合然后将其转换为数组吗?” - Cary SwovelandO(n)
解决方案。如果你只是要丢弃set,我会使用哈希。如果你要保留唯一的数组,我不会费心去处理数组,而是制作和使用一个set。 - Schwern