在Ruby中，大型数组操作非常缓慢。

Question

在Ruby中，大型数组操作非常缓慢。

rubyperformancejruby

10

我有如下场景：

我需要找出一个非常大的集合中唯一的id列表。

例如，我有6000个id数组（关注者列表），每个数组的大小可以在1到25000之间（他们的关注者列表）。

我想获取所有这些id数组中唯一的id列表（关注者的关注者的唯一列表）。完成后，我需要减去另一个id列表（另一个人的关注者列表），并得到最终计数。

最终唯一id集合增长到约为60,000,000条记录。在Ruby中将数组添加到大数组时，在几百万时它开始变得非常缓慢。添加到集合需要开始花费0.1秒，然后在200万时需要超过4秒（远远不足以满足我的需求）。

我在Java中编写了一个测试程序，可以在一分钟内完成整个过程。

也许我在Ruby中实现不够高效，或者还有其他方法。由于我的主要代码是专有的，我编写了一个简单的测试程序来模拟这个问题：

big_array = []
loop_counter = 0
start_time = Time.now
# final target size of the big array
while big_array.length < 60000000
 loop_counter+=1
 # target size of one persons follower list
 random_size_of_followers = rand(5000)
 follower_list = []
 follower_counter = 0
   while follower_counter < random_size_of_followers
     follower_counter+=1
     # make ids very large so we get good spread and only some amt of dupes
     follower_id = rand(240000000) + 100000
     follower_list << follower_id
   end
 # combine the big list with this list
 big_array = big_array | follower_list
 end_time = Time.now

 # every 100 iterations check where we are and how long each loop and combine takes.
 if loop_counter % 100 == 0
   elapsed_time = end_time - start_time
   average_time = elapsed_time.to_f/loop_counter.to_f
   puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
   start_time = Time.now
 end
end

有什么建议吗？是时候转向jruby并将这样的东西移到Java上了吗？

- Joelio

只是想指出你在计时部分有 loop_counter=0。虽然与建议的哈希方法相比，数组访问方法要慢得多，但循环时间实际上并没有增长得那么快。在200万条记录的情况下，循环时间会从初始的0.09秒增长到约0.27秒，增长了三倍。 - Eric Hu

Ruby其实非常快，只是你的方法不对。这真的是一个数据库的使用案例，而不是在任何语言中进行内存数组操作。一个好的DBM可以快速找到不同的值和关联，所有这些都在查询离开数据库之前完成。我会推荐Sequel作为一个很棒的数据库ORM，它将使维护和查询更容易。 - the Tin Man

2个回答

2

以下是处理具有唯一性对象的数组、哈希表和集合的示例:

require 'benchmark'
require 'set'
require 'random_token'

n = 10000

Benchmark.bm(7) do |x|
  x.report("array:") do
    created_tokens = []
    while created_tokens.size < n
      token = RandomToken.gen(10)
      if created_tokens.include?(token)
        next
      else
        created_tokens << token
      end
    end
    results = created_tokens
  end

  x.report("hash:") do
    created_tokens_hash = {}
    while created_tokens_hash.size < n
      token = RandomToken.gen(10)
      created_tokens_hash[token] = true
    end
    results = created_tokens_hash.keys
  end

  x.report("set:") do
    created_tokens_set = Set.new
    while created_tokens_set.size < n
      token = RandomToken.gen(10)
      created_tokens_set << token
    end
    results = created_tokens_set.to_a
  end
end

以及它们的基准：

              user     system      total        real
array:    8.860000   0.050000   8.910000 (  9.112402)
hash:     2.030000   0.010000   2.040000 (  2.062945)
set:      2.000000   0.000000   2.000000 (  2.037125)

参考文献：

Ruby处理唯一对象

- Sibevin Wang

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- tadman · Accepted Answer

你使用的方法非常低效，所以这个过程很慢并不令人惊讶。当你尝试跟踪唯一的事物时，数组需要比哈希表更多的处理。

这是一个简单的重构，可以将速度提高约100倍：

all_followers = { }
loop_counter = 0
start_time = Time.now

while (all_followers.length < 60000000)
  # target size of one persons follower list
  follower_list = []

  rand(5000).times do
    follower_id = rand(240000000) + 100000
    follower_list << follower_id
    all_followers[follower_id] = true
  end

 end_time = Time.now

 # every 100 iterations check where we are and how long each loop and combine takes.
 loop_counter += 1

  if (loop_counter % 100 == 0)
    elapsed_time = end_time - start_time
    average_time = elapsed_time.to_f/loop_counter.to_f
    puts "average time for loop is #{average_time}, total size of all_followers is #{all_followers.length}"
    start_time = Time.now
  end
end

Hash的好处在于它不可能有重复项。如果您需要随时列出所有关注者，请使用all_followers.keys获取ID。

与其等效的数组相比，哈希占用更多内存，但这是您为了性能而付出的代价。我还怀疑这里的一个大内存消耗者是生成许多单独的关注者列表，而似乎从未使用过，因此您可以完全跳过该步骤。

关键在于数组|运算符不太高效，尤其是在操作非常大的数组时。