Ruby：比较两个哈希数组

Question

Ruby：比较两个哈希数组

3

我是一名 Ruby 新手（使用的是 1.9.1 版本），希望能得到帮助。我通过谷歌学到了关于 Ruby 的一切。我正在尝试比较两个哈希数组，但由于它们的大小，这个过程太慢了，而且很容易耗尽内存。如果有任何帮助，将不胜感激。

我有一个类（ParseCSV），其中包含多个方法（initialize、open、compare、strip、output）。目前的工作方式如下（并且已经通过我编写的测试，只是使用了更小的数据集）：


file1 = ParseCSV.new(“some_file”)
file2 = ParseCSV.new(“some_other_file”)

file1.open #this reads the file contents into an Array of Hash’s through the CSV library 
file1.strip #This is just removing extra hash’s from each array index.  So normally there are fifty hash’s in each array index, this is just done to help reduce memory consumption.  

file2.open 
file2.compare(“file1.storage”) #@storage is The array of hash’s from the open method

file2.output

现在我遇到的问题是比较方法。在处理小数据集时，这并不是什么大问题，速度足够快。然而，在这种情况下，我正在将约400,000条记录（全部读入哈希数组）与大约450,000条记录的一个文件进行比较。我正在尝试加快速度。另外，我无法对file2运行strip方法。以下是我目前的做法：


def compare(x)
    #obviously just a verbose message
    puts "Comparing and leaving behind non matching entries"

    x.each do |row|
        #@storage is the array of hashes
        @storage.each_index do |y|       
            if row[@opts[:field]] == @storage[y][@opts[:field]]
               @storage.delete_at(y)
            end
       end
    end
end

希望这样说得通。我知道这将是一个缓慢的过程，因为它必须迭代400,000行440,000次。但是您有没有其他想法来加速并可能减少内存消耗？

- BigForNothing

2个回答

1

这是一个比较两种做法的脚本：你的原始compare()和一个new_compare()。new_compare使用更多内置的Enumerable方法。由于它们是用C实现的，所以速度会更快。

我创建了一个名为Test :: SIZE的常量，以尝试在不同的哈希大小下进行基准测试。结果在底部。差异非常大。

require 'benchmark'

class Test
  SIZE = 20000
  attr_accessor :storage
  def initialize
    file1 = []
    SIZE.times { |x| file1 << { :field => x, :foo => x } }
    @storage = file1
    @opts = {}
    @opts[:field] = :field
  end

  def compare(x)
    x.each do |row|
      @storage.each_index do |y|
        if row[@opts[:field]] == @storage[y][@opts[:field]]
          @storage.delete_at(y)
        end
      end
    end
  end

  def new_compare(other)
    other_keys = other.map { |x| x[@opts[:field]] }
    @storage.reject! { |s| other_keys.include? s[@opts[:field]] }
  end

end

storage2 = []
# We'll make 10 of them match
10.times { |x| storage2 << { :field => x, :foo => x } }
# And the rest wont
(Test::SIZE-10).times { |x| storage2 << { :field => x+100000000, :foo => x} }

Benchmark.bm do |b|
  b.report("original compare") do
    t1 = Test.new
    t1.compare(storage2)
  end
end

Benchmark.bm do |b|
  b.report("new compare") do
    t1 = Test.new
    t1.new_compare(storage2)
  end
end

结果：

Test::SIZE = 500
      user     system      total        real
original compare  0.280000   0.000000   0.280000 (  0.285366)
      user     system      total        real
new compare  0.020000   0.000000   0.020000 (  0.020458)

Test::SIZE = 1000
     user     system      total        real
original compare 28.140000   0.110000  28.250000 ( 28.618907)
      user     system      total        real
new compare  1.930000   0.010000   1.940000 (  1.956868)

Test::SIZE = 5000
ruby test.rb
      user     system      total        real
original compare113.100000   0.440000 113.540000 (115.041267)
      user     system      total        real
new compare  7.680000   0.020000   7.700000 (  7.739120)

Test::SIZE = 10000
      user     system      total        real
original compare453.320000   1.760000 455.080000 (460.549246)
      user     system      total        real
new compare 30.840000   0.110000  30.950000 ( 31.226218)

- hgmnz

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- madlep · Accepted Answer

哎呀，这将会是O(n^2)的运行时间。很糟糕。

更好的选择是使用内置的Set类。

代码看起来就像:

require 'set'

file1_content = load_file_content_into_array_here("some_file")
file2_content = load_file_content_into_array_here("some_other_file")

file1_set = Set[file1_content]

unique_elements = file1_set - file2_content

假设文件本身具有独特内容。在一般情况下应该能够工作，但根据您的数据长相和解析方式可能会有怪异的问题，但只要可以使用 == 比较行，它应该能够帮助您。

使用 set 将比循环迭代文件内容快得多。

(是的，我确实使用这种方法处理了大约200万行的文件，所以它应该能够处理您的情况 - 最终。如果您正在进行大量的数据操作，Ruby 可能不是最好的工具选择)