Ruby - 如何高效地找到两个非常大的数组之间的差异？

Question

Ruby - 如何高效地找到两个非常大的数组之间的差异？

6

我有一个与效率和算法有关的问题，涉及在找到两个非常大的数组之间的差异时。我希望理解算法的人能指导我如何解决这个问题，因为我的当前实现需要非常长时间才能完成。

问题：

我有两个非常大的数组。其中一个包含无效域名的电子邮件列表，另一个是混合列表，我需要对其进行检查以确定是否存在第一个数组中的内容。

accounts_with_failed_email_domains = [279,000 records in here]

unchecked_account_domains = [149,000 records in here]

我需要做的是遍历unchecked_account_domains列表，然后比较每个条目以查看是否与accounts_with_failed_email_domains中的任何一个匹配。我需要将所有匹配项插入到单独的数组中以便稍后处理。

我应该如何高效地编写代码以快速检查这些帐户？以下是我迄今为止尝试过的方法。

unchecked_account_domains = [really big array]
unchecked_account_domains = unchecked_account_domains.sort

accounts_with_failed_email_domains = [another huge array].sort

unchecked_account_domains.keep_if do |email|
  accounts_with_failed_email_domains.any? { |failed_email| email == failed_email }
end

# Count to see how many accounts are left
puts unchecked_account_domains.count

上述实现已经运行了很长时间。以下是第二次尝试，但它并没有证明会更好。

unchecked_account_domains = [really big array]
unchecked_account_domains = unchecked_account_domains.sort

accounts_with_failed_email_domains = [another huge array].sort

unchecked_account_domains.each do |email|
  accounts_with_failed_email_domains.bsearch do |failed_email| 
     final_check << email if email == failed_email 
  end
end

# Count to see how many accounts are left
puts final_check.count

bsearch 看起来很有前途，但我相当确定我没有正确使用它。另外，我尝试查看这个问题比较大型列表，但这是用python编写的，我似乎找不到 Ruby 的等效方法 set。有人有解决这个问题的想法吗？

- Dan Rubio

http://ruby-doc.org/stdlib-2.3.1/libdoc/set/rdoc/Set.html - גלעד ברקן

4个回答

6

我这里没有新的解决方案，因为好的答案已经被采纳了。然而，我想查看两种基于代码的解决方案之间是否存在性能差异。

这个答案是一个基准测试，旨在突出使用Array#-和两个Set#include?方法的性能差异。第一个Set#include?基准测试总是执行集合转换，而第二个只转换一次并保留集合以供后续搜索使用。

以下是运行每个测试50次的代码：

require 'set'
require 'benchmark'

string = 'asdfghjkl'
Times = 50

a = 279_000.times.map {|n| "#{n}#{string}" }
b = 149_000.times.map {|n| "#{n*2}#{string}" }

puts RUBY_DESCRIPTION
puts "============================================================"
puts "Running tests for trimming strings"

Benchmark.bm(20) do |x|
  x.report("Array#-:")      { Times.times {|n| a - b } }
  x.report("Set#include? #1:") do
    Times.times do |n|
      d = []
      c = Set.new(b)
      a.each {|email| d << email if c.include?(email) }
    end
  end
  x.report("Set#include? #2:") do
    c = Set.new(b)
    Times.times do |n|
      d = []
      a.each {|email| d << email if c.include?(email) }
    end
  end
end

以下是结果：

ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-darwin14]
============================================================
Running tests for trimming strings
                           user     system      total        real
Array#-:              12.350000   0.250000  12.600000 ( 13.001546)
Set#include? #1:      16.090000   0.330000  16.420000 ( 17.196469)
Set#include? #2:       8.250000   0.100000   8.350000 (  8.726609)

很明显，如果你只需要进行一次数组差异比较，使用Array#-方法即可。但是，如果你需要多次执行此类操作，预先将数组转换为集合会产生巨大的差异，并且比Array#-方法性能更好。将数组转换为集合的成本相对较高，但是一旦你拥有了一个集合，它会更快地执行差异比较。

- Michael Gaskill

2

这是对回答中所概述的内容进行很好的分解。 - Anthony

1

如果您知道数组中包含唯一项（或者您不介意丢失重复项 - 我认为您不介意），那么在这里使用Set会很有用，因此只需取出大数组并执行以下操作：

require 'set'
unchecked_account_domains = [really big array]

accounts_with_failed_email_domains = Set.new([another huge array])
final_check = []

  unchecked_account_domains.each do |email| 
     final_check << email if accounts_with_failed_email_domain.include?(email)  # .include? on a set is in O(1) look up time 
  end

- Anthony

0

将失败邮件的数组转换为集合（我认为 Ruby 命令是 .to_set，可以在 Ruby 文档中了解它）。然后使用 .include? 检查每个未检查的电子邮件是否在集合中。

它运行时间过长的原因是每次检查都要遍历整个或大部分列表。集合类应该对列表进行哈希处理，使查询速度更快。

- גלעד ברקן

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Ilya · Accepted Answer

6

看起来你可以使用Array#-：

result = unchecked_account_domains - accounts_with_failed_email_domains

- Ilya

如果==是您真正需要比较的选项，那么这是一个不错的方法。 - Anthony

Ruby文档中提到，array-difference的实现使用了与Set类似的哈希技术，因此这应该非常高效。 - גלעד ברקן

谢谢 @ilya。这个方法行得通，但我还是应该看一下算法 :) - Dan Rubio

@DanRubio，欢迎。这只是已经实现的解决方案。如果您对算法更感兴趣，请查看此方法的源代码。 - Ilya