Handling large data sets with Ruby


Really need help. I'm struggling with a dashboard that displays a large amount of data.

On average it takes about 2 seconds to process 2k records.

Querying 150k rows in the MySQL console takes under 3.5 seconds, but in Ruby it takes 4+ minutes from executing the query to having all the objects ready.
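To pin down where the time goes, a rough sketch like the one below separates raw query time from ActiveRecord object construction (not from the original post; Benchmark is in Ruby's standard library, and the names assume the FillUp model shown further down):

require 'benchmark'

sql = "SELECT * FROM fill_ups WHERE created_at >= '#{6.months.ago.to_s(:db)}'"

# Raw rows as plain hashes, no ActiveRecord objects built
query_time = Benchmark.realtime { ActiveRecord::Base.connection.select_all(sql) }

# Same rows instantiated as full ActiveRecord objects
build_time = Benchmark.realtime { FillUp.find_by_sql(sql) }

puts "raw query: #{query_time.round(2)}s, with AR objects: #{build_time.round(2)}s"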

Goal: squeeze more performance out of the data handling before adding a caching server. Using Ruby 1.9.2, Rails 3.0 and MySQL (mysql2 gem).

Questions:

  • Does using a Hash hurt performance?
  • Is it better to load everything into one master Hash first and then manipulate the data I need out of it?
  • Is there anything else I can do to improve performance?

Row counts in the database:

  • GasStations and US Census: ~150,000 records
  • Person: ~100,000 records
  • Cars: ~200,000 records
  • FillUps: ~2.3 million records

What the dashboard needs (queries are based on time windows such as the past 24 hours, last week, etc.). All data is returned as JSON to the JS layer. (A sketch of one such aggregate in SQL follows the list below.)

  • Gas stations, with their FillUps and US Census data (zip code, name, city, population)
  • Top 20 cities with the most fill-ups
  • Top 10 cars with the most fill-ups
  • Cars grouped by number of fill-ups
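For example, the "top 20 cities" aggregate can be pushed entirely into MySQL; a sketch, where the table and column names are assumptions based on the models shown further down:

client = Mysql2::Client.new(:host => "localhost", :username => "root")
# Let the database do the grouping; only 20 small rows come back to Ruby
top_cities = client.query(<<-SQL).to_a
  SELECT u.city, u.population, COUNT(*) AS fill_ups
  FROM fill_ups f
  JOIN gas_stations g ON g.id = f.gas_station_id
  JOIN uscensus u ON u.zipcode = g.zipcode
  WHERE f.created_at >= NOW() - INTERVAL 6 MONTH
  GROUP BY u.city, u.population
  ORDER BY fill_ups DESC
  LIMIT 20
SQL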

Code (a 6-month sample that returns 100k+ records):

# For simplicity the select clause is removed here, but in the real query I drop
# columns I don't need (updated_at, gas_station.created_at, etc.) instead of
# returning all the columns for each table.
@primary_data = FillUp.includes([:car, {:gas_station => :uscensus}]).where('fill_ups.created_at >= ?', 6.months.ago) # this takes 4+ minutes

# then tried

@primary_data = FillUp.find_by_sql('some long sql query...') # took longer than before
# Note for others: the SQL did some pre-processing for me, adding extra attributes
# to the results. The query itself took < 4 seconds in the DB console, but those
# extra attributes made it slower, as if Ruby were checking each row to map them.

# then tried

# As seen at http://stackoverflow.com/questions/4456834/ruby-on-rails-storing-and-accessing-large-data-sets
MY_MAP = Hash[ActiveRecord::Base.connection.select_all('SELECT thingone, thingtwo from table').map { |one| [one['thingone'], one['thingtwo']] }]
# That took 23 seconds, and since it also gave me the mapping of additional data
# that I had been processing later, it was much faster overall.

# Currently using the version below, which takes about 10 seconds.
# Although this is faster, the query itself still only takes 3.5 seconds;
# parsing the rows into the hashes adds the remaining overhead.
cars = {}
gasstations = {}
cities = {}
filled = {}

client = Mysql2::Client.new(:host => "localhost", :username => "root")
client.query("SELECT sum(fill_ups_grouped_by_car_id) as filled, fillups.car_id, cars.make as make, gasstations.name as name,  ....", :stream => true, :as => :json).each do |row|
  # this returns fill ups gouged by car ,fill_ups.car_id, car make, gas station name, gas station zip, gas station city, city population 
  if cities[row['city']]
    cities[row['city']]['fill_ups']  = (cities[row['city']]['fill_ups']  + row['filled'])
  else
    cities[row['city']] = {'fill_ups' => row['filled'], 'population' => row['population']}
  end
  if gasstations[row['name']]
    gasstations[row['name']]['fill_ups'] = (gasstations[row['name']]['fill_ups'] + row['filled'])
  else
    gasstations[row['name']] = {'city' => row['city'], 'zip' => row['zip'], 'fill_ups' => row['filled']}
  end
  if cars[row['make']]
    cars[row['make']] = (cars[row['make']] + row['filled'])
  else
    cars[row['make']] = row['filled']
  end
  if filled[row['filled']]
    filled[row['filled']] = (filled[row['filled']] + 1)
  else
    filled[row['filled']] = 1
  end
end
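The nil checks in that loop can be dropped by giving each Hash a default value; a sketch of the same aggregation, where same_sql stands in for the long SELECT above:

# Hashes that initialize missing keys on first access
cities      = Hash.new { |h, k| h[k] = { 'fill_ups' => 0 } }
gasstations = Hash.new { |h, k| h[k] = { 'fill_ups' => 0 } }
cars        = Hash.new(0) # plain counters default to 0
filled      = Hash.new(0)

client.query(same_sql, :stream => true).each do |row|
  city = cities[row['city']]
  city['population'] ||= row['population']
  city['fill_ups']   += row['filled']

  station = gasstations[row['name']]
  station['city']    ||= row['city']
  station['zip']     ||= row['zip']
  station['fill_ups'] += row['filled']

  cars[row['make']]     += row['filled']
  filled[row['filled']] += 1
end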

Here are a few of the models:

class Person < ActiveRecord::Base
  has_many :cars
end

class Car < ActiveRecord::Base
  belongs_to :person
  belongs_to :uscensus, :foreign_key => :zipcode, :primary_key => :zipcode
  has_many :fill_ups
  has_many :gas_stations, :through => :fill_ups
end

class GasStation < ActiveRecord::Base
  belongs_to :uscensus, :foreign_key => :zipcode, :primary_key => :zipcode
  has_many :fill_ups
  has_many :cars, :through => :fill_ups
end

class FillUp < ActiveRecord::Base
  # log of every time a person fills up their gas
  belongs_to :car
  belongs_to :gas_station
end

class Uscensus < ActiveRecord::Base
  # basic data about an area, keyed by zip code
end

Nitpicking a bit here: Car, GasStation, FillUp. In my opinion the Ruby conventions read better: car, gas_station, fill_up - Zabba
Is it necessary to return the entire data set at once, or could you do something like http://apidock.com/rails/ActiveRecord/Batches/ClassMethods/find_in_batches? - engineerDave
No; I could use find_in_batches, but the Ruby side is still very slow. Will try it and report back. - pcasa
On the "better readability" point, I agree, but performance is key here. - pcasa
OK, I ran a quick test against the US Census table, pulling id, city and fipscode for 149,318 rows. The standard Ruby way took 6.144613 seconds; the batched approach took 6.642666 seconds; a raw query through the mysql2 gem took 1.851102 seconds. - pcasa
To squeeze out more performance, changing :as => :json to :as => :array helped. - pcasa
1 Answer


I don't use RoR, but returning 100k rows to a dashboard is always going to be slow. I'd strongly suggest building or maintaining summary tables and running GROUP BY queries in the database to aggregate the data set before presentation.
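For example (a sketch; fill_up_daily_rollups is a hypothetical table, with column names assumed from the question's models):

client = Mysql2::Client.new(:host => "localhost", :username => "root")

# One row per (day, gas station); tiny compared to 2.3M raw fill_ups
client.query(<<-SQL)
  CREATE TABLE IF NOT EXISTS fill_up_daily_rollups (
    day            DATE NOT NULL,
    gas_station_id INT  NOT NULL,
    fill_ups       INT  NOT NULL,
    PRIMARY KEY (day, gas_station_id)
  )
SQL

# Populate it once from the raw log
client.query(<<-SQL)
  INSERT INTO fill_up_daily_rollups (day, gas_station_id, fill_ups)
  SELECT DATE(created_at), gas_station_id, COUNT(*)
  FROM fill_ups
  GROUP BY DATE(created_at), gas_station_id
SQL

The dashboard queries then scan a few thousand pre-aggregated rows instead of millions of raw ones.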


We tried rollup tables, but since we write big chunks of new data every 15 minutes, all the rollups would need updating every 15 minutes, and it started getting messy. - pcasa
Updating every 15 minutes still beats updating on every dashboard load, though. Whatever you do, I'd suggest keeping the aggregation on the database side. - Joshua Martell
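One way to keep a rollup like the hypothetical fill_up_daily_rollups sketched above fresh without rebuilding it: re-aggregate only the day that can still change and upsert the result (ON DUPLICATE KEY UPDATE is standard MySQL), running it from the same 15-minute job that loads the new data:

# Recompute today's counts in full and overwrite the existing rollup rows
client.query(<<-SQL)
  INSERT INTO fill_up_daily_rollups (day, gas_station_id, fill_ups)
  SELECT DATE(created_at), gas_station_id, COUNT(*)
  FROM fill_ups
  WHERE created_at >= CURDATE()
  GROUP BY DATE(created_at), gas_station_id
  ON DUPLICATE KEY UPDATE fill_ups = VALUES(fill_ups)
SQL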
