Counting items in arrays across hundreds of thousands of records

Question

ruby-on-rails arrays json ruby postgresql

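I have a Rails app backed by a Postgres database, with an artists table that has a jsonb "genres" column.

There are hundreds of thousands of rows.

The genres column in each row holds an array of genres, something like ["rock", "indie", "seen live", "alternative", "indie rock"].

What I want to do is output, as JSON, a count of each genre across all rows.

Something like: {"rock": 532, "power metal": 328, "indie": 862}. Is there an efficient way to do that?

Update: here's what I have so far...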

genres = Artist.all.pluck(:genres).flatten.delete_if &:empty?
output = Hash[genres.group_by {|x| x}.map {|k,v| [k,v.count]}]
final = output.sort_by{|k,v| v}.to_h

That outputs a hash rather than JSON, which is fine.

But it already feels pretty slow, so I'm wondering if there's a better way to do it.

- Shpigford

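I would probably use a new table (or two) to store the counts, updated in reasonable batches. - iGian

What counts as efficient for you? What have you tried? - Juan Carlos Oropeza

I would extract the data from the JSON, use "group by" and "count", and then build the JSON you need. - Juan Carlos Oropeza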

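I think there is a better way to handle this. Use a properly normalized table to store the genres and a join table to link genres and artists. You can then get the desired result easily by querying the join table and grouping by genre_id. This looks like yet another case where using JSON/array data types only leads to bad database design. - max

@Beartech I don't think there is a real way in Postgres to group by JSON/array values. Either way, you need to pull every genre out of every row in the artists table just to get the full list of genres, which is a huge problem in itself. - max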

2 Answers

- Beartech · Answer 1

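After re-reading your question, I see you state that the column is of type JSONb. So the answer further down will not work, because you first need to get the array out of the jsonb column. This should work better: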

output = Artist.connection.select_all('select genre, count (genre) from (select id, JSONB_ARRAY_ELEMENTS(genres) as genre from artists) as foo group by genre;')

=> #<ActiveRecord::Result:0x00007f8ef20df448 @columns=["genre", "count"], @rows=[["\"rock\"", 5], ["\"blues\"", 5], ["\"seen live\"", 3], ["\"alternative\"", 3]], @hash_rows=nil, @column_types={"genre"=>#<ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Jsonb:0x00007f8eeef5d500 @precision=nil, @scale=nil, @limit=nil>, "count"=>#<ActiveModel::Type::Integer:0x00007f8eeeb4c060 @precision=nil, @scale=nil, @limit=nil, @range=-2147483648...2147483648>}> 

output.rows.to_h

=> {"\"rock\""=>5, "\"blues\""=>5, "\"seen live\""=>3, "\"alternative\""=>3}
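
The keys above still carry JSON quoting because `JSONB_ARRAY_ELEMENTS` returns each element as jsonb rather than text. A minimal sketch of cleaning that up on the Ruby side follows; the rows are copied from the output above in place of a live query, and the `genre_counts` name is only for illustration. Using Postgres's `jsonb_array_elements_text` in the SQL instead should avoid the quoting at the source, assuming every `genres` value really is a JSON array.

```ruby
require 'json'

# Sample rows copied from the output above; in the app they would come from output.rows.
rows = [["\"rock\"", 5], ["\"blues\"", 5], ["\"seen live\"", 3], ["\"alternative\"", 3]]

# Each genre is a JSON-encoded string, so JSON.parse strips the extra quoting.
genre_counts = rows.map { |genre, count| [JSON.parse(genre), count] }.to_h
```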

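As noted in the comments, if you can change the database to normalize it, go for it. An anonymous array in a jsonb column is going to be painful down the road. If you need to use this answer, I would suggest at least considering adding a view to the database, so that you can get the genre counts as a table with a corresponding model, which is easy to set up in the Rails model definition.

This is the original answer, written when I thought your column was a regular Postgres array column type.

Here is the SQL way to do it in Rails: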

genre_count = Artist.connection.select_all('SELECT
                                   UNNEST(genres),
                                   COUNT (UNNEST(genres))
                                  FROM
                                   artists
                                  GROUP BY
                                   UNNEST(genres);')

Then you can convert the much smaller dataset to JSON using whatever method you prefer.
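
One possible way to do that conversion, sketched with hypothetical sample rows standing in for `genre_count.rows`: sort the pairs by count, build a hash, and serialize it with the standard `json` library.

```ruby
require 'json'

# Hypothetical rows standing in for genre_count.rows ([genre, count] pairs).
rows = [["rock", 532], ["power metal", 328], ["indie", 862]]

# Sort ascending by count, then turn the pairs into a hash (insertion order is kept).
sorted_counts = rows.sort_by { |_genre, count| count }.to_h

json = JSON.generate(sorted_counts)
puts json
```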

I am not familiar enough with UNNEST to know why I can't alias it like any other column to make it prettier. But it does work.

http://sqlfiddle.com/#!15/30597/21/0
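
On the aliasing question: a set-returning function can be given an alias once it is moved into the `FROM` clause. The sketch below assumes `genres` is a native Postgres array column, as this original answer did. The `genre_counts` helper, the `genre_count` column alias, and the `FakeConnection` stub are made up for illustration so that the snippet runs without a database. As far as I know, PostgreSQL 10 and later also reject a set-returning function inside an aggregate such as `COUNT(UNNEST(genres))`, which this form avoids.

```ruby
# UNNEST moved into FROM, where it can be aliased like any other table expression.
GENRE_COUNT_SQL = <<~SQL
  SELECT genre, COUNT(*) AS genre_count
  FROM artists
  CROSS JOIN LATERAL UNNEST(genres) AS genre
  GROUP BY genre
  ORDER BY genre_count
SQL

# Hypothetical helper: accepts anything that responds to select_all,
# for example Artist.connection in the Rails app.
def genre_counts(connection)
  connection.select_all(GENRE_COUNT_SQL).rows.to_h
end

# Stand-in for a database connection so the sketch runs on its own.
FakeResult = Struct.new(:rows)

class FakeConnection
  def select_all(_sql)
    FakeResult.new([["power metal", 328], ["rock", 532]])
  end
end

counts = genre_counts(FakeConnection.new)
```

In the app this would be called as `genre_counts(Artist.connection)`.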

- max · Answer 2

If you just used a good relational database design, this would be a really trivial task:

class Artist < ApplicationRecord
  has_many :artist_genres
  has_many :genres, through: :artist_genres
end

class Genre < ApplicationRecord
  has_many :artist_genres
  has_many :artists, through: :artist_genres
end

class ArtistGenre < ApplicationRecord
  belongs_to :artist 
  belongs_to :genre
end

You can then get the results with:

class Genre < ApplicationRecord
  has_many :artist_genres
  has_many :artists, through: :artist_genres

  # This will instantiate a record for each row just like your average scope
  # and return an ActiveRecord::Relation object.
  def self.with_artist_counts
    self.joins(:artist_genres)
        .select('genres.name, COUNT(artist_genres.id) AS artists_count')
        .group('genres.id') # qualified, since artist_genres also has an id column
  end

  # This pulls the columns as raw sql results and creates a hash with the genre 
  # name as keys
  def self.pluck_artist_counts
    self.connection.select_all(with_artist_counts.to_sql).inject({}) do |hash, row|
      hash.merge(row["name"] => row["artists_count"])
    end
  end
end
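
To make the hash-building step in `pluck_artist_counts` concrete, here is a sketch that runs without a database. The rows are hypothetical stand-ins for what `select_all` returns (one hash per genre), and `each_with_object` is swapped in for `inject`/`merge` only because it avoids allocating a new hash per row; the result should be the same.

```ruby
require 'json'

# Hypothetical rows shaped like the select_all result: one hash per genre.
rows = [
  { "name" => "rock",        "artists_count" => 532 },
  { "name" => "power metal", "artists_count" => 328 },
  { "name" => "indie",       "artists_count" => 862 }
]

# Same result as inject({}) { |hash, row| hash.merge(...) }, built in place.
counts = rows.each_with_object({}) do |row, hash|
  hash[row["name"]] = row["artists_count"]
end

json = counts.to_json
puts json
```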