Counting rows with specific values by group in a data.table

6

I have a data table that looks a bit like this:

Property    Type
1           apartment
1           office
2           office
2           office
3           apartment
3           apartment
3           office

I now want to count the number of offices and apartments per property:
Property    Type       number_of_offices    number_of_apartments
       1    apartment                  1                       1
       1    office                     1                       1
       2    office                     2                       0
       2    office                     2                       0
       3    apartment                  1                       2
       3    apartment                  1                       2
       3    office                     1                       2

I tried

my.DT <- myDT[,.(Type=Type, number_of_offices=nrow(my.DT[my.DT$Type=="office",]), number_of_apartments=nrow(my.DT$Type=="apparment",], by="Property")

However, this only gives me the total counts for the whole data table. Does anyone have a solution?

Hi laser.p, welcome to SO! To make your question easier to answer, you could follow these guidelines for creating a good reproducible example in R: https://stackoverflow.com/q/49860402/1870254 - jan-glx
5 Answers

4
You can solve the problem as follows:
cols <- c("number_of_offices", "number_of_apartments")
df[, (cols) := .(sum(Type == "office"), sum(Type == "apartment")), Property]

# Property      Type number_of_offices number_of_apartments
# 1:        1 apartment                 1                    1
# 2:        1    office                 1                    1
# 3:        2    office                 2                    0
# 4:        2    office                 2                    0
# 5:        3 apartment                 1                    2
# 6:        3 apartment                 1                    2
# 7:        3    office                 1                    2
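
The snippet above assumes that `df` is already a data.table. As a minimal self-contained sketch, the same idea can be run on the question's sample data; the name `dt` and the way the table is built here are assumptions for illustration only. It relies on the fact that `sum()` over a logical vector counts its `TRUE` values, so within each `Property` group it yields the number of matching rows.

```r
library(data.table)

# Sample data reconstructed from the question (the name `dt` is hypothetical)
dt <- data.table(
  Property = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  Type = c("apartment", "office", "office", "office",
           "apartment", "apartment", "office")
)

cols <- c("number_of_offices", "number_of_apartments")

# Type == "office" is a logical vector; sum() counts its TRUE values.
# With by = Property the counts are computed separately for each property
# and recycled to every row of that group.
dt[, (cols) := .(sum(Type == "office"), sum(Type == "apartment")), by = Property]
```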

3

What is your reason for wanting to merge the grouped counts with my.df?

You can try this, which groups by Property and Type and gives the counts. The counts are then merged back into the original myDT:

grouped = myDT[, .N, by=c('Property','Type')]
myDT = merge(myDT, grouped[Type == 'apartment', list(Property,N)], by='Property', all.x=TRUE)
myDT = merge(myDT, grouped[Type == 'office', list(Property,N)], by='Property', all.x=TRUE)
setnames(myDT, c('N.x','N.y'), c('Number of appartments','Number of offices'))
myDT[is.na(myDT)] <- 0

> myDT
   Property      Type Number of appartments Number of offices
1:        1 apartment                     1                 1
2:        1    office                     1                 1
3:        2    office                     0                 2
4:        2    office                     0                 2
5:        3 apartment                     2                 1
6:        3 apartment                     2                 1
7:        3    office                     2                 1
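
As a possible alternative to the two `merge()` calls, the counts could also be attached with data.table's update join (`X[Y, on = ..., col := i.col]`), which adds the columns by reference. This is only a sketch under the assumption that `myDT` holds the question's sample data; the column names `number_of_offices` and `number_of_apartments` are taken from the question, not from this answer.

```r
library(data.table)

# Sample data reconstructed from the question
myDT <- data.table(
  Property = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  Type = c("apartment", "office", "office", "office",
           "apartment", "apartment", "office")
)

# One row per (Property, Type) combination with its row count N
grouped <- myDT[, .N, by = .(Property, Type)]

# Update joins: look up each property's count and add it by reference.
# i.N refers to column N of the table given in i.
myDT[grouped[Type == "office"], on = "Property", number_of_offices := i.N]
myDT[grouped[Type == "apartment"], on = "Property", number_of_apartments := i.N]

# Properties without a match keep NA; replace those with 0
myDT[is.na(number_of_offices), number_of_offices := 0L]
myDT[is.na(number_of_apartments), number_of_apartments := 0L]
```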

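Thanks for your answer! This produces the correct numbers, which is already a step forward. The reason for merging it with my.df is that I later want to merge it with a larger list. That list has one row per property and contains many columns of information (e.g. the number of offices and apartments, etc.). - laser.p
I edited the answer slightly. I added a left join to get the desired result. - Arturo Sbr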

0

Your approach is a good start!

my.DT <- myDT[,.(Type=Type, number_of_offices=nrow(my.DT[my.DT$Type=="office",]), number_of_apartments=nrow(my.DT$Type=="apparment",], by="Property")

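The brackets don't match; I guess something went wrong while copying and pasting. Errors like these can be avoided effectively by following the guide on how to create a great reproducible example and by using the reprex package. The code in your question could then look like this: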
library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.2

rooms_dt <- fread(text =
"Property    Type
1           apartment
1           office
2           office
2           office
3           apartment
3           apartment
3           office
")
rooms_dt[, .(
  Type=Type, 
  number_of_offices=nrow(rooms_dt[rooms_dt$Type=="office",]), 
  number_of_apartments=nrow(rooms_dt[rooms_dt$Type=="apartment",])
), by="Property"]
#>    Property      Type number_of_offices number_of_apartments
#> 1:        1 apartment                 4                    3
#> 2:        1    office                 4                    3
#> 3:        2    office                 4                    3
#> 4:        2    office                 4                    3
#> 5:        3 apartment                 4                    3
#> 6:        3 apartment                 4                    3
#> 7:        3    office                 4                    3

Created on 2020-03-12 by the reprex package (v0.3.0)

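As you noticed, the problem with this approach is that the j expression is evaluated once per group (= once per Property), but each time it counts all apartment/office rows of the whole table, not just those of the current Property. This can be avoided by subsetting not only on the right type but also on the right property (available in j as the special variable .BY). This already gives the desired result: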

rooms_dt[, .(
  Type=Type, 
  number_of_offices=nrow(rooms_dt[rooms_dt$Property == .BY & rooms_dt$Type=="office",]), 
  number_of_apartments=nrow(rooms_dt[rooms_dt$Property == .BY & rooms_dt$Type=="apartment",])
), by="Property"]
#>    Property      Type number_of_offices number_of_apartments
#> 1:        1 apartment                 1                    1
#> 2:        1    office                 1                    1
#> 3:        2    office                 2                    0
#> 4:        2    office                 2                    0
#> 5:        3 apartment                 1                    2
#> 6:        3 apartment                 1                    2
#> 7:        3    office                 1                    2
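
To make the role of `.BY` more concrete: it is a list holding the current group's value for each `by` column. The following minimal sketch (the result name `by_demo` and its column names are made up for illustration) shows the value it takes in each group, next to the group size `.N`.

```r
library(data.table)

# Sample data from the question
rooms_dt <- data.table(
  Property = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  Type = c("apartment", "office", "office", "office",
           "apartment", "apartment", "office")
)

# .BY is a list with one length-1 element per by column;
# .BY[[1]] is therefore the Property value of the group being processed.
by_demo <- rooms_dt[, .(group_value = .BY[[1]], rows_in_group = .N), by = Property]
by_demo
```

Here `group_value` simply repeats the `Property` of each group, which is why comparing `rooms_dt$Property` against `.BY` restricts the count to the current property.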

However, this solution can be simplified a lot. First, the columns of a data.table can be accessed directly by name inside []:

rooms_dt[, .(
  Type=Type,
  number_of_offices=nrow(rooms_dt[Property == .BY & Type=="office",]),
  number_of_apartments=nrow(rooms_dt[Property == .BY & Type=="apartment",])
), by="Property"]

In addition, data.table provides .SD, the subset of the data table belonging to the current group, which simplifies things further:
rooms_dt[, .(
  Type=Type,
  number_of_offices=nrow(.SD[Type=="office",]),
  number_of_apartments=nrow(.SD[Type=="apartment",])
), by="Property"]

Since not only .SD but also every column is subset to the current group, it is even simpler to use the columns directly:

rooms_dt[, .(
  Type = Type,
  number_of_offices = sum(Type=="office"),
  number_of_apartments = sum(Type=="apartment")
), by="Property"]

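A nice property of data.tables is that, unlike most objects in R, they are mutable. This makes it possible to add new columns directly to the existing data.table without creating a new table: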


rooms_dt[, number_of_offices := sum(Type=="office"), by="Property"]
rooms_dt[, number_of_apartments := sum(Type=="apartment"), by="Property"]

This can also be written as:

rooms_dt[, c("number_of_offices", "number_of_apartments") := list(sum(Type=="office"), sum(Type=="apartment")), by="Property"]

Or as:

rooms_dt[, `:=`(number_of_offices = sum(Type=="office"), number_of_apartments = sum(Type=="apartment")), by="Property"]
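
One consequence of this mutability is worth keeping in mind: a plain assignment does not copy a data.table, so `:=` on the new name also changes the original. The sketch below uses hypothetical names (`original`, `alias`, `independent`) to illustrate this and shows `copy()` as the way to get an independent table.

```r
library(data.table)

original <- data.table(x = 1:3)

# Plain assignment creates another name for the same table, not a copy
alias <- original
alias[, y := x * 2L]          # modifies the shared table by reference
alias_changed_original <- "y" %in% names(original)

# copy() creates an independent table that can be modified safely
independent <- copy(original)
independent[, z := 0L]
copy_changed_original <- "z" %in% names(original)

alias_changed_original   # TRUE: the column added via `alias` is in `original`
copy_changed_original    # FALSE: `original` is unaffected by the copy
```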

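Adding such per-type count columns to every row is not as easy as it could be, probably because it is rarely needed; usually one of the following is preferable: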

rooms_dt[, number_in_property := .N, by=.(Property, Type)]
rooms_dt
#>    Property      Type number_in_property
#> 1:        1 apartment                  1
#> 2:        1    office                  1
#> 3:        2    office                  2
#> 4:        2    office                  2
#> 5:        3 apartment                  2
#> 6:        3 apartment                  2
#> 7:        3    office                  1

Or

dcast(rooms_dt[, .(number_of = .N), by=.(Property, Type)], Property ~ Type, fill = 0)
#> Using 'number_of' as value column. Use 'value.var' to override
#>    Property apartment office
#> 1:        1         1      1
#> 2:        2         0      2
#> 3:        3         2      1
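
The message in the output above appears because `dcast` had to guess the value column. A small sketch of the same reshaping with `value.var` given explicitly, which avoids the guess; the intermediate names `counts` and `wide` are made up for illustration.

```r
library(data.table)

# Sample data from the question
rooms_dt <- data.table(
  Property = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  Type = c("apartment", "office", "office", "office",
           "apartment", "apartment", "office")
)

# Long format: one row per (Property, Type) with its count
counts <- rooms_dt[, .(number_of = .N), by = .(Property, Type)]

# Wide format: one row per Property, one column per Type.
# value.var names the column to spread; fill = 0L is used for
# Property/Type combinations that do not occur in the data.
wide <- dcast(counts, Property ~ Type, value.var = "number_of", fill = 0L)
wide
```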

0

Here is a way to do this with data.table that also handles more than two room types, I think:

dcast(
  rooms_dt[, .(number_of=.N), by=.(Property, Type)], 
  Property ~ Type, 
  fill=0
)[rooms_dt, on=.(Property)]

Full example:
library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.2
rooms_dt <- fread(text =
                    "Property    Type
1           apartment
1           office
2           office
2           office
3           apartment
3           apartment
3           office
")

rooms_summary_dt <- rooms_dt[, .(number_rooms_of_type_at_property = .N), by=.(Property, Type)]
rooms_summary_dt
#>    Property      Type number_rooms_of_type_at_property
#> 1:        1 apartment                                1
#> 2:        1    office                                1
#> 3:        2    office                                2
#> 4:        3 apartment                                2
#> 5:        3    office                                1

properties_dt <- dcast(rooms_summary_dt, Property ~ Type, fill = 0)
#> Using 'number_rooms_of_type_at_property' as value column. Use 'value.var' to override
properties_dt
#>    Property apartment office
#> 1:        1         1      1
#> 2:        2         0      2
#> 3:        3         2      1

properties_dt[rooms_dt, on=.(Property)]
#>    Property apartment office      Type
#> 1:        1         1      1 apartment
#> 2:        1         1      1    office
#> 3:        2         0      2    office
#> 4:        2         0      2    office
#> 5:        3         2      1 apartment
#> 6:        3         2      1 apartment
#> 7:        3         2      1    office

Created on 2020-03-12 by the reprex package (v0.3.0)


-1

Here is a dplyr/tidyr solution,

library(dplyr)
library(tidyr)

df %>% 
 group_by_all() %>% 
 count() %>% 
 pivot_wider(names_from = Type, values_from = n) %>% 
 left_join(., df, by = 'Property')

which gives:

# A tibble: 7 x 4
# Groups:   Property [3]
  Property apartment office Type     
     <int>     <int>  <int> <fct>    
1        1         1      1 apartment
2        1         1      1 office   
3        2        NA      2 office   
4        2        NA      2 office   
5        3         2      1 apartment
6        3         2      1 apartment
7        3         2      1 office 
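
Note that property 2 shows `NA` rather than 0 for apartments. If zeros are wanted, as in the question's desired output, one option is the `values_fill` argument of `pivot_wider`. The following is only a sketch: it assumes a tidyr version whose `pivot_wider` accepts a named list for `values_fill`, rebuilds `df` with a character `Type` column, and uses the made-up names `counts_wide` and `result`.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (Type kept as character here)
df <- data.frame(
  Property = c(1L, 1L, 2L, 2L, 3L, 3L, 3L),
  Type = c("apartment", "office", "office", "office",
           "apartment", "apartment", "office"),
  stringsAsFactors = FALSE
)

# Count rows per Property/Type and spread the types into columns;
# values_fill replaces missing combinations with 0 instead of NA.
counts_wide <- df %>%
  count(Property, Type) %>%
  pivot_wider(names_from = Type, values_from = n, values_fill = list(n = 0L))

# Attach the per-property counts to every original row
result <- left_join(df, counts_wide, by = "Property")
result
```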

Data

dput(df)
structure(list(Property = c(1L, 1L, 2L, 2L, 3L, 3L, 3L), Type = structure(c(1L, 
2L, 2L, 2L, 1L, 1L, 2L), .Label = c("apartment", "office"), class = "factor")), class = "data.frame", row.names = c(NA, 
-7L))
