Calculating the cumulative number of unique values in R

Question

21

A simplified version of my data set looks like this:

depth value
   1     a
   1     b
   2     a
   2     b
   2     b
   3     c

I would like to create a new data set that gives, for each value of "depth", the cumulative number of unique values counted from the top:

depth cumsum
 1      2
 2      2
 3      3

Any ideas on how to do this? I am relatively new to R.

- user2223405
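
For readers who want to run the answers below, here is a minimal sketch that recreates the sample data. The object name `df` is an assumption; the answers variously call the same data frame `df`, `mydf`, `mydata` or `DF`.

```r
# Sample data from the question (object name `df` is an assumption)
df <- data.frame(
  depth = c(1, 1, 2, 2, 2, 3),
  value = c("a", "b", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# The result the question asks for, typed in by hand for later comparison
expected <- data.frame(depth = c(1, 2, 3), cumsum = c(2, 2, 3))
```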

6 Answers

12

A dplyr attempt.

df %>%
  #group_by(group)%>% # if you have a third variable and you want to achieve the same results for each group
  mutate(cum_unique_entries = cumsum(!duplicated(value))) %>%
  group_by(depth) %>% # add group variable for more layers
  summarise(cum_unique_entries = last(cum_unique_entries))

- MLE

1

This worked really well for my problem, thank you for the answer! - Andrew Brēza

8

Here is another attempt:

numvals <- cummax(as.numeric(factor(mydf$value)))
aggregate(numvals, list(depth=mydf$depth), max)

For the question's data this gives 2, 2 and 3 for depths 1, 2 and 3.

It also seems to work with @Arun's example.

- juba

1

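I'm not entirely sure, but it seems that depth and value both have to be sorted at the same time. For example, no matter how you setkey() this data.table, this method does not count the unique occurrence of c: mydf = data.table(data.frame(depth=c(1,1,2,2,6,7), value=c("a", "b", "g", "h", "b", "c"))). - ecoe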
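
The behaviour described in the comment above can be reproduced in base R. This is a sketch, not part of the original answer: it assumes the cause is that `factor()` sorts its levels alphabetically by default, so the numeric codes only track order of first appearance when the values happen to appear in sorted order. Passing `levels = unique(value)` is one possible fix; the names `alphabetical`, `first_seen`, `by_alpha` and `by_first` are illustrative.

```r
# Data from the comment above, as a plain data.frame
mydf <- data.frame(
  depth = c(1, 1, 2, 2, 6, 7),
  value = c("a", "b", "g", "h", "b", "c"),
  stringsAsFactors = FALSE
)

# Default levels are sorted (a, b, c, g, h), so "g" and "h" get codes 4 and 5
alphabetical <- cummax(as.numeric(factor(mydf$value)))

# Levels in order of first appearance (a, b, g, h, c)
first_seen <- cummax(as.numeric(factor(mydf$value, levels = unique(mydf$value))))

by_alpha <- aggregate(alphabetical, list(depth = mydf$depth), max)
by_first <- aggregate(first_seen, list(depth = mydf$depth), max)

by_alpha$x  # 2 5 5 5: overcounts at depth 2 and misses "c" at depth 7
by_first$x  # 2 4 4 5: matches the number of distinct values seen so far
```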

7

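The first step is to create a column of TRUE/FALSE values that is TRUE the first time each value appears and FALSE for every later appearance. This is easily done with the duplicated function: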

mydata$first.appearance = !duplicated(mydata$value)

The aggregate function is a good fit for reshaping the data. Here it sums the first.appearance column within each subset of depth:

newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)

The result will look like this:

  depth first.appearance
1     1                2
2     2                0
3     3                1

This is still not a cumulative sum, though. For that you can use the cumsum function (and then get rid of the old column):

newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL

So, to recap:

mydata$first.appearance = !duplicated(mydata$value)
newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)
newdata$cumsum = cumsum(newdata$first.appearance)
newdata$first.appearance = NULL

Output:

  depth cumsum
1     1      2
2     2      2
3     3      3
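
The same two steps (count first appearances per depth, then take a running total) can also be sketched with `tapply` instead of `aggregate`. This is an alternative formulation rather than the answer's own code; the names `first_seen` and `new_per_depth` are illustrative.

```r
mydata <- data.frame(
  depth = c(1, 1, 2, 2, 2, 3),
  value = c("a", "b", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# TRUE where a value is seen for the first time
first_seen <- !duplicated(mydata$value)

# Number of new values introduced at each depth (groups come back sorted by depth)
new_per_depth <- tapply(first_seen, mydata$depth, sum)

# Running total of new values gives the cumulative distinct count
newdata <- data.frame(
  depth  = as.numeric(names(new_per_depth)),
  cumsum = cumsum(as.vector(new_per_depth))
)
```

Like the `aggregate` version, this reads "first appearance" in row order, so it gives the intended result only when the rows are already ordered by depth.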

- David Robinson

Nice! You can combine this with data.table: DT[, .(depth, unique.count = cumsum(!duplicated(value)))][, .(cumsum = max(unique.count)), by = .(depth)] - Ethan

5

With the sqldf package this can be written as a single, fairly concise SQL statement. Assume DF is the original data frame:

library(sqldf)

sqldf("select b.depth, count(distinct a.value) as cumsum
    from DF a join DF b 
    on a.depth <= b.depth
    group by b.depth"
)

- G. Grothendieck

This is very useful as long as depth is numeric. But if depth is a string, or a string representation of a date as in my case, it can be a very expensive operation. - ecoe

1

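In many cases speed is unimportant and clarity is the more important issue. If performance does matter, you really have to test it rather than make assumptions, and if it turns out to be too slow, add an index and test again. - G. Grothendieck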

1

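Here is another solution using lapply(). Use unique(df$depth) to build a vector of the unique depth values, then, for each of them, subset only those value entries whose depth is less than or equal to that particular depth. Next, take the length of the unique value entries. This length is stored in cumsum, and depth=x records the depth level it belongs to. Finally, do.call(rbind,...) combines the pieces into one data frame.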

do.call(rbind,lapply(unique(df$depth), 
               function(x)
             data.frame(depth=x,cumsum=length(unique(df$value[df$depth<=x])))))
  depth cumsum
1     1      2
2     2      2
3     3      3
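
As a variation on the same idea, the counts can be collected with `vapply` so that only one data frame is built at the end. This is a sketch under the assumption that `df` holds the question's sample data; `depths` and `counts` are illustrative names.

```r
df <- data.frame(
  depth = c(1, 1, 2, 2, 2, 3),
  value = c("a", "b", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

depths <- sort(unique(df$depth))

# For each depth d, count the distinct values found at depth <= d
counts <- vapply(
  depths,
  function(d) length(unique(df$value[df$depth <= d])),
  integer(1)
)

result <- data.frame(depth = depths, cumsum = counts)
```

Because each depth is compared against the whole column, this version does not depend on the row order of `df`.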

- Didzis Elferts

- Arun · Accepted Answer

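I think this is a perfect case for using factor and setting the levels carefully. I will use data.table to implement the idea. Make sure your value column is of type character (although this is not strictly required).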

step 1: Get your data.frame converted to data.table by taking just unique rows.

require(data.table)
dt <- as.data.table(unique(df))
setkey(dt, "depth") # just to be sure before factoring "value"

step 2: Convert value to a factor and coerce to numeric. Make sure to set the levels yourself (it is important).
```
dt[, id := as.numeric(factor(value, levels = unique(value)))]
```

step 3: Set key column to depth for subsetting and just pick the last value

 setkey(dt, "depth", "id")
 dt.out <- dt[J(unique(depth)), mult="last"][, value := NULL]

#    depth id
# 1:     1  2
# 2:     2  2
# 3:     3  3

step 4: Since all values in the rows with increasing depth should have at least the value of the previous row, you should use cummax to get the final output.
```
dt.out[, id := cummax(id)]
```

Edit: The code above is for illustration only. In practice you do not need a third column at all. This is how I would write the final code.

require(data.table)
dt <- as.data.table(unique(df))
setkey(dt, "depth")
dt[, value := as.numeric(factor(value, levels = unique(value)))]
setkey(dt, "depth", "value")
dt.out <- dt[J(unique(depth)), mult="last"]
dt.out[, value := cummax(value)]

Here is a trickier example, along with the output of the code:

df <- structure(list(depth = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6), 
                value = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 4L, 5L, 6L, 1L, 1L), 
                .Label = c("a", "b", "c", "d", "f", "g"), class = "factor")), 
                .Names = c("depth", "value"), row.names = c(NA, -11L), 
                class = "data.frame")
#    depth value
# 1:     1     2
# 2:     2     4
# 3:     3     4
# 4:     4     5
# 5:     5     6
# 6:     6     6
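
The output above can be cross-checked in base R without data.table. This sketch uses the `duplicated`/`cumsum` technique from the other answers rather than the factor-based method shown here, and it rebuilds the trickier data with a plain `data.frame()` call; `running` and `result` are illustrative names.

```r
df <- data.frame(
  depth = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6),
  value = c("a", "b", "c", "d", "a", "c", "d", "f", "g", "a", "a"),
  stringsAsFactors = FALSE
)

# The running count relies on rows being ordered by depth
df <- df[order(df$depth), ]

# Distinct values seen so far, row by row
running <- cumsum(!duplicated(df$value))

# Keep the last (largest) running count within each depth
result <- aggregate(running, list(depth = df$depth), max)
names(result)[2] <- "value"

result$value  # 2 4 4 5 6 6
```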