How to duplicate the last row by group (ID)?

8
I have a data frame of surface contacts over time. I would simply like to append a copy of the last row for each ActivityID:
head(movsdf.rbind)
  ActivityID CareType HCWType Orientation    Surface       Date     Time       Dev.Date.Time SurfaceCategories
1         01       IV    RN01  leftFacing AlcOutside 2019-08-03 11:08:01 2019-08-03 11:08:01       HygieneArea
2         01       IV    RN01  leftFacing         In 2019-08-03 11:08:12 2019-08-03 11:08:12                In
3         01       IV    RN01  leftFacing       Door 2019-08-03 11:08:12 2019-08-03 11:08:12        FarPatient
4         02       IV    RN01  leftFacing       Door 2019-08-03 11:08:18 2019-08-03 11:08:18        FarPatient
5         02       IV    RN01  leftFacing      Other 2019-08-03 11:08:22 2019-08-03 11:08:22        FarPatient
6         03       IV    RN01  leftFacing      Table 2019-08-03 11:10:26 2019-08-03 11:10:26       NearPatient

Example data:

movsdf.rbind <- data.frame(ActivityID = rep(1:3, each = 10), Surface = rep(c("In", "Table", "Out"), each = 10))

I can get a workaround along these lines:

repeatss <- aggregate(movsdf.rbind, by=list(movsdf.rbind$ActivityID), FUN = function(x) { last = tail(x,1) })

movsdf.rbind <-rbind(movsdf.rbind, repeatss)

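This does the trick, but it looks clunky and the data end up out of order (not that it really matters, but I feel there may be something more elegant in dplyr or data.table). Any thoughts?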


2
Could you provide a reproducible example of the data? - Sotos
By DT, do you mean data.table? - Jaap
Yes, sorry, I should have written that more clearly. - HCAI
5 Answers

7

Another alternative, using slice:

library(dplyr)

DF %>% 
  group_by(ActivityID) %>% 
  slice(c(1:n(),n()))

which gives:

# A tibble: 9 x 9
# Groups:   ActivityID [3]
  ActivityID CareType HCWType Orientation Surface    Date      Time     Dev.Date.Time     SurfaceCategori~
       <int> <chr>    <chr>   <chr>       <chr>      <chr>     <chr>    <chr>             <chr>           
1          1 IV       RN01    leftFacing  AlcOutside 2019-08-~ 11:08:01 2019-08-03 11:08~ HygieneArea     
2          1 IV       RN01    leftFacing  In         2019-08-~ 11:08:12 2019-08-03 11:08~ In              
3          1 IV       RN01    leftFacing  Door       2019-08-~ 11:08:12 2019-08-03 11:08~ FarPatient      
4          1 IV       RN01    leftFacing  Door       2019-08-~ 11:08:12 2019-08-03 11:08~ FarPatient      
5          2 IV       RN01    leftFacing  Door       2019-08-~ 11:08:18 2019-08-03 11:08~ FarPatient      
6          2 IV       RN01    leftFacing  Other      2019-08-~ 11:08:22 2019-08-03 11:08~ FarPatient      
7          2 IV       RN01    leftFacing  Other      2019-08-~ 11:08:22 2019-08-03 11:08~ FarPatient      
8          3 IV       RN01    leftFacing  Table      2019-08-~ 11:10:26 2019-08-03 11:10~ NearPatient     
9          3 IV       RN01    leftFacing  Table      2019-08-~ 11:10:26 2019-08-03 11:10~ NearPatient

Two base R options:

# one
lastrows <- cumsum(aggregate(CareType ~ ActivityID, DF, length)[[2]])
DF[sort(c(seq(nrow(DF)), lastrows)),]

# two
idx <- unlist(tapply(1:nrow(DF), DF$ActivityID, FUN = function(x) c(x, tail(x, 1))))
DF[idx,]

Both give the same result.
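As a quick sanity check, the two index vectors can be compared directly. The sketch below uses a small hypothetical data frame `toy` (not the data from this answer) whose rows are already contiguous and sorted by ID, which option one relies on because it derives the last-row positions from cumulative group sizes.

```r
# Hypothetical toy data; rows are already contiguous and sorted by id
toy <- data.frame(id = c(1L, 1L, 1L, 2L, 2L, 3L),
                  val = letters[1:6])

# Option one: cumulative group sizes give the last row index of each group
lastrows <- cumsum(aggregate(val ~ id, toy, length)[[2]])
idx_one <- sort(c(seq_len(nrow(toy)), lastrows))

# Option two: per group, take its row indices plus the last one again
idx_two <- unlist(tapply(seq_len(nrow(toy)), toy$id,
                         FUN = function(x) c(x, tail(x, 1))),
                  use.names = FALSE)

# Both index vectors select the same rows in the same order
identical(idx_one, idx_two)
```

If the groups were interleaved rather than contiguous, the cumulative sizes in option one would no longer point at the right rows, so sorting by ID first would be a sensible precaution.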


Two data.table alternatives:

library(data.table)
setDT(DF)          # convert 'DF' to a data.table

# one
DF[DF[, .I[c(1:.N,.N)], by = ActivityID]$V1]

# two
DF[, .SD[c(1:.N,.N)], by = ActivityID]

Used data:
DF <- structure(list(ActivityID = c(1L, 1L, 1L, 2L, 2L, 3L),
                     CareType = c("IV", "IV", "IV", "IV", "IV", "IV"),
                     HCWType = c("RN01", "RN01", "RN01", "RN01", "RN01", "RN01"),
                     Orientation = c("leftFacing", "leftFacing", "leftFacing", "leftFacing", "leftFacing", "leftFacing"),
                     Surface = c("AlcOutside", "In", "Door", "Door", "Other", "Table"),
                     Date = c("2019-08-03", "2019-08-03", "2019-08-03", "2019-08-03", "2019-08-03", "2019-08-03"),
                     Time = c("11:08:01", "11:08:12", "11:08:12", "11:08:18", "11:08:22", "11:10:26"),
                     Dev.Date.Time = c("2019-08-03 11:08:01", "2019-08-03 11:08:12", "2019-08-03 11:08:12", "2019-08-03 11:08:18", "2019-08-03 11:08:22", "2019-08-03 11:10:26"),
                     SurfaceCategories = c("HygieneArea", "In", "FarPatient", "FarPatient", "FarPatient", "NearPatient")),
                class = "data.frame", row.names = c(NA, -6L))

1
Thank you for all the examples! - HCAI
1
Another base option: DF[rep(1:nrow(DF), 1 + !duplicated(DF$ActivityID, fromLast=TRUE)),], similar to Ronak's. - Frank

3

With dplyr and tidyr, you could try the following (using the sample data from @Jaap):

library(dplyr)
library(tidyr)

DF %>%
 group_by(ActivityID) %>%
 uncount((row_number() == max(row_number())) + 1)

  ActivityID CareType HCWType Orientation Surface   Date     Time   Dev.Date.Time   SurfaceCategori…
       <int> <chr>    <chr>   <chr>       <chr>     <chr>    <chr>  <chr>           <chr>           
1          1 IV       RN01    leftFacing  AlcOutsi… 2019-08… 11:08… 2019-08-03 11:… HygieneArea     
2          1 IV       RN01    leftFacing  In        2019-08… 11:08… 2019-08-03 11:… In              
3          1 IV       RN01    leftFacing  Door      2019-08… 11:08… 2019-08-03 11:… FarPatient      
4          1 IV       RN01    leftFacing  Door      2019-08… 11:08… 2019-08-03 11:… FarPatient      
5          2 IV       RN01    leftFacing  Door      2019-08… 11:08… 2019-08-03 11:… FarPatient      
6          2 IV       RN01    leftFacing  Other     2019-08… 11:08… 2019-08-03 11:… FarPatient      
7          2 IV       RN01    leftFacing  Other     2019-08… 11:08… 2019-08-03 11:… FarPatient      
8          3 IV       RN01    leftFacing  Table     2019-08… 11:10… 2019-08-03 11:… NearPatient     
9          3 IV       RN01    leftFacing  Table     2019-08… 11:10… 2019-08-03 11:… NearPatient 

Or:

DF %>%
 group_by(ActivityID) %>%
 uncount((row_number() == n()) + 1)
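
If uncount does not evaluate row_number() per group in the installed versions of dplyr and tidyr (worth checking, since uncount is a tidyr function rather than a dplyr verb), one possible variant is to compute the weight in a grouped mutate first. This is only a sketch: the column name `w` and the small `toy` data frame are hypothetical, chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for DF
toy <- data.frame(ActivityID = c(1L, 1L, 1L, 2L, 2L, 3L),
                  Surface = c("AlcOutside", "In", "Door", "Door", "Other", "Table"))

res <- toy %>%
  group_by(ActivityID) %>%
  # weight 2 for the last row of each group, 1 for every other row
  mutate(w = (row_number() == n()) + 1) %>%
  ungroup() %>%
  # repeat each row w times; the weight column is dropped by default
  uncount(w)

res
```

Computing the weight as an ordinary column keeps the grouping logic inside dplyr and leaves uncount with a plain column of counts.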

Thank you for your help. I think I'll accept Jaap's answer for the extra code snippets. - HCAI

3

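If we only want to repeat the last row of each group, it is enough to know the row number of the last row in each group. We can get those row numbers with duplicated, setting the fromLast argument to TRUE, then append them to the existing row indices and sort. Using the data from @Jaap: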

DF[sort(c(seq_len(nrow(DF)), which(!duplicated(DF$ActivityID, fromLast = TRUE)))),]


#    ActivityID CareType HCWType Orientation    Surface       Date     Time       Dev.Date.Time SurfaceCategories
#1            1       IV    RN01  leftFacing AlcOutside 2019-08-03 11:08:01 2019-08-03 11:08:01       HygieneArea
#2            1       IV    RN01  leftFacing         In 2019-08-03 11:08:12 2019-08-03 11:08:12                In
#3            1       IV    RN01  leftFacing       Door 2019-08-03 11:08:12 2019-08-03 11:08:12        FarPatient
#3.1          1       IV    RN01  leftFacing       Door 2019-08-03 11:08:12 2019-08-03 11:08:12        FarPatient
#4            2       IV    RN01  leftFacing       Door 2019-08-03 11:08:18 2019-08-03 11:08:18        FarPatient
#5            2       IV    RN01  leftFacing      Other 2019-08-03 11:08:22 2019-08-03 11:08:22        FarPatient
#5.1          2       IV    RN01  leftFacing      Other 2019-08-03 11:08:22 2019-08-03 11:08:22        FarPatient
#6            3       IV    RN01  leftFacing      Table 2019-08-03 11:10:26 2019-08-03 11:10:26       NearPatient
#6.1          3       IV    RN01  leftFacing      Table 2019-08-03 11:10:26 2019-08-03 11:10:26       NearPatient
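
To see why this works, the intermediate values can be inspected on a plain vector. This is a minimal sketch in which the hypothetical `ids` vector stands in for DF$ActivityID.

```r
# Hypothetical ID vector standing in for DF$ActivityID
ids <- c(1L, 1L, 1L, 2L, 2L, 3L)

# Scanning from the end, the first occurrence seen for each ID is its last row,
# so only those positions are NOT flagged as duplicates
is_last <- !duplicated(ids, fromLast = TRUE)

# Positions of each group's last row
last_rows <- which(is_last)

# All row indices plus the last-row indices, sorted so each copy follows its original
idx <- sort(c(seq_along(ids), last_rows))
idx
```

Because duplicated compares values rather than relying on group sizes, it still picks the last occurrence of each ID when the rows of a group are not adjacent.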

2
Here is a base R solution.
result <- lapply(split(movsdf.rbind, movsdf.rbind$ActivityID), function(DF){
  rbind(DF, DF[nrow(DF), ])
})
result <- do.call(rbind, result)

result
#     ActivityID value
#1.1           1     1
#1.2           1     2
#1.3           1     3
#1.31          1     3
#2.4           2     4
#2.5           2     5
#2.6           2     6
#2.61          2     6
#3.7           3     7
#3.8           3     8
#3.9           3     9
#3.91          3     9

If the new row names are ugly, they can be made consecutive with:
row.names(result) <- NULL
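
As a possible alternative to resetting the row names afterwards, rbind's data frame method accepts make.row.names = FALSE, which can be passed through do.call. The sketch below rebuilds a small data frame like the one in this answer; the names `toy` and `pieces` are hypothetical.

```r
# Hypothetical toy data mirroring the structure used in this answer
toy <- data.frame(ActivityID = rep(1:3, each = 3),
                  value = 1:9)

# For each group, append a copy of its last row
pieces <- lapply(split(toy, toy$ActivityID), function(d) {
  rbind(d, d[nrow(d), ])
})

# Bind the groups back together without constructing row names from the pieces
result <- do.call(rbind, c(pieces, make.row.names = FALSE))
result
```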

Data creation code.

movsdf.rbind <- data.frame(ActivityID = rep(1:3, each = 3),
                           value = 1:9)

Thank you for your help. I think I'll accept Jaap's answer for the extra code snippets. - HCAI

1
We can split the data by ActivityID, then use map_df to bind_rows each data frame with a copy of its own last row.
library(dplyr)
library(purrr)
movsdf.rbind %>% 
      split(.$ActivityID) %>% 
      #Use map if you need the output as a list
      #instead of filter you can try slice(.x, n()) 
      map_df(~bind_rows(.x, filter(.x, row_number()==n())))

Thank you for your help. I think I'll accept Jaap's answer for the extra code snippets. - HCAI
