What is an efficient way to extract only the rows of the first occurrence in a dataset?

Question

4

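I have a data frame of patient encounters and want to extract the earliest encounter for each patient (the sequential encounter ID can be used for this). The code I wrote works, but I am sure this task can be done more efficiently with dplyr. What approach would you recommend?

Here is an example with 10 encounters for 4 patients: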

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855,721,821,855,423,423,855,721,423,855)
gender <- c(0,0,1,0,1,1,0,0,1,0)
df <- data.frame(encounter_ID, patient_ID, gender)

Result (expected and actual):

    encounter_ID    patient_ID  gender
    1003            855         0
    1022            721         0
    1013            821         1
    1002            423         1

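My approach

1) Extract the list of unique patients

list.patients <- unique(df$patient_ID)

2) Create an empty data frame to receive the first encounter of each patient

one.encounter <- data.frame()

3) Loop over each patient in the list, extract their first encounter, and append it to our data frame.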

for (i in 1:length(list.patients)) {
  one.patient <- df %>% filter(patient_ID==list.patients[i])
  one.patient.ordered <- one.patient[order(one.patient$encounter_ID),]
  first.encounter <- head(one.patient.ordered, n=1)
  one.encounter <- rbind(one.encounter, first.encounter)
}

- A. Beal

7 Answers

4

Since the OP asked for an approach that is efficient in terms of execution time, here is a benchmark of the answers, plus a data.table approach.

#Unit: milliseconds
#            expr        min         lq       mean     median         uq        max neval
#          OP(df) 1354.49200 1398.15245 1481.16068 1467.31151 1531.93056 2124.05586   100
#        Mike(df)  587.33074  606.33194  649.87766  621.65719  658.96548 1076.12302   100
#   Fernandes(df)  177.80735  182.97910  206.64074  185.91444  198.83281  430.96393   100
#       `5th`(df)   60.55170   64.98082   77.55248   67.73171   71.54677  208.47656   100
#       SmitM(df)   52.70000   53.93696   59.05506   54.84035   58.92260  175.24284   100
#   Jan_Boyer(df)   30.70666   33.44665   43.04396   34.46983   35.69736  223.02998   100
#  data_table(df)   11.51547   12.38410   14.60907   13.08038   15.25540   43.71229   100
# Moody_dplyr(df)  234.08792  241.02003  260.19283  245.20301  259.82435  517.03117   100
# Moody_baseR(df)   67.05192   72.00578   89.50914   74.64688   77.58169  299.56125   100

Code and data

library(microbenchmark)
library(tidyverse)
library(data.table)

n <- 1e6
set.seed(1)
df <- data.frame(encounter_ID = sample(1000:1999, size = n, replace = TRUE), 
                 patient_ID = sample(700:900, n, TRUE), 
                 gender = sample(0:1, n, TRUE))

benchmark <- microbenchmark(
  OP(df),
  Mike(df),
  Fernandes(df),
  `5th`(df),
  SmitM(df),
  Jan_Boyer(df),
  data_table(df),
  Moody_dplyr(df),
  Moody_baseR(df)
)

autoplot(benchmark)

The solutions so far.

Mike <- function(df) {
  df %>%  
    arrange(patient_ID, encounter_ID) %>% 
    group_by(patient_ID) %>% 
    filter(row_number()==1)
}

SmitM <- function(df) {
  df %>% 
    group_by(patient_ID, gender) %>% 
    summarise(encounter_ID = min(encounter_ID))
}

Fernandes <- function(df) {
  x <- dplyr::arrange(df, encounter_ID)
  x[!duplicated(x$patient_ID),]
}

`5th` <- function(df) {
  df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]
  df_ordered[match(unique(df_ordered$patient_ID), df_ordered$patient_ID), ]
}

Jan_Boyer <- function(df) {
  df <- df[order(df$encounter_ID),] 
  df[!duplicated(df$patient_ID),]
}

data_table <- function(df) {
  setDT(df, key = 'encounter_ID')
  df[df[, .I[1], by = patient_ID]$V1]
}

OP <- function(df) {
  list.patients <- unique(df$patient_ID)
  one.encounter <- data.frame()

  for (i in 1:length(list.patients)) {
    one.patient <- df %>% filter(patient_ID == list.patients[i])
    one.patient.ordered <- one.patient[order(one.patient$encounter_ID), ]
    first.encounter <- head(one.patient.ordered, n = 1)
    one.encounter <- rbind(one.encounter, first.encounter)
  }
  one.encounter  # return the result; a bare for loop returns NULL
}

Moody_dplyr <- function(df) {
  df %>% group_by(patient_ID) %>% top_n(-1,encounter_ID)
}

Moody_baseR <- function(df) {
  subset(df, as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))))
}
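
One caveat when comparing these functions: the benchmark data samples encounter_ID with replacement, so a patient can have several rows sharing the same smallest encounter_ID. The sketch below uses a small made-up data frame (df_ties is a hypothetical name, not part of the benchmark) to show that the duplicated() approach keeps exactly one row per patient, while the ave()/min() approach keeps every tied row, so the two are not guaranteed to return identical output on such data.

```r
# Hypothetical toy data: one patient whose smallest encounter_ID occurs twice
df_ties <- data.frame(encounter_ID = c(1002, 1002, 1007),
                      patient_ID   = c(423, 423, 423),
                      gender       = c(1, 1, 1))

# Jan_Boyer-style: sort by encounter_ID, keep the first row seen per patient
ordered <- df_ties[order(df_ties$encounter_ID), ]
one_row <- ordered[!duplicated(ordered$patient_ID), ]

# Moody_baseR-style: keep every row whose encounter_ID equals the group minimum
all_ties <- subset(df_ties,
                   as.logical(ave(encounter_ID, patient_ID,
                                  FUN = function(x) x == min(x))))

nrow(one_row)   # 1
nrow(all_ties)  # 2
```

With the original 10-row example, where every encounter_ID is unique, both approaches select the same four encounters.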

- markus

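I knew data.table would be faster :D But why are our benchmark timings so different? Are you perhaps on Linux? My tests were run on a Windows machine. - 5th

1

@5th It may be the size of df. Did you try benchmarking with the same data I generated? I am on Linux. - markus

Yes, that makes sense. I somehow overlooked that you generated a much larger dataset. - 5th

Thanks for integrating my answer. I added another option; would you mind adding it as well? - moodymudskipper

I just got here and, after reading the answers, was planning to build a similar benchmark (with my 100,000-row dataset), but Markus already did it! Thanks, everyone, for the useful suggestions and comparisons. - A. Beal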

3

You can try the following:

df2 <- df %>% 
          group_by(patient_ID, gender) %>% 
          summarise(encounter_ID = min(encounter_ID))

- SmitM

1

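While this solution works, if your data frame has 100 variables you would have to put all of them in the group_by statement so that they are kept in df2. That could slow down your data-processing step considerably. - Mike

Thanks, SmitM! It is useful to see this option. Mike is right, though: the real dataset has many more columns, which makes this solution hard to use in practice. - A. Beal

You can group by patient only and then use left_join to merge the result back into the original dataset. This can also be done in base R, with something like merge(df, aggregate(encounter_ID ~ patient_ID, df, min)). I find this approach very readable, but it is slower than several of the other solutions mentioned here. - moodymudskipper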
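
A minimal sketch of the group-then-join idea from the comment above, using the example data from the question. The intermediate names first_ids and first_encounters are made up for illustration; the join direction (summary on the left, original data on the right) is one way to read the comment, not necessarily what the commenter had in mind.

```r
library(dplyr)

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855)
gender <- c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
df <- data.frame(encounter_ID, patient_ID, gender)

# Step 1: smallest encounter_ID per patient (grouping by patient only)
first_ids <- df %>%
  group_by(patient_ID) %>%
  summarise(encounter_ID = min(encounter_ID))

# Step 2: join back to the original data to recover all other columns
first_encounters <- first_ids %>%
  left_join(df, by = c("patient_ID", "encounter_ID"))

# Base R equivalent given in the comment
first_encounters_base <- merge(df, aggregate(encounter_ID ~ patient_ID, df, min))
```

Because only patient_ID goes into group_by, the remaining columns do not have to be listed; they come back through the join.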

3

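In general, R is fastest when operations are vectorized. So when you ask for a more efficient solution, the question is what exactly you mean by that.

To illustrate, I will show a base R solution and run a microbenchmark: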

microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3())
Unit: microseconds
     expr    min      lq     mean  median     uq     max neval
 myfun1() 3997.1 4416.10 6086.848 5129.65 6215.6 64014.4   100
 myfun2()  834.7  993.50 1404.901 1083.95 1247.5 20456.2   100
 myfun3()  133.3  162.75  258.533  193.75  233.8  3561.7   100

Your solution is myfun1(), @SmitM's dplyr version is myfun2(), and my solution (myfun3) is as follows:

df_ordered=df[order(df$patient_ID,df$encounter_ID),]
df_ordered[match(unique(df_ordered$patient_ID),df_ordered$patient_ID),]

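Now you can pick whichever you prefer: the dplyr solution is very readable and, I think, can also be ported to other programming languages. The base R solution is fast, but base R is usually harder to read and, as far as I know, cannot be ported to other languages.

I posted the base R version here because it is relatively easy to read, since each function does what its name says, although dplyr looks nicer.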
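
To make the second line of that base R solution easier to follow, here is a small sketch of what match(unique(x), x) returns. The vector ids is made-up illustration data, already sorted the way df_ordered$patient_ID would be.

```r
# Hypothetical sorted patient IDs, as they would appear after order()
ids <- c(423, 423, 423, 721, 721, 821)

# unique() lists each ID once; match() returns the position of the
# first occurrence of each of those IDs in the original vector
first_pos <- match(unique(ids), ids)
first_pos  # 1 4 6
```

Indexing the sorted data frame with these positions keeps the first row of each patient, which is the earliest encounter because the rows were sorted by patient_ID and then encounter_ID.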

- 5th

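Hi 5th, I should have been more specific when I said "more efficient", but as you guessed, I was thinking of execution time. Thank you very much for this nice example of how to compare results with microbenchmark, and for suggesting a (faster) option! - A. Beal

No problem. Yes, the microbenchmark package is very handy :D By the way, the data.table package can beat base R functions in execution speed; dplyr usually cannot. Also, if you want the privilege to vote, the site tour gives you +2 reputation (I think). - 5th

Ah, another good tip! I consult StackOverflow all the time but only just created an account, and I had been wondering where the upvote button was :-). It looks like I am only 2 points short of that level, so I will go find the site tour. Thanks. - A. Beal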

1

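In the dplyr code below, I sort by both IDs and then group by patient. Using row_number()==1 in the filter statement picks out each patient's smallest encounter_ID, because the data are sorted by both variables and grouped by patient ID.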

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855,721,821,855,423,423,855,721,423,855)
gender <- c(0,0,1,0,1,1,0,0,1,0)
df <- data.frame(encounter_ID, patient_ID, gender)

library(dplyr)



df2 <- df %>%  
        arrange(patient_ID, encounter_ID) %>% 
        group_by(patient_ID) %>% 
        filter(row_number()==1)

- Mike

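@AdamSampson, interesting... can filter(row_number... be used on database objects? I have not tried it, but I mostly work in memory anyway. - r2evans

1

@AdamSampson, please put your comment back; it had more context than my (now deleted) comment! - r2evans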

1

Another option.

x = dplyr::arrange(df, encounter_ID)
x[!duplicated(x$patient_ID),]
#  encounter_ID patient_ID gender
#1         1002        423      1
#2         1003        855      0
#4         1013        821      1
#6         1022        721      0

- Thiago Fernandes

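Thank you, Fernandes and @jan-boyer! I had looked at duplicated(), but was not sure how to make sure I got the first encounter. It is clear now that both of you have shared your suggestions; much appreciated! - A. Beal

1

You can use top_n: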

library(dplyr)
df %>% group_by(patient_ID) %>% top_n(-1,encounter_ID)
# # A tibble: 4 x 3
# # Groups:   patient_ID [4]
#   encounter_ID patient_ID gender
#          <dbl>      <dbl>  <dbl>
# 1         1022        721      0
# 2         1013        821      1
# 3         1002        423      1
# 4         1003        855      0

It is not especially fast, but it is the idiomatic dplyr way.
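
A note for readers on newer dplyr versions: since dplyr 1.0.0, top_n() is superseded by slice_min() and slice_max(). The sketch below is an editorial rewrite of the same idea, not part of the original answer; with_ties = FALSE is an added choice so that exactly one row per patient is kept even if the smallest encounter_ID is repeated.

```r
library(dplyr)

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855)
gender <- c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
df <- data.frame(encounter_ID, patient_ID, gender)

# Keep the row with the smallest encounter_ID within each patient
first_encounters <- df %>%
  group_by(patient_ID) %>%
  slice_min(encounter_ID, n = 1, with_ties = FALSE) %>%
  ungroup()
```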

With base R it is faster:

subset(df, as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))))

- moodymudskipper

- Jan Boyer · Accepted Answer

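Here is a base R solution that does this efficiently without dplyr:

The duplicated function codes the first row in which a patient ID appears as FALSE and all later rows for that patient as TRUE (here we flip that coding by putting ! in front of duplicated). So if you have already sorted your data frame by encounter_ID, you can use it to select only the first encounter.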

df <- df[order(df$encounter_ID),] #order dataframe by encounter id
#subset to rows that are not duplicates of a previous encounter for that patient
first <- df[!duplicated(df$patient_ID),]
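
As a quick illustration of the FALSE/TRUE coding described above, here is what duplicated() returns on a short made-up vector of patient IDs (ids is a hypothetical name used only for this sketch):

```r
# Hypothetical patient IDs in encounter order
ids <- c(423, 855, 423, 821, 855)

dup <- duplicated(ids)
dup    # FALSE FALSE  TRUE FALSE  TRUE

keep <- !dup
keep   #  TRUE  TRUE FALSE  TRUE FALSE

ids[keep]  # 423 855 821
```

Used as a row index on the sorted data frame, the negated vector keeps one row per patient: the first one encountered.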