Extract a dplyr tbl column as a vector

Question

268

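Is there a more succinct way to get one column of a dplyr tbl as a vector, from a tbl with a database back-end (i.e. the data frame/table can't be subset directly)?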

require(dplyr)
db <- src_sqlite(tempfile(), create = TRUE)
iris2 <- copy_to(db, iris)
iris2$Species
# NULL

That would have been too easy, so

collect(select(iris2, Species))[, 1]
# [1] "setosa"     "setosa"     "setosa"     "setosa"  etc.

But it seems a bit clunky.

- nacnudus

1

Is collect(iris2)$Species any more concise? - CJ Yetman

8 Answers

124

As per the comment from @nacnudus, it looks like a pull function was implemented in dplyr 0.6:

iris2 %>% pull(Species)
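The different ways of naming the column in pull() can be sketched as follows. This is a minimal illustration that assumes dplyr >= 0.7.0 and uses the in-memory iris data frame rather than the database-backed iris2; the variable names are made up for the example.

```r
library(dplyr, warn.conflicts = FALSE)

# In-memory data frame used purely for illustration.
by_name    <- iris %>% pull(Species)  # bare column name
by_pos     <- iris %>% pull(5)        # positive index counts from the left
by_neg     <- iris %>% pull(-1)       # negative index counts from the right
by_default <- iris %>% pull()         # default is var = -1, i.e. the last column

# All four calls return the same Species vector.
identical(by_name, by_pos)
identical(by_name, by_neg)
identical(by_name, by_default)
```

Because Species is the fifth and last column of iris, every form above selects the same column.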

For older versions of dplyr, here is a neat function that makes pulling out a column a bit nicer (easier to type and easier to read):

pull <- function(x,y) {x[,if(is.name(substitute(y))) deparse(substitute(y)) else y, drop = FALSE][[1]]}

This lets you do any of the following:

iris2 %>% pull('Species')
iris2 %>% pull(Species)
iris2 %>% pull(5)

Resulting in...

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

It also works fine with data frames:

> mtcars %>% pull(5)
 [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43
[28] 3.77 4.22 3.62 3.54 4.11

A nice way to do this in v0.2 of dplyr:

iris2 %>% select(Species) %>% collect %>% .[[1]]

Or if you prefer:

iris2 %>% select(Species) %>% collect %>% .[["Species"]]

Or if your table isn't too big, simply...

iris2 %>% collect %>% .[["Species"]]

- Tommy O'Dell

2

I like your pull function. I would just add a simplification for cases where there is only one variable:

pull <- function(x, y) {
  if (ncol(x) == 1) y <- 1
  x[, if (is.name(substitute(y))) deparse(substitute(y)) else y, drop = FALSE][[1]]
}

That way you can use iris2 %>% pull(). - Rappster

7

You can also use the magrittr exposition operator (%$%) to pull a vector from a data frame. For example, iris2 %>% select(Species) %>% collect() %$% Species extracts the Species column as a vector. - seasmith

@Luke1018 You should turn this comment into an answer. - rrs

pull() will be implemented in dplyr version 0.6: https://github.com/tidyverse/dplyr/commit/0b9aabf6c06c9cd3b784b155044d497d4b93df3e - nacnudus

86

You can also use unlist, which I find easier to read because you do not need to repeat the column name or specify the index.

iris2 %>% select(Species) %>% unlist(use.names = FALSE)

- StanislawSwierc

1

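This seems to be the most general approach, since it works identically on vectors and data frames, i.e. it makes functions more agnostic. - geotheory

I was just looking for an answer to this exact problem, and unlist is exactly what I needed. Thanks! - Andrew Brēza

unlist can pull values from multiple columns (combining all the values into a single vector), whereas dplyr::pull is limited to a single column. - filups21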
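The difference between one and several columns can be sketched with a small base R example. The data frame df and its columns a and b are made up for illustration.

```r
# Hypothetical two-column data frame.
df <- data.frame(a = 1:3, b = 4:6)

# One column: unlist() returns that column as a plain vector.
one <- unlist(df["a"], use.names = FALSE)

# Several columns: the values are concatenated column by column.
both <- unlist(df, use.names = FALSE)

print(one)
print(both)
```

When the columns have different types, unlist() coerces them to a common type, so this is most predictable when all selected columns share one type.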

23

I would use the extract2 convenience function from magrittr:

library(magrittr)
library(dplyr)

iris2 %>%
  select(Species) %>%
  extract2(1)

- Hugh

Did you mean to use collect() between select and extract2? - nacnudus

10

use_series(Species) is perhaps even more readable. Thanks for pointing me to these functions; several of the others are handy too. - nacnudus
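Both magrittr aliases mentioned here can be tried on an ordinary data frame. The sketch below uses the in-memory iris data frame, so no collect() is needed; for a database-backed tbl, a collect() step before the alias is assumed to be required, as the comment above suggests.

```r
library(magrittr)

# extract2() is magrittr's alias for `[[`; use_series() is its alias for `$`.
by_index <- iris %>% extract2(5)          # fifth column, by position
by_name  <- iris %>% use_series(Species)  # same column, by name

identical(by_index, by_name)
```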

22

I'd probably write:

collect(select(iris2, Species))[[1]]

dplyr is designed for working with tbls of data, so there is no better way to get a single column of data.

- hadley

Can't say fairer than that. It came up interactively at the console, when I was trying to check for spurious values with unique(table$column). - nacnudus

4

For that case, you could also do group_by(column) %.% tally(). - hadley

13

In many cases we really do need to extract a vector, so adding a drop = TRUE argument to dplyr::select would be very useful. - Antoine Lizée

This was the only way I could get a column out of a sparklyr sdf. pull() did not work for me on 0.7.8. - Meep

17

@Luke1018 proposed this solution in the comments:

You can also use the magrittr exposition operator (%$%) to pull a vector from a data frame.

For example:

iris2 %>% select(Species) %>% collect() %$% Species

I thought it deserved its own answer.

- rrs

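I was looking for this. - Diego-MX

What if I want to pass a string variable containing the column name, rather than the column name itself? - mzuba

@mzuba tibble(x = 1:10, y = letters[1:10]) %>% select_("x") %>% unlist(). You can add another %>% unname() at the end if you like, but for my purposes I have found that last link in the pipe chain unnecessary. You can also specify use.names = FALSE in the unlist() call, which has the same effect as adding unname() to the pipe chain. - Mark White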

1

@mzuba I would use pull now. My solution was written before dplyr version 0.6. - rrs

1

Note that %$% works on any list, whereas pull() does not. - wint3rschlaefer
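The last two points in this comment thread can be sketched together. The object plain_list and the variable col are made up for illustration, and the use of `[[` for a column name held in a string is one simple option, not necessarily what the commenters had in mind.

```r
library(magrittr)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical plain list (not a data frame).
plain_list <- list(x = 1:3, y = letters[1:3])

# %$% exposes the names of any list, so this works.
from_list <- plain_list %$% x

# pull() has no method for a bare list, so the call errors; the error is caught here.
pull_result <- tryCatch(pull(plain_list, x), error = function(e) "error")

# With the column name held in a string variable, `[[` does the lookup.
col <- "y"
from_string <- plain_list[[col]]
```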

8

If you are used to indexing with square brackets, another option is to wrap the usual indexing approach in a call to deframe(), e.g.:

library(tidyverse)

iris2 <- as_tibble(iris)

# using column name
deframe(iris2[, 'Sepal.Length'])

# [1] 5.1 4.9 4.7 4.6 5.0 5.4

# using column number
deframe(iris2[, 1])

# [1] 5.1 4.9 4.7 4.6 5.0 5.4

Both this and pull() are pretty good ways of getting a tibble column.

- Keith Hughitt
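One detail worth knowing about deframe() is that its result depends on the number of columns. The sketch below assumes the tibble package; the tibbles one_col and two_col are made up for illustration.

```r
library(tibble)

one_col <- tibble(x = c(1.5, 2.5, 3.5))
two_col <- tibble(key = c("a", "b"), value = c(10, 20))

# One column: an unnamed vector.
v1 <- deframe(one_col)

# Two columns: a named vector, with names taken from the first column.
v2 <- deframe(two_col)

print(v1)
print(v2)
```

So deframe() behaves like a column extractor only when the input has already been narrowed to a single column, as in the bracket-indexing calls above.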

1

An even faster way to extract a column vector is to convert the data frame to a list with c(), and then:

c(iris)$Species
c(iris)$Sepal.Length

Using a dplyr approach to convert a column to a vector:

iris %>% select(Sepal.Length) %>% 
         as.matrix() %>% 
         as.vector()

If you want to return all the values in the dataset as a vector, just do the following:

# I have this tibble:
iris %>% as_tibble() %>% head(3)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa

Do this for column-value order (5.1, 4.9, 4.7, ...):

iris %>% as_tibble() %>% as.matrix %>% as.vector()

  [1] "5.1"        "4.9"        "4.7"       
  [4] "4.6"        "5.0"        "5.4"       
  [7] "4.6"        "5.0"        "4.4"       
 [10] "4.9"        "5.4"        "4.8" 
 ....
[742] "virginica"  "virginica"  "virginica" 
[745] "virginica"  "virginica"  "virginica" 
[748] "virginica"  "virginica"  "virginica"

And do this for row-value order (5.1, 3.5, 1.4, etc.):

iris %>% as_tibble() %>% as.matrix %>% t() %>% as.vector()

  [1] "5.1"        "3.5"        "1.4"       
  [4] "0.2"        "setosa"     "4.9"       
  [7] "3.0"        "1.4"        "0.2"       
 [10] "setosa"     "4.7"        "3.2"       
 [13] "1.3"        "0.2"        "setosa"
 ....
[739] "2.0"        "virginica"  "6.2"       
[742] "3.4"        "5.4"        "2.3"       
[745] "virginica"  "5.9"        "3.0"       
[748] "5.1"        "1.8"        "virginica"

- rubengavidia0x
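The outputs above are character vectors because as.matrix() coerces every column to character when a non-numeric column such as Species is present. If numeric values are wanted, one option is to keep only the numeric columns first. This is a minimal base R sketch; num_only, by_column and by_row are names made up for the example.

```r
# Keep only the numeric columns so as.matrix() does not coerce to character.
num_only <- iris[vapply(iris, is.numeric, logical(1))]

# Column-major order: all of Sepal.Length first, then Sepal.Width, and so on.
by_column <- as.vector(as.matrix(num_only))

# Row-major order: first row, then second row, and so on.
by_row <- as.vector(t(as.matrix(num_only)))

head(by_column, 3)
head(by_row, 4)
```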


- Lorenz Walthert · Accepted Answer

With dplyr >= 0.7.0, you can use pull() to get a vector from a tbl.

library(dplyr, warn.conflicts = FALSE)
db <- src_sqlite(tempfile(), create = TRUE)
iris2 <- copy_to(db, iris)
vec <- pull(iris2, Species)
head(vec)
#> [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
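Since src_sqlite() is deprecated in recent dplyr releases, the same idea can be sketched with a DBI connection instead. This assumes the DBI, RSQLite and dbplyr packages are installed; the names con and iris_db are made up for the example.

```r
library(dplyr, warn.conflicts = FALSE)

# In-memory SQLite database, used only for illustration.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
iris_db <- copy_to(con, iris, name = "iris")

# pull() runs the query and returns the column as a plain vector.
vec <- iris_db %>% pull(Species)
head(vec)

DBI::dbDisconnect(con)
```

As in the output above, Species comes back as a character vector rather than a factor, because SQLite has no factor type.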