整理数据集:通过收集多个列?

11

我想通过这种方式操作数据来整理数据集:

age gender  education       previous_comp_exp   tutorial_time   qID.1    time_taken.1   qID.2    time_taken.2   
18  Male    Undergraduate   casual gamer        62.17926        sor9     39.61206       sor8     19.4892
24  Male    Undergraduate   casual gamer        85.01288        sor9     50.92343       sor8     16.15616

成为这样:

age gender  education       previous_comp_exp   tutorial_time   qID      time_taken 
18  Male    Undergraduate   casual gamer        62.17926        sor9     39.61206       
18  Male    Undergraduate   casual gamer        62.17926        sor8     19.4892
24  Male    Undergraduate   casual gamer        85.01288        sor9     50.92343       
24  Male    Undergraduate   casual gamer        85.01288        sor8     16.15616

我已经尝试使用gather(),但我只能让它与一个列一起使用,并且我一直收到这个警告:

警告信息:测量变量的属性不相同;它们将被删除

有什么建议吗?


这不是一个错误,而是一个警告,告诉你你堆叠的两列具有不同的属性(也许它们都是因子,但具有不同的水平),因此这些属性在输出中被删除了。这个这个可能有助于处理从宽到长的重塑,因为你有需要堆叠的一对列。 - eipi10
2个回答

11

使用来自data.tablemelt(参见?patterns):

library(data.table)

melt(setDT(df), measure = patterns("^qID", "^time_taken"),
     value.name = c("qID", "time_taken"))

结果:

   age gender     education previous_comp_exp tutorial_time variable  qID time_taken
1:  18   Male Undergraduate      casual_gamer      62.17926        1 sor9   39.61206
2:  24   Male Undergraduate      casual_gamer      85.01288        1 sor9   50.92343
3:  18   Male Undergraduate      casual_gamer      62.17926        2 sor8   19.48920
4:  24   Male Undergraduate      casual_gamer      85.01288        2 sor8   16.15616

或者使用 tidyr

library(dplyr)
library(tidyr)

df %>%
  gather(variable, value, qID.1:time_taken.2) %>%
  mutate(variable = sub("\\.\\d$", "", variable)) %>%
  group_by(variable) %>%
  mutate(ID = row_number()) %>%
  spread(variable, value, convert = TRUE) %>%
  select(-ID)

结果:

# A tibble: 4 x 7
    age gender     education previous_comp_exp tutorial_time   qID time_taken
  <int> <fctr>        <fctr>            <fctr>         <dbl> <chr>      <dbl>
1    18   Male Undergraduate      casual_gamer      62.17926  sor9   39.61206
2    18   Male Undergraduate      casual_gamer      62.17926  sor8   19.48920
3    24   Male Undergraduate      casual_gamer      85.01288  sor9   50.92343
4    24   Male Undergraduate      casual_gamer      85.01288  sor8   16.15616

注意:

对于tidyr方法,使用convert=TRUEtime_taken转换回numeric,因为当与qID列一起使用gather函数时,它被强制转换为字符。

数据:

df = structure(list(age = c(18L, 24L), gender = structure(c(1L, 1L
), .Label = "Male", class = "factor"), education = structure(c(1L, 
1L), .Label = "Undergraduate", class = "factor"), previous_comp_exp = structure(c(1L, 
1L), .Label = "casual_gamer", class = "factor"), tutorial_time = c(62.17926, 
85.01288), qID.1 = structure(c(1L, 1L), .Label = "sor9", class = "factor"), 
    time_taken.1 = c(39.61206, 50.92343), qID.2 = structure(c(1L, 
    1L), .Label = "sor8", class = "factor"), time_taken.2 = c(19.4892, 
    16.15616)), .Names = c("age", "gender", "education", "previous_comp_exp", 
"tutorial_time", "qID.1", "time_taken.1", "qID.2", "time_taken.2"
), class = "data.frame", row.names = c(NA, -2L))

1
优秀的使用了melt! - agstudy
@useR,在Tidy方法中,年龄变量消失了。有什么想法为什么会这样? - stenfeio
@stenfeio 谢谢你发现了这个问题!实际上我读取数据时出现了错误。由于我使用的是 read.table,所以 casualgamer 被视为不同的列,而第一列被视为行名。它没有抛出错误,因为列数恰好匹配。请查看我的编辑。 - acylam
我不知道 pattern 函数,好棒的一行代码。@useR 我有一个问题,在看到你的解决方案之前我尝试过:time <- paste("time_taken", 1:2, sep = ".")qID <- paste("qID", 1:2, sep = ".")melt(DT,mesure = list(time,qID), value.name = c("qID", "time_taken"))但是它给了我一个错误。你有任何想法为什么吗? - denis
@denis 只是个小错误。你拼错了 "measure"。所以melt使用默认值 "如果缺失,则所有非id列都将被分配给它" (参见 ?melt)。 - acylam
@useR 哇,抱歉我想我有点累了。非常感谢您的帮助和友好的回答。确实它使用了默认值。 - denis

6

在基本的 R 语言中,您可以使用强大的 reshape 函数来将数据从宽格式转换为长格式,只需一行代码:

   reshape(dx,direction="long",
        varying=list(grep("qID",colnames(dx)),
                     grep("time_taken",colnames(dx))),
        v.names=c("qID","time_taken"))

     age gender     education previous_comp_exp tutorial_time time  qID time_taken id
1.1  18   Male Undergraduate      casual_gamer      62.17926    1 sor9   39.61206  1
2.1  24   Male Undergraduate      casual_gamer      85.01288    1 sor9   50.92343  2
1.2  18   Male Undergraduate      casual_gamer      62.17926    2 sor8   19.48920  1
2.2  24   Male Undergraduate      casual_gamer      85.01288    2 sor8   16.15616  2

我认为您也错误地读取了数据。您可以使用我的答案中的新dput - acylam
@useR 发现得好,已修复。 - agstudy

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接