Converting a large dataset with multiple columns from wide format to long format

5

I have a very large dataset that I need to convert from wide format to long format.

My dataset looks like this:

  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010 ... REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010 ... COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900

I want it to look like this:

  COMPANY    PRODUCT    DATE  REVENUES  COSTS
COMPANY A  PRODUCT 1  Dec-16     10600   9850
COMPANY A  PRODUCT 1  Feb-10     11050  10400
COMPANY A  PRODUCT 1  Jan-10      6400   8500
COMPANY A  PRODUCT 1  Mar-10      6550   9100
COMPANY A  PRODUCT 2  Dec-16      3800   3250
COMPANY A  PRODUCT 2  Feb-10      3000   2400
COMPANY A  PRODUCT 2  Jan-10      2700   2850
COMPANY A  PRODUCT 2  Mar-10      2800   3100
COMPANY B  PRODUCT 3  Dec-16      3750   4600
COMPANY B  PRODUCT 3  Feb-10      4150   6100
COMPANY B  PRODUCT 3  Jan-10      5900   4200
COMPANY B  PRODUCT 3  Mar-10      5750   2950
COMPANY B  PRODUCT 4  Dec-16       650    500
COMPANY B  PRODUCT 4  Feb-10       600    700
COMPANY B  PRODUCT 4  Jan-10       550    200
COMPANY B  PRODUCT 4  Mar-10         0    100
COMPANY B  PRODUCT 5  Dec-16      2100    450
COMPANY B  PRODUCT 5  Feb-10      3750   1700
COMPANY B  PRODUCT 5  Jan-10      1500   1850
COMPANY B  PRODUCT 5  Mar-10       550   3150
COMPANY C  PRODUCT 6  Dec-16     21250  23900
COMPANY C  PRODUCT 6  Feb-10     17250  26950
COMPANY C  PRODUCT 6  Jan-10     19300  18200
COMPANY C  PRODUCT 6  Mar-10     23600  18200

In Stata, I would type reshape long REVENUES COSTS, i(COMPANY PRODUCT) j(DATE) string. How do I do this in R?
7 Answers

8
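There are several other ways to handle this that are more streamlined than the "tidyverse" options. All of the examples below use the sample data from JMT2080AD's answer, with set.seed(1) (for reproducibility).

Option 1: Base R's reshape

The reshape function isn't always the easiest to use, but it is very powerful once you understand it. In this case you have no sep, which makes things a bit trickier: you have to be more specific about the resulting variable names and about the values that should appear as the "times" (by default, they are just consecutive numbers).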
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
# Pass varying as a list with one element per output variable, in the same
# order as v.names. A single flat vector is split in alternating order, which
# would mix revenues and cost columns for this column layout.
reshape(yourData, direction = "long", 
        varying = list(grep("^revenues", names(yourData), value = TRUE),
                       grep("^cost", names(yourData), value = TRUE)),
        v.names = c("revenues", "cost"), timevar = "date", times = times)
#             company   product    date revenues cost id
# 1.Jan2010 Company A Product 1 Jan2010     1164 1168  1
# 2.Jan2010 Company A Product 2 Jan2010     1430 1465  2
# 3.Jan2010 Company B Product 3 Jan2010     1932  533  3
# 4.Jan2010 Company B Product 4 Jan2010     2771 1456  4
# 5.Jan2010 Company B Product 5 Jan2010     1004 2674  5
# ....

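This is basically what you are looking for, although the dates are in a different format.

Option 2: data.table

If performance is what you're after, look at melt from "data.table", with which you should be able to do something like the following. As with the reshape approach, you need to store the "times" so that you can reintroduce the dates after melting the data.
(Note: I know this is very similar to @Uwe's approach.)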
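If the "Jan2010"-style labels need to match the "Jan-10" style from the question, one possible way is to parse them into Date objects and re-format them. This is only a sketch: the times vector below is a hypothetical stand-in for the labels produced by the reshape step, and %b assumes an English locale for month abbreviations.

```r
# Hypothetical labels, standing in for the "times" vector built earlier
times <- c("Jan2010", "Feb2010", "Mar2010", "Dec2016")

# Prepend a day so each string is a complete date, then parse it.
# %b matches abbreviated month names in the current locale (English assumed).
parsed <- as.Date(paste0("01", times), format = "%d%b%Y")

# Re-format as "Mon-yy", the style shown in the question's desired output
date_label <- format(parsed, "%b-%y")
date_label
```

Keeping the parsed Date column alongside the label is usually worthwhile, since it sorts chronologically while the text label does not.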
library(data.table)
times <- gsub("revenues", "", grep("revenues", names(yourData), value = TRUE))
melt(as.data.table(yourData), measure.vars = patterns("revenues", "cost"),
     value.name = c("revenues", "cost"))[
       , variable := factor(variable, labels = times)][]
#       company   product variable revenues cost
#  1: Company A Product 1  Jan2010     1164 1168
#  2: Company A Product 2  Jan2010     1430 1465
#  3: Company B Product 3  Jan2010     1932  533
#  4: Company B Product 4  Jan2010     2771 1456
#  5: Company B Product 5  Jan2010     1004 2674
# ---                                           
# 20: Company A Product 2  Apr2010     2444 1883
# 21: Company B Product 3  Apr2010     2837 1824
# 22: Company B Product 4  Apr2010     1030 2473
# 23: Company B Product 5  Apr2010     2129  558
# 24: Company C Product 6  Apr2010      814 1693

Option 3: merged.stack

My "splitstackshape" package has a function called merged.stack that is designed to make this specific kind of reshaping easier. You can try it with:

library(splitstackshape)
merged.stack(yourData, var.stubs = c("revenues", "cost"), sep = "var.stubs")
#       company   product .time_1 revenues cost
#  1: Company A Product 1 Apr2010     1450 2457
#  2: Company A Product 1 Feb2010     2862 1705
#  3: Company A Product 1 Jan2010     1164 1168
#  4: Company A Product 1 Mar2010     2218 2486
#  5: Company A Product 2 Apr2010     2444 1883
#  6: Company A Product 2 Feb2010     2152 1999
#  7: Company A Product 2 Jan2010     1430 1465
#  8: Company A Product 2 Mar2010     1460  770
#  9: Company B Product 3 Apr2010     2837 1824
# 10: Company B Product 3 Feb2010     2073 1734
# ... 

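One day I will update this function. It was written before melt in "data.table" could handle semi-wide output formats. I had worked out a partial solution, but then stalled on it.

In fact, with the function linked above, the solution would simply be: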

ReshapeLong_(yourData, c("revenues", "cost"))

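Option 4: extract from the "tidyverse"

The other solutions using the tidyverse seem to go about this in a rather odd way. A better solution is to use extract to pull the required pieces into new columns. You first need to gather the data into a very long format, and then spread it back into a wider format.

Here is the approach I would use: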

library(tidyverse)
yourData %>% 
  gather(var, val, -company, -product) %>%
  extract(var, into = c("type", "month", "year"), 
          regex = ("(revenues|cost)(...)(.*)")) %>%
  spread(type, val)
#      company   product month year cost revenues
# 1  Company A Product 1   Apr 2010 2457     1450
# 2  Company A Product 1   Feb 2010 1705     2862
# 3  Company A Product 1   Jan 2010 1168     1164
# 4  Company A Product 1   Mar 2010 2486     2218
# 5  Company A Product 2   Apr 2010 1883     2444
# 6  Company A Product 2   Feb 2010 1999     2152
# ...
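To see what the regular expression passed to extract captures, the same pattern can be tried in base R with sub(). This is only an illustration on two hypothetical column names, not part of the answer's pipeline.

```r
# Two hypothetical column names in the same style as the sample data
vars <- c("revenuesJan2010", "costApr2010")

# Group 1: the stub, group 2: exactly three characters, group 3: the rest
pattern <- "(revenues|cost)(...)(.*)"

parts <- data.frame(
  type  = sub(pattern, "\\1", vars),
  month = sub(pattern, "\\2", vars),
  year  = sub(pattern, "\\3", vars),
  stringsAsFactors = FALSE
)
parts
```

Because the second group is exactly three characters wide, this pattern assumes three-letter month abbreviations. A name such as MARCH2010 from the question's data would be split into MAR and CH2010, so the pattern would need adjusting for that case.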

1
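I think the most explicit way to convert wide data to long in R (i.e. without having to rename variables) is to use the base R reshape() function and specify the varying columns to be "stacked" as a list. See this blog post.
I will use the data from JMT2080AD's answer, with the seed set to set.seed(789).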
### Create a list of the variables you want to reshape/stack
reshape.vars <- list(c("revenuesJan2010",   "revenuesFeb2010",  "revenuesMar2010",  "revenuesApr2010"), # revenues
                     c("costJan2010",   "costFeb2010",  "costMar2010",  "costApr2010")) # cost 
### reshape wide to long
reshape(yourData,                      #dataframe
        direction="long",             #wide to long
        varying=reshape.vars, #repeated measures list of indexes for vars to stack/reshape
        timevar="date",              #the repeated measures times
        v.names=c("revenues", "cost")) #the repeated measures names

#     company   product date   revenues cost id
# 1.1 Company A Product 1    1     2250 1574  1
# 2.1 Company A Product 2    1      734 1793  2
# 3.1 Company B Product 3    1      530 1282  3
# 4.1 Company B Product 4    1     1979 1741  4
# 5.1 Company B Product 5    1     1730 2558  5
# 6.1 Company C Product 6    1      550 1757  6
# 1.2 Company A Product 1    2     1932 1048  1
#...
# 5.3 Company B Product 5    3      890 1103  5
# 6.3 Company C Product 6    3     2113 2469  6
# 1.4 Company A Product 1    4     2426 2382  1
# 2.4 Company A Product 2    4      778 2995  2
# 3.4 Company B Product 3    4     1359  989  3
# 4.4 Company B Product 4    4     1618  912  4
# 5.4 Company B Product 5    4      895 2109  5
# 6.4 Company C Product 6    4     1258 2803  6

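With the list approach:
  • You don't have to rename your variables
  • Because the variables you want to create are explicitly defined in the list, reshape() cannot get it wrong about which variables should be stacked

I've found that even with 100+ variables to reshape, where renaming them would be tedious, building the separate variable lists by copy/paste doesn't take too long.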
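As an alternative to copy/paste, the list can also be built from the column names when they share a common stub. The sketch below uses a small made-up data frame (the name wide and its values are hypothetical) and assumes the columns for each stub appear in the same time order.

```r
# Small hypothetical wide data frame in the same style as the sample data
wide <- data.frame(company = c("A", "B"),
                   revenuesJan2010 = c(1, 2), revenuesFeb2010 = c(3, 4),
                   costJan2010 = c(5, 6), costFeb2010 = c(7, 8))

stubs <- c("revenues", "cost")

# One character vector of column names per stub, in the order of stubs
reshape.vars <- lapply(stubs, function(s) {
  grep(paste0("^", s), names(wide), value = TRUE)
})

# Recover the time labels by stripping the first stub from its columns
times <- sub("^revenues", "", reshape.vars[[1]])

long <- reshape(wide, direction = "long", idvar = "company",
                varying = reshape.vars, v.names = stubs,
                timevar = "date", times = times)
long
```

Passing times here also replaces the default 1, 2, ... values in the date column with the labels taken from the column names.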


1
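The tricky part here is that your dates are packed into the column names. They have to be parsed out before you can build the table you want. I loop over each column, parse the date and the observation type out of each sub-table's column name, bind the sub-tables together, and then cast on cost/revenue. I'm sure there are more elegant solutions.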
library(reshape)

## making a table similar to yours here
yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                       product = paste("Product", 1:6),
                       revenuesJan2010 = round(runif(6, 500, 3000)),
                       revenuesFeb2010 = round(runif(6, 500, 3000)),
                       revenuesMar2010 = round(runif(6, 500, 3000)),
                       revenuesApr2010 = round(runif(6, 500, 3000)),
                       costJan2010 = round(runif(6, 500, 3000)),
                       costFeb2010 = round(runif(6, 500, 3000)),
                       costMar2010 = round(runif(6, 500, 3000)),
                       costApr2010 = round(runif(6, 500, 3000)))

## a function that parses the date from the column name
columnParse <- function(tab){
    colNm   <- names(tab)[3]
    names(tab)[3] <- "value"
    colDate  <- strsplit(colNm, "revenues|cost")[[1]][2]
    colDate  <- gsub("([A-Za-z]+)", "\\1-", colDate)
    tab$date <- colDate
    tab$type <- gsub("(revenues|cost).*", "\\1", colNm)
    return(tab)
}

## running that function against sub tables of your data, then binding
yourDataLong <- do.call(rbind,
                        lapply(3:ncol(yourData),
                               function(x) columnParse(yourData[c(1:2, x)])))

## casting your data on cost/revenue
yourDataCast <- cast(yourDataLong, company+product+date~type, value = "value")

As far as I know, the "reshape" package is not currently under active development. You might want to move to "reshape2", "data.table", or "tidyr"... - A5C1D2H2I1M1N2O1R2T1

1

Here is another option, using tidyverse and stringr

yourData <- data.frame(company = c(rep("Company A", 2), rep("Company B", 3), rep("Company C")),
                   product = paste("Product", 1:6),
                   REVENUESJan2010 = round(runif(6, 500, 3000)),
                   REVENUESFeb2010 = round(runif(6, 500, 3000)),
                   REVENUESMar2010 = round(runif(6, 500, 3000)),
                   REVENUESApr2010 = round(runif(6, 500, 3000)),
                   COSTSJan2010 = round(runif(6, 500, 3000)),
                   COSTSFeb2010 = round(runif(6, 500, 3000)),
                   COSTSMar2010 = round(runif(6, 500, 3000)),
                   COSTSApr2010 = round(runif(6, 500, 3000)))

The solution using tidyverse and stringr:

library(tidyverse)
library(stringr)

newData <- yourData %>%
   gather(key = rev.cost.date, value, -company, -product) %>%
   mutate(finance.type = ifelse(str_detect(rev.cost.date, fixed("REVENUES")), "REVENUES", "COSTS")) %>%
   mutate(date = str_replace(rev.cost.date, "REVENUES|COSTS", "")) %>%
   select(-rev.cost.date) %>%
   spread(value = value, key = finance.type) %>%
   mutate(date = paste0(str_sub(date, 1, 3), "-", str_sub(date, 4, 8)))

1
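As of version 1.9.6 (on CRAN 19 Sep 2015), data.table can melt multiple columns simultaneously (using the patterns() function). So the columns starting with REVENUES and COSTS can be gathered into two separate columns.
In addition, the dates (months) are packed into the column names without a separator. They are extracted from the column names using a regular expression with a look-behind and are then used to replace the factor levels of the DATE column.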
library(data.table)
library(magrittr)
cols <- c("REVENUES", "COSTS")
long <- melt(wide, measure.vars = patterns(cols), value.name = cols, variable.name = "DATE")
months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() 
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT      DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1   JAN2010     6400  8500
 2: COMPANY A PRODUCT 2   JAN2010     2700  2850
 3: COMPANY B PRODUCT 3   JAN2010     5900  4200
 4: COMPANY B PRODUCT 4   JAN2010      550   200
 5: COMPANY B PRODUCT 5   JAN2010     1500  1850
 6: COMPANY C PRODUCT 6   JAN2010    19300 18200
 7: COMPANY A PRODUCT 1   FEB2010    11050 10400
 8: COMPANY A PRODUCT 2   FEB2010     3000  2400
 9: COMPANY B PRODUCT 3   FEB2010     4150  6100
10: COMPANY B PRODUCT 4   FEB2010      600   700
11: COMPANY B PRODUCT 5   FEB2010     3750  1700
12: COMPANY C PRODUCT 6   FEB2010    17250 26950
13: COMPANY A PRODUCT 1 MARCH2010     6550  9100
14: COMPANY A PRODUCT 2 MARCH2010     2800  3100
15: COMPANY B PRODUCT 3 MARCH2010     5750  2950
16: COMPANY B PRODUCT 4 MARCH2010        0   100
17: COMPANY B PRODUCT 5 MARCH2010      550  3150
18: COMPANY C PRODUCT 6 MARCH2010    23600 18200
19: COMPANY A PRODUCT 1   DEC2016    10600  9850
20: COMPANY A PRODUCT 2   DEC2016     3800  3250
21: COMPANY B PRODUCT 3   DEC2016     3750  4600
22: COMPANY B PRODUCT 4   DEC2016      650   500
23: COMPANY B PRODUCT 5   DEC2016     2100   450
24: COMPANY C PRODUCT 6   DEC2016    21250 23900
      COMPANY   PRODUCT      DATE REVENUES COSTS
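
The look-behind extraction used above can also be tried in isolation with base R, which may help when checking the pattern against other column names. This is a sketch on a hypothetical subset of the column names; perl = TRUE is needed because look-behind is a Perl-style regex feature.

```r
# Hypothetical subset of the wide table's column names
cols <- c("COMPANY", "PRODUCT", "REVENUESJAN2010",
          "REVENUESMARCH2010", "COSTSJAN2010")

# (?<=REVENUES) requires "REVENUES" immediately before the match,
# without making it part of the match itself
m <- regexpr("(?<=REVENUES)\\w*$", cols, perl = TRUE)

# regmatches() keeps only the elements that actually matched
months <- regmatches(cols, m)
months
```

Only the REVENUES columns contribute a label, which is what is wanted here because each month should appear once in the factor levels.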

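Edit: using an ISO month naming scheme for correct sorting

The naming scheme with alphabetic month names and year does not sort the data by DATE correctly: DEC2016 comes before FEB2010, and FEB2010 comes before JAN2010. The ISO 8601 naming convention puts the year before the month number.

We can use this naming scheme as follows: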

months <- names(wide) %>% stringr::str_extract("(?<=REVENUES)\\w*$") %>% na.omit() %>%
  paste0("01", .) %>% lubridate::dmy() %>% format("%Y-%m")
long[, DATE := forcats::lvls_revalue(DATE, months)]
long
      COMPANY   PRODUCT    DATE REVENUES COSTS
 1: COMPANY A PRODUCT 1 2010-01     6400  8500
 2: COMPANY A PRODUCT 2 2010-01     2700  2850
 3: COMPANY B PRODUCT 3 2010-01     5900  4200
 4: COMPANY B PRODUCT 4 2010-01      550   200
 5: COMPANY B PRODUCT 5 2010-01     1500  1850
 6: COMPANY C PRODUCT 6 2010-01    19300 18200
 7: COMPANY A PRODUCT 1 2010-02    11050 10400
 8: COMPANY A PRODUCT 2 2010-02     3000  2400
 9: COMPANY B PRODUCT 3 2010-02     4150  6100
10: COMPANY B PRODUCT 4 2010-02      600   700
11: COMPANY B PRODUCT 5 2010-02     3750  1700
12: COMPANY C PRODUCT 6 2010-02    17250 26950
13: COMPANY A PRODUCT 1 2010-03     6550  9100
14: COMPANY A PRODUCT 2 2010-03     2800  3100
15: COMPANY B PRODUCT 3 2010-03     5750  2950
16: COMPANY B PRODUCT 4 2010-03        0   100
17: COMPANY B PRODUCT 5 2010-03      550  3150
18: COMPANY C PRODUCT 6 2010-03    23600 18200
19: COMPANY A PRODUCT 1 2016-12    10600  9850
20: COMPANY A PRODUCT 2 2016-12     3800  3250
21: COMPANY B PRODUCT 3 2016-12     3750  4600
22: COMPANY B PRODUCT 4 2016-12      650   500
23: COMPANY B PRODUCT 5 2016-12     2100   450
24: COMPANY C PRODUCT 6 2016-12    21250 23900
      COMPANY   PRODUCT    DATE REVENUES COSTS
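
The sorting difference between the two naming schemes can be checked directly on the labels. The snippet below is a small illustration with the four months from the sample data; method = "radix" is used only to make the ordering independent of the locale.

```r
# Alphabetic month labels, as taken from the column names
labels <- c("JAN2010", "FEB2010", "MARCH2010", "DEC2016")

# Year-month labels for the same four months
iso <- c("2010-01", "2010-02", "2010-03", "2016-12")

# Alphabetical sorting puts DEC2016 first, which is not chronological
sorted_labels <- sort(labels, method = "radix")

# Sorting the ISO-style labels as plain text is already chronological
sorted_iso <- sort(rev(iso), method = "radix")

sorted_labels
sorted_iso
```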

Data

library(data.table)
wide <- data.table(
readr::read_table(
"  COMPANY   PRODUCT REVENUESJAN2010 REVENUESFEB2010 REVENUESMARCH2010     REVENUESDEC2016 COSTSJAN2010 COSTSFEB2010 COSTSMARCH2010     COSTSDEC2016
COMPANY A PRODUCT 1            6400           11050              6550               10600         8500        10400           9100             9850
COMPANY A PRODUCT 2            2700            3000              2800                3800         2850         2400           3100             3250
COMPANY B PRODUCT 3            5900            4150              5750                3750         4200         6100           2950             4600
COMPANY B PRODUCT 4             550             600                 0                 650          200          700            100              500
COMPANY B PRODUCT 5            1500            3750               550                2100         1850         1700           3150              450
COMPANY C PRODUCT 6           19300           17250             23600               21250        18200        26950          18200            23900"
))

1

An option using the development version of tidyr (version '0.8.3.9000')

library(dplyr)
library(tidyr)
library(stringr)
library(zoo)
library(readr)

df1 %>% 
   rename_at(3:ncol(.), ~ str_replace(., "^(REVENUES|COSTS)", "\\1_")) %>%
   pivot_longer(c(-COMPANY, -PRODUCT), names_to = c(".value", "DATE"), names_sep = "_") %>% 
   mutate(DATE = format(as.yearmon(DATE), "%b-%Y"))
# A tibble: 24 x 5
#   COMPANY   PRODUCT   DATE     REVENUES COSTS
#   <chr>     <chr>     <chr>       <dbl> <dbl>
# 1 COMPANY A PRODUCT 1 Jan-2010     6400  8500
# 2 COMPANY A PRODUCT 1 Feb-2010    11050 10400
# 3 COMPANY A PRODUCT 1 Mar-2010     6550  9100
# 4 COMPANY A PRODUCT 1 Dec-2016    10600  9850
# 5 COMPANY A PRODUCT 2 Jan-2010     2700  2850
# 6 COMPANY A PRODUCT 2 Feb-2010     3000  2400
# 7 COMPANY A PRODUCT 2 Mar-2010     2800  3100
# 8 COMPANY A PRODUCT 2 Dec-2016     3800  3250
# 9 COMPANY B PRODUCT 3 Jan-2010     5900  4200
#10 COMPANY B PRODUCT 3 Feb-2010     4150  6100
# … with 14 more rows

Data

df1 <- structure(list(COMPANY = c("COMPANY A", "COMPANY A", "COMPANY B", 
"COMPANY B", "COMPANY B", "COMPANY C"), PRODUCT = c("PRODUCT 1", 
"PRODUCT 2", "PRODUCT 3", "PRODUCT 4", "PRODUCT 5", "PRODUCT 6"
), REVENUESJAN2010 = c(6400, 2700, 5900, 550, 1500, 19300), REVENUESFEB2010 = c(11050, 
3000, 4150, 600, 3750, 17250), REVENUESMARCH2010 = c(6550, 2800, 
5750, 0, 550, 23600), REVENUESDEC2016 = c(10600, 3800, 3750, 
650, 2100, 21250), COSTSJAN2010 = c(8500, 2850, 4200, 200, 1850, 
18200), COSTSFEB2010 = c(10400, 2400, 6100, 700, 1700, 26950), 
    COSTSMARCH2010 = c(9100, 3100, 2950, 100, 3150, 18200), COSTSDEC2016 = c(9850, 
    3250, 4600, 500, 450, 23900)), class = c("spec_tbl_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
    cols = list(COMPANY = structure(list(), class = c("collector_character", 
    "collector")), PRODUCT = structure(list(), class = c("collector_character", 
    "collector")), REVENUESJAN2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESFEB2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESMARCH2010 = structure(list(), class = c("collector_double", 
    "collector")), REVENUESDEC2016 = structure(list(), class = c("collector_double", 
    "collector")), COSTSJAN2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSFEB2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSMARCH2010 = structure(list(), class = c("collector_double", 
    "collector")), COSTSDEC2016 = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

0
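As someone who moved from Stata to R, I liked using reshape in Stata, but I find tidyr::gather and tidyr::spread very intuitive. Gather is basically reshape long, and spread is reshape wide.
Here is code that will change your data the way you want: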
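library(tidyr)  # gather, separate, spread; also re-exports %>%

# your_data_frame is a placeholder for the wide data from the question
new_data <- 
gather(data = your_data_frame, 
       key = var_holder,
       value = val_holder,
       -COMPANY,
       -PRODUCT) 

# insert a separator between the stub and the date so separate() can split them
new_data$var_holder <- sub("REVENUES", "REVENUES_", new_data$var_holder)
new_data$var_holder <- sub("COSTS", "COSTS_", new_data$var_holder)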

new_data <- 
    separate(data = new_data,
             col = var_holder,
             into = c("var", "date")) %>%
    spread(key = var,
           value = val_holder)

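Done!

gather works by taking all the variables you specify (or, in this case, all except the two listed with a "-" sign in front) and placing their names under one new variable, whose name is given by "key = ..." (creating new rows as it goes). It then places the values that fell under those variables into a separate variable, whose name is given by "value = ...".

spread goes in the opposite direction. Hope this helps!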
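
The round trip described above can be seen on a very small table. This is a minimal sketch with made-up data (the names wide, long and back are hypothetical); it assumes the tidyr package is installed, where gather and spread are still available although newer releases recommend pivot_longer and pivot_wider.

```r
library(tidyr)

# Tiny hypothetical wide table: one id column and two measure columns
wide <- data.frame(id = c(1, 2), x = c(10, 20), y = c(30, 40))

# gather: column names x and y go into "var", their values into "val"
long <- gather(wide, key = "var", value = "val", -id)

# spread: the opposite direction, one column per distinct value of var
back <- spread(long, key = var, value = val)

long
back
```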

