pivot_wider issue: "Values in `values_from` are not uniquely identified; output will contain list-cols"

75

My data looks like this:

# A tibble: 6 x 4
  name          val time          x1
  <chr>       <dbl> <date>     <dbl>
1 C Farolillo     7 2016-04-20  51.5
2 C Farolillo     3 2016-04-21  56.3
3 C Farolillo     7 2016-04-22  56.3
4 C Farolillo    13 2016-04-23  57.9
5 C Farolillo     7 2016-04-24  58.7
6 C Farolillo     9 2016-04-25  59.0

I am trying to use the pivot_wider function to spread the data based on the name column. I use the following code:

yy <- d %>% 
  pivot_wider(., names_from = name, values_from = val)

This gives me the following warning message:

Warning message:
Values in `val` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list(val = list)` to suppress this warning.
* Use `values_fn = list(val = length)` to identify where the duplicates arise
* Use `values_fn = list(val = summary_fun)` to summarise duplicates

The output looks like this:

       time       x1        out1    out2 
    2016-04-20  51.50000    <dbl>   <dbl>
2   2016-04-21  56.34615    <dbl>   <dbl>
3   2016-04-22  56.30000    <dbl>   <dbl>
4   2016-04-23  57.85714    <dbl>   <dbl>
5   2016-04-24  58.70968    <dbl>   <dbl>
6   2016-04-25  58.96774    <dbl>   <dbl>

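I know this problem is mentioned here, where the suggestion is to use summary statistics to solve it. However, I have time series data and therefore do not want to use summary statistics, since each day has only one value (not multiple values).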

I know the problem is that the val column has duplicated values (i.e. in the example above, 7 occurs 3 times).

Any suggestions on how to pivot_wider and get around this issue?

Data:

    d <- structure(list(name = c("C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica"), val = c(7, 3, 7, 13, 7, 
9, 20, 19, 4, 5, 5, 2, 6, 6, 16, 13, 7, 6, 3, 3, 6, 10, 5, 3, 
5, 3, 4, 4, 10, 11, 4, 13, 8, 2, 8, 10, 3, 10, 14, 4, 2, 4, 6, 
6, 8, 8, 3, 3, 13, 10, 13, 32, 25, 31, 34, 26, 33, 35, 43, 22, 
22, 21, 10, 33, 33, 48, 47, 27, 23, 11, 13, 25, 31, 20, 16, 10, 
9, 23, 11, 23, 26, 16, 34, 17, 4, 24, 21, 10, 26, 32, 10, 5, 
9, 19, 14, 27, 27, 10, 8, 28, 32, 25), time = structure(c(16911, 
16912, 16913, 16914, 16915, 16916, 16917, 16918, 16919, 16920, 
16921, 16922, 16923, 16923, 16924, 16925, 16926, 16927, 16928, 
16929, 16930, 16931, 16932, 16933, 16934, 16935, 16936, 16937, 
16938, 16939, 16940, 16941, 16942, 16943, 16944, 16945, 16946, 
16947, 16948, 16949, 16950, 16951, 16952, 16953, 16954, 16955, 
16956, 16957, 16958, 16959, 16960, 16911, 16912, 16913, 16914, 
16915, 16916, 16917, 16918, 16919, 16920, 16921, 16922, 16923, 
16923, 16924, 16925, 16926, 16927, 16928, 16929, 16930, 16931, 
16932, 16933, 16934, 16935, 16936, 16937, 16938, 16939, 16940, 
16941, 16942, 16943, 16944, 16945, 16946, 16947, 16948, 16949, 
16950, 16951, 16952, 16953, 16954, 16955, 16956, 16957, 16958, 
16959, 16960), class = "Date"), x1 = c(51.5, 56.3461538461538, 
56.3, 57.8571428571429, 58.7096774193548, 58.9677419354839, 64.4615384615385, 
61.9310344827586, 60.3214285714286, 59.4137931034483, 59.5806451612903, 
57.3448275862069, 64.0333333333333, 64.0333333333333, 70.15625, 
71.3636363636364, 62.8125, 56.4375, 56.4516129032258, 51.741935483871, 
52.84375, 53.09375, 52.969696969697, 54, 54.3870967741936, 60.3870967741936, 
64.4516129032258, 66.2903225806452, 68.2333333333333, 69.7741935483871, 
70.5806451612903, 73.8275862068966, 72.8181818181818, 64.6764705882353, 
64.4838709677419, 68.7741935483871, 62.1764705882353, 68.969696969697, 
70.1935483870968, 59.6774193548387, 59.9677419354839, 63.125, 
67.5882352941177, 71.4705882352941, 73.8529411764706, 76.1935483870968, 
72.6451612903226, 76.0645161290323, 76.4193548387097, 81.7741935483871, 
85.0645161290323, 51.5, 56.3461538461538, 56.3, 57.8571428571429, 
58.7096774193548, 58.9677419354839, 64.4615384615385, 61.9310344827586, 
60.3214285714286, 59.4137931034483, 59.5806451612903, 57.3448275862069, 
64.0333333333333, 64.0333333333333, 70.15625, 71.3636363636364, 
62.8125, 56.4375, 56.4516129032258, 51.741935483871, 52.84375, 
53.09375, 52.969696969697, 54, 54.3870967741936, 60.3870967741936, 
64.4516129032258, 66.2903225806452, 68.2333333333333, 69.7741935483871, 
70.5806451612903, 73.8275862068966, 72.8181818181818, 64.6764705882353, 
64.4838709677419, 68.7741935483871, 62.1764705882353, 68.969696969697, 
70.1935483870968, 59.6774193548387, 59.9677419354839, 63.125, 
67.5882352941177, 71.4705882352941, 73.8529411764706, 76.1935483870968, 
72.6451612903226, 76.0645161290323, 76.4193548387097, 81.7741935483871, 
85.0645161290323)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-102L))
6 Answers

124
Create a unique row identifier for each name and then use pivot_wider:
library(dplyr)

d %>%
  group_by(name) %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_wider(names_from = name, values_from = val) %>%
  select(-row)

# A tibble: 51 x 4
#   time          x1 `C Farolillo` `Plaza Eliptica`
#   <date>     <dbl>         <dbl>            <dbl>
# 1 2016-04-20  51.5             7               32
# 2 2016-04-21  56.3             3               25
# 3 2016-04-22  56.3             7               31
# 4 2016-04-23  57.9            13               34
# 5 2016-04-24  58.7             7               26
# 6 2016-04-25  59.0             9               33
# 7 2016-04-26  64.5            20               35
# 8 2016-04-27  61.9            19               43
# 9 2016-04-28  60.3             4               22
#10 2016-04-29  59.4             5               22
# … with 41 more rows

1
This gives me an error: Error in select(): ! Can't subset columns that don't exist. ✖ Column `row` doesn't exist. - amy989

27

Typically, this warning

Warning message:
Values in `val` are not uniquely identified; output will contain list-cols.

is most often caused by duplicate rows in the data (after excluding the val column), not by duplicates within the val column itself.

which(duplicated(d))
# [1] 14 65

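The OP's data appears to contain two duplicated rows, which causes this issue. Removing the duplicate rows also gets rid of the warning.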

yy <- d %>% distinct() %>% pivot_wider(., names_from = name, values_from = val)
yy
# A tibble: 50 x 4
   time          x1 `C Farolillo` `Plaza Eliptica`
   <date>     <dbl>         <dbl>            <dbl>
 1 2016-04-20  51.5             7               32
 2 2016-04-21  56.3             3               25
 3 2016-04-22  56.3             7               31
 4 2016-04-23  57.9            13               34
 5 2016-04-24  58.7             7               26
 6 2016-04-25  59.0             9               33
 7 2016-04-26  64.5            20               35
 8 2016-04-27  61.9            19               43
 9 2016-04-28  60.3             4               22
10 2016-04-29  59.4             5               22
# ... with 40 more rows
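
To see which name/time combinations are behind the warning before deciding how to handle them, one option is to count rows per key. The sketch below is only illustrative: `d_small` is a hypothetical hand-built subset (values taken from the rows around 2016-05-02 in the OP's data, with x1 rounded), not the full `d`.

```r
library(dplyr)
library(tidyr)

# Hypothetical subset of the OP's data: one clean day and one duplicated day
d_small <- tibble(
  name = rep(c("C Farolillo", "Plaza Eliptica"), each = 3),
  val  = c(2, 6, 6, 10, 33, 33),
  time = as.Date(rep(c("2016-05-01", "2016-05-02", "2016-05-02"), times = 2)),
  x1   = rep(c(57.3, 64.0, 64.0), times = 2)
)

# Key combinations that occur more than once
dupes <- d_small %>%
  count(name, time) %>%
  filter(n > 1)
dupes

# Same idea via the hint in the warning: each cell holds the number of values
counts <- d_small %>%
  pivot_wider(names_from = name, values_from = val, values_fn = length)
counts
```

Any cell greater than 1 in `counts`, or any row in `dupes`, marks a date where a series has more than one observation.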

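I wouldn't call the other solution a "quick/dirty fix", because there are many valid cases where multiple values per time point are allowed and it is the correct approach; but since the OP says there should be only one value per time point, your solution addresses the duplicate entries. - Gilean0709
Agreed, it could be very useful if there are rows that differ only in the value column. - Ameer
Removing duplicate rows from the dataset would cause me to lose time series information. The data contains two different time series, "C Farolillo" and "Plaza Eliptica", which happen to have the same value on the same day. It is not a true duplicate, just a coincidence. - user113156
Try d[c(13,14),], which returns the following two rows: [1] 13 C Farolillo 6 2016-05-02 64.03333 [2] 14 C Farolillo 6 2016-05-02 64.03333. These are two identical observations for C Farolillo on the same day, so in my view they are duplicates. Then check the other pair of observations with d[c(64,65),]. - Ameer
I think this is the right answer. Before pivoting, we can aggregate the duplicate rows in some way. For example, we can use group_by(name, x1) %>% summarise(x1 = sum(x1)), or use mean in place of sum. At least in practice, this is the use case I usually run into. - passerby51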

12
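The cause of the problem is that the data you want to spread/pivot has duplicated identifiers. The two suggestions above, namely creating a unique artificial ID from row numbers with mutate(row = row_number()) or keeping only distinct rows, will let you pivot wider, but they change the structure of your table and most likely leave a logical organisation problem that will surface the next time you try to join anything to it.
A better practice is to use the id_cols argument explicitly, so you can see what you actually want to be unique after pivoting, and to reorganise the original table first if you run into problems. Of course, you may find that you do need to filter for distinct rows or add a new ID, but it is best to deal with the duplication early in your code.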
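
As a rough illustration of the id_cols idea, the sketch below declares time as the only identifier, so it is explicit which column must be unique per output row. It uses a hypothetical hand-built subset, `d_small` (values taken from the rows around 2016-05-02 in the OP's data, with x1 rounded), and drops exact duplicate rows first, which is only appropriate when they are unintended repeats.

```r
library(dplyr)
library(tidyr)

# Hypothetical subset of the OP's data: one clean day and one duplicated day
d_small <- tibble(
  name = rep(c("C Farolillo", "Plaza Eliptica"), each = 3),
  val  = c(2, 6, 6, 10, 33, 33),
  time = as.Date(rep(c("2016-05-01", "2016-05-02", "2016-05-02"), times = 2)),
  x1   = rep(c(57.3, 64.0, 64.0), times = 2)
)

wide <- d_small %>%
  distinct() %>%               # remove exact repeats before pivoting
  pivot_wider(
    id_cols     = time,        # the only column that identifies an output row
    names_from  = name,
    values_from = val
  )
wide
```

Because x1 is not listed in id_cols, names_from or values_from, it is dropped from the result; add it to id_cols if it should be kept.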

1
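I have a problem similar to the one above, but none of these solutions seem to work for me. I have duplicated values, most likely because my data involve different ratings at different time points. I tried using id_cols, but that did not work either. - Con Des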
2
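In that case, your observations clearly must be unique once time is included. So id_cols must account for all possible time observations. One way to achieve this is to combine <your_id>_<time> into a single ID, or to create a unique row ID explicitly. - Daniel Antal
I have tried this, but I am not sure how to do it in long format first, before using pivot_wider. For some reason, the same ID number is sometimes assigned to two observations. - Con Des
So I don't want to get rid of the duplicates; instead, I want to change the duplicated ID numbers. - Con Des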
@ConDes Have you tried something like this: df_wide %>% group_by(old_ID, time_point) %>% mutate(new_ID = paste0(old_ID, "_", seq_len(n()))) - fabern

8

Although it cannot be seen in the OP's example, in some cases the accepted answer duplicates rows unnecessarily. The following approach avoids that:

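library(dplyr)
library(tidyr)

# Collect duplicated values into list-cols, then expand them back into rows
d %>%
  pivot_wider(names_from = name, values_from = val, values_fn = list) %>%
  unnest(cols = everything())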

If the values are already list columns, set values_fn = list to avoid warnings and errors.
Example:
d1 <- tail(d)[1:5,]
d5<-d1
d5$name<-"some"
withlist  <- tibble(d1, l = list(c(1,2),c(1,2),c(1,2),c(1,2),c(1,2) ) )
withlist2 <- tibble(d5, l = list( list(1,2),list(1,2),list(1,2,3),list(1,2),list(1,2) ) )
withl     <- rbind(withlist,withlist2)

res<-withl %>%
  pivot_wider(names_from = name, values_from = l
              , values_fn = list)     
as.data.frame(res)

#  val       time    x1 Plaza Eliptica    some
#1  27 2016-06-03 76.19           1, 2    1, 2
#2  10 2016-06-04 72.65           1, 2    1, 2
#3   8 2016-06-05 76.06           1, 2 1, 2, 3
#4  28 2016-06-06 76.42           1, 2    1, 2
#5  32 2016-06-07 81.77           1, 2    1, 2

-2

A bit late, but there is still an option that keeps the non-unique observations and still pivots:

table(d$name) # get the unique names_from and frequencies
# 
#    C Farolillo Plaza Eliptica 
#             51             51  

(d2 <- d %>% mutate(rno = rep(1:51, 2)) %>% 
                  # repeat 1:51 2 times; unique id by names_from

      pivot_wider(names_from = name, values_from = val))
    # # A tibble: 51 × 5
    #    time          x1   rno `C Farolillo` `Plaza Eliptica`
    #    <date>     <dbl> <int>         <dbl>            <dbl>
    #  1 2016-04-20  51.5     1             7               32
    #  2 2016-04-21  56.3     2             3               25
    #  3 2016-04-22  56.3     3             7               31
    #  4 2016-04-23  57.9     4            13               34
    #  5 2016-04-24  58.7     5             7               26
    #  6 2016-04-25  59.0     6             9               33
    #  7 2016-04-26  64.5     7            20               35
    #  8 2016-04-27  61.9     8            19               43
    #  9 2016-04-28  60.3     9             4               22
    # 10 2016-04-29  59.4    10             5               22
    # # … with 41 more rows 

-4
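I guess your dataset contains unintended duplicates. Rows 13 and 14 are exactly the same observation. Just correct the dataset. You can inspect your d and yy datasets to see which observations cause the problem.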
