Data management: flattening data with R

3

I have the following data frame, which records how policies evolve over time:

Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
                 date_new = c("20200101","20200115","20200304","20200110","20200215"),
                 date_end = c("20200503","20200608","20210101","20200403","20200503"),
                 expend = c("","A_001","A_002","",""))

It looks like this:

  Id_policy date_new date_end expend
     A_001 20200101 20200503       
     A_002 20200115 20200608  A_001
     A_003 20200304 20210101  A_002
     B_001 20200110 20200403       
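     B_002 20200215 20200503

"Id_policy" identifies a specific policy, "date_new" is the date the policy was issued, and "date_end" is the date the policy ends. Sometimes, however, a policy is extended. In that case a new policy is created, and the variable "expend" holds the name of the policy it extends.
The idea is to collapse the data set so that only the rows corresponding to distinct policies remain. The output would therefore look something like this:
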
  Id_policy date_new date_end expend
     A_001 20200101 20210101       
     B_001 20200110 20200403       
     B_002 20200215 20200503     

Has anyone run into a similar problem?

3 Answers

3
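One approach is to treat this as a network problem and use igraph functions (related posts, e.g. Make a group_indices based on several columns; Fast way to group variables based on direct and indirect similarities in multiple columns).
  1. Set missing 'expend' values to 'Id_policy'.

  2. Use graph_from_data_frame to create a graph, where the 'expend' and 'Id_policy' columns are treated as an edge list.

  3. Use components to get the connected components of the graph, i.e. which 'Id_policy' are connected, directly or indirectly.

  4. Select the membership element to get "the cluster id to which each vertex belongs".

  5. Join the membership to the original data.

  6. Grab the relevant data, grouped by membership.

I use data.table for the data wrangling steps, but of course this can also be done in base or dplyr.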

library(data.table)
library(igraph)

setDT(Df)
Df[expend ==  "", expend := Id_policy]

g = graph_from_data_frame(Df[ , .(expend, Id_policy)])
mem = components(g)$membership

Df[.(names(mem)), on = .(Id_policy), mem := mem]    

Df[ , .(Id_policy = Id_policy[1],
        date_new = first(date_new),
        date_end = last(date_end)), by = mem]
#    mem Id_policy date_new date_end
# 1:   1     A_001 20200101 20210101
# 2:   2     B_001 20200110 20200403
# 3:   3     B_002 20200215 20200503
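
For readers who prefer not to use data.table, the six steps above could be sketched in base R roughly as follows. This is only an illustration, not part of the original answer. It assumes the yyyymmdd strings sort chronologically and that the earliest "date_new" in a group belongs to the original policy; the names `edges` and `res` are made up for the example.

```r
library(igraph)

Df <- data.frame(
  Id_policy = c("A_001", "A_002", "A_003", "B_001", "B_002"),
  date_new  = c("20200101", "20200115", "20200304", "20200110", "20200215"),
  date_end  = c("20200503", "20200608", "20210101", "20200403", "20200503"),
  expend    = c("", "A_001", "A_002", "", ""),
  stringsAsFactors = FALSE
)

# Steps 1-2: a policy without a predecessor points at itself (self-loop),
# then the two columns are used as an edge list
edges <- data.frame(
  from = ifelse(Df$expend == "", Df$Id_policy, Df$expend),
  to   = Df$Id_policy,
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges)

# Steps 3-5: component id per vertex, looked up by policy name
mem <- components(g)$membership
Df$mem <- unname(mem[Df$Id_policy])

# Step 6: one row per component (assumption: yyyymmdd strings compare
# chronologically, so min()/max() give earliest start and latest end)
res <- do.call(rbind, lapply(split(Df, Df$mem), function(d) {
  data.frame(Id_policy = d$Id_policy[order(d$date_new)[1]],
             date_new  = min(d$date_new),
             date_end  = max(d$date_end),
             stringsAsFactors = FALSE)
}))
rownames(res) <- NULL
res
```

Using min() and max() instead of the first and last row makes the result independent of row order, at the cost of assuming that an extension never ends earlier than the policy it extends.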

2
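Here is a solution that uses igraph to create a directed network of the IDs, and data.table for the binding and joining. I left the intermediate results in place to show what each step does.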
library( data.table )
library( igraph )
setDT(Df)
#create nodes and links
nodes <- Df[,1:3]
links <- Df[ !expend == "", .(from = expend, to = Id_policy) ]
g = graph_from_data_frame( links, vertices = nodes, directed = TRUE )
plot(g)


#find nodes without incoming (these are startpoints of paths)
in.nodes <- V(g)[degree(g, mode = 'in') == 0]
#define subcomponents of the graph by looping over the in.nodes
L <- lapply( in.nodes, function(x) names( subcomponent(g, x) ) )
# $A_001
# [1] "A_001" "A_002" "A_003"
# $B_001
# [1] "B_001"
# $B_002
# [1] "B_002"
L2 <- lapply( L, function(x) {
  #get first and last element
  dt <- data.table( start = x[1], end = x[ length(x) ] )
})
#bind list together to a single data.table
ans <- rbindlist( L2, use.names = TRUE, fill = TRUE, idcol = "Id_policy" )
#    Id_policy start   end
# 1:     A_001 A_001 A_003
# 2:     B_001 B_001 B_001
# 3:     B_002 B_002 B_002

#update ans with values from original Df for start and end
ans[ Df, `:=`( start = i.date_new ), on = .(start = Id_policy) ][]
ans[ Df, `:=`( end   = i.date_end ), on = .(end = Id_policy) ][]
#    Id_policy    start      end
# 1:     A_001 20200101 20210101
# 2:     B_001 20200110 20200403
# 3:     B_002 20200215 20200503
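
The code above takes the first and last element of each subcomponent, which works for a simple chain like the one in the example. A possible variant, sketched here under the assumption that the end of a chain is simply the latest "date_end" among the policies reachable from its starting policy, follows the edges only in the "out" direction and reads the dates from the vertex attributes. The names `roots` and `res` are hypothetical.

```r
library(igraph)

Df <- data.frame(
  Id_policy = c("A_001", "A_002", "A_003", "B_001", "B_002"),
  date_new  = c("20200101", "20200115", "20200304", "20200110", "20200215"),
  date_end  = c("20200503", "20200608", "20210101", "20200403", "20200503"),
  expend    = c("", "A_001", "A_002", "", ""),
  stringsAsFactors = FALSE
)

# Edges run from the extended policy to the policy that extends it;
# the dates are stored as vertex attributes
links <- Df[Df$expend != "", c("expend", "Id_policy")]
g <- graph_from_data_frame(links,
                           vertices = Df[, c("Id_policy", "date_new", "date_end")],
                           directed = TRUE)

# Policies with no incoming edge are the starting points of the chains
roots <- V(g)$name[degree(g, mode = "in") == 0]

res <- do.call(rbind, lapply(roots, function(r) {
  # all vertices reachable from r when following edge direction
  chain <- subcomponent(g, r, mode = "out")
  data.frame(Id_policy = r,
             date_new  = V(g)[r]$date_new,
             # assumption: yyyymmdd strings compare chronologically
             date_end  = max(chain$date_end),
             stringsAsFactors = FALSE)
}))
res
```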

0
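An outer for loop that iterates over each policy ID in Df, combined with an inner while loop that finds the last extension of the original policy, should do the job.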
library(dplyr)  # provides bind_rows()

Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
                 date_new = c("20200101","20200115","20200304","20200110","20200215"),
                 date_end = c("20200503","20200608","20210101","20200403","20200503"),
                 expend = c("","A_001","A_002","",""),
                 stringsAsFactors = F)
final_df <- data.frame(matrix(nrow = 0, ncol = 0), stringsAsFactors = F)

for (i in seq_len(nrow(Df))) {
  # Check to see if the current policy ID is in the column expend
  if (Df$Id_policy[i] %in% Df$expend || !Df$expend[i] == "") {
    # Loop through expend policy until last one is found
    found_last <- F
    j <- i
    end_date <- ""

    c_policy_id <- Df$Id_policy[j]
    expended_id <- Df$Id_policy[which(Df$expend == c_policy_id)]

    if (length(expended_id) > 0) {
      if (expended_id %in% Df$expend) {
        while(!found_last) {
          c_policy_id <- Df$Id_policy[j]
          expended_id <- Df$Id_policy[which(Df$expend == c_policy_id)]

          if (length(expended_id) > 0) {
            if (expended_id %in% Df$expend) {

              j <- which(Df$expend == expended_id)
            }
          }else{
            end_date <- Df$date_end[j]
            found_last <- T
          }
        }
        if (!end_date == "") {
          # Add to final df when found the last one
          final_df <- bind_rows(final_df, data.frame(Id_policy = Df$Id_policy[i],
                                                     date_new = Df$date_new[i],
                                                     date_end = end_date,
                                                     expend = ""))
        }
      }
    }
  }else{
    final_df <- bind_rows(final_df, Df[i, ])

  }
}

final_df

 Id_policy date_new date_end expend
1     A_001 20200101 20210101       
2     B_001 20200110 20200403       
3     B_002 20200215 20200503
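
The same chain-following idea can also be sketched more compactly in base R, by walking each policy back to the policy it ultimately extends and then aggregating per original policy. This is a rough illustration and not the code from the answer above. It assumes that every non-empty "expend" value also appears in "Id_policy", that there are no cycles, and that the yyyymmdd strings sort chronologically; `find_root` and the `root` column are hypothetical names.

```r
Df <- data.frame(
  Id_policy = c("A_001", "A_002", "A_003", "B_001", "B_002"),
  date_new  = c("20200101", "20200115", "20200304", "20200110", "20200215"),
  date_end  = c("20200503", "20200608", "20210101", "20200403", "20200503"),
  expend    = c("", "A_001", "A_002", "", ""),
  stringsAsFactors = FALSE
)

# Follow `expend` backwards until a policy with no predecessor is reached.
# Assumes every non-empty `expend` exists in Id_policy and there are no cycles.
find_root <- function(id, df) {
  prev <- df$expend[match(id, df$Id_policy)]
  while (prev != "") {
    id   <- prev
    prev <- df$expend[match(id, df$Id_policy)]
  }
  id
}

# Original policy for every row
Df$root <- vapply(Df$Id_policy, find_root, character(1),
                  df = Df, USE.NAMES = FALSE)

# One row per original policy: earliest start, latest end
final_df <- do.call(rbind, lapply(split(Df, Df$root), function(d) {
  data.frame(Id_policy = d$root[1],
             date_new  = min(d$date_new),
             date_end  = max(d$date_end),
             expend    = "",
             stringsAsFactors = FALSE)
}))
rownames(final_df) <- NULL
final_df
```

Because each row is resolved to its root independently, this version does not depend on the length of the chain or on the order of the rows.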
