Reading a JSON file into a data.frame without nested lists


I would like to load a JSON file into a data.frame in R. I have tried the fromJSON function from the jsonlite package, but I get nested lists and do not know how to flatten the input into a two-dimensional data.frame. jsonlite reads the file in as a data.frame, but some of the variables still contain nested lists.

Does anyone have any tips on how to load a JSON file that reads in as nested lists into a data.frame?

#*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*# HERE IS MY EXAMPLE #*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*#
# loads the packages
library("httr")
library( "jsonlite")

# downloads an example file
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE ) 

# the flatten function breaks the name variable into three vars ( first name, middle name, last name)
providers <- flatten( providers )

# but many of the columns are still lists:
sapply( providers , class)

# Some of these lists have a single level
head( providers$facility_type )

# Some have a lot more - for example, nine
providers[ , 6][[1]]

I would like one row for each npi, and then separate columns for each slice of the lists - so that the data frame has the "plan_id_type", "plan_id" and "network_tier" columns nine times each, perhaps with column names numbered 0 through 8. I have been able to convert this file into a two-dimensional file using this site: http://www.convertcsv.com/json-to-csv.htm, but since I am working with hundreds of files I would like to be able to do it dynamically. Here is the file: http://s000.tinyupload.com/download.php?file_id=10808537503095762868&t=1080853750309576286812811 - I would like to load files with this structure into a data frame using the fromJSON function.
Here are some of the things I have tried. I considered two approaches. First: read the JSON file with a different function. I looked at rjson, but that reads the data in as a list:
library( rjson )
library( RCurl )  # getURL() comes from RCurl
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
class( providers )

I have also tried RJSONIO - I followed the suggestions from Convert imported json data into a data frame in R:

library( RJSONIO )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )

json_file <- lapply(providers, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

# but when converting the lists to a data.frame I get an error
a <- do.call("rbind", json_file)

So the second approach I tried was to convert all of the lists into variables in my data.frame:
detach("package:RJSONIO", unload = TRUE )
detach("package:rjson", unload = TRUE )

library( "jsonlite")
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE ) 
providers <- flatten( providers )

I can pull out one of the lists, but because of missing values I cannot merge it back into my data.frame:

a <- data.frame(Reduce(rbind,  providers$facility_type))
length( a ) == nrow( providers )

I have also tried these suggestions: Converting a nested list to a dataframe, along with a few other things, but without success:

a <- sapply( providers$facility_type, unlist )
as.data.frame(t(sapply( providers$providers, unlist )) )

Any help would be appreciated.
4 Answers


Update: 21 February 2016

col_fixer has been updated to include a vec2col argument that can flatten a vector list column either into a single string or into a set of columns.


In your downloaded data.frame, I can see several different column types. There are normal columns consisting of vectors of the same type. There are list columns in which the items may be NULL or may themselves be flat vectors. There are list columns that have data.frames as their list elements. And there are list columns that contain a data.frame with the same number of rows as the main data.frame.

Here is a sample dataset that reproduces those conditions:

mydf <- data.frame(id = 1:3, type = c("A", "A", "B"), 
                   facility = I(list(c("x", "y"), NULL, "x")),
  address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3), 
                   data.frame(v1 = 1:2, v2 = 3:4, v3 = 5), 
                   data.frame(v1 = 1, v2 = NA, v3 = 3))))

mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
                          preference = c(TRUE, FALSE, TRUE))

The str of this sample data.frame looks like this:

str(mydf)
## 'data.frame':    3 obs. of  5 variables:
##  $ id      : int  1 2 3
##  $ type    : Factor w/ 2 levels "A","B": 1 1 2
##  $ facility:List of 3
##   ..$ : chr  "x" "y"
##   ..$ : NULL
##   ..$ : chr "x"
##   ..- attr(*, "class")= chr "AsIs"
##  $ address :List of 3
##   ..$ :'data.frame': 1 obs. of  3 variables:
##   .. ..$ v1: num 1
##   .. ..$ v2: num 2
##   .. ..$ v4: num 3
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ v1: int  1 2
##   .. ..$ v2: int  3 4
##   .. ..$ v3: num  5 5
##   ..$ :'data.frame': 1 obs. of  3 variables:
##   .. ..$ v1: num 1
##   .. ..$ v2: logi NA
##   .. ..$ v3: num 3
##   ..- attr(*, "class")= chr "AsIs"
##  $ person  :'data.frame':    3 obs. of  3 variables:
##   ..$ name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
##   ..$ age       : num  20 32 23
##   ..$ preference: logi  TRUE FALSE TRUE
## NULL

One way to "flatten" this (that is, to squash the nested tables) is to "fix" the list columns. There are three fixes:

  1. flatten (from "jsonlite") will take care of columns like the "person" column.
  2. Columns like the "facility" column can be fixed using toString, which converts each element into a comma-separated item, or they can be expanded into multiple columns (see the short demo after this list).
  3. For the columns where there are data.frames, some with multiple rows, they first need to be reshaped into a single wide row, and then bound together into a single data.table. (I am using "data.table" for the reshaping and the row binding.)
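
For instance, the second fix on its own boils down to collapsing each list element with toString (a minimal sketch using the sample mydf from above; toString(NULL) simply gives an empty string):

vapply(mydf$facility, toString, character(1L))
## [1] "x, y" ""     "x"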

We can use the following function to take care of the second and third points:

col_fixer <- function(x, vec2col = FALSE) {
  if (!is.list(x[[1]])) {
    if (isTRUE(vec2col)) {
      as.data.table(data.table::transpose(x))
    } else {
      vapply(x, toString, character(1L))
    }
  } else {
    temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
    temp[, .time := sequence(.N), by = .id]
    value_vars <- setdiff(names(temp), c(".id", ".time"))
    dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
  }
}
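
As a quick check of the third fix in isolation (a minimal sketch, assuming "data.table" is attached), col_fixer can be run directly on the address column of mydf; the rbindlist/dcast branch reshapes each per-row data.frame into a single wide row:

library(data.table)

## columns come out as v1_1, v1_2, v2_1, ... -- the same values that reappear
## prefixed with "address." in the Flattener() output further down
fixed_address <- col_fixer(mydf$address)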

We would integrate that function and the flatten function into another function that does most of the processing:

Flattener <- function(indf, vec2col = FALSE) {
  require(data.table)
  require(jsonlite)
  indf <- flatten(indf)
  listcolumns <- sapply(indf, is.list)
  newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
  indf[listcolumns] <- list(NULL)
  cbind(indf, newcols)
}

Running the function gives us the following:

Flattener(mydf)
##   id type person.name person.age person.preference facility address.v1_1
## 1  1    A          AA         20              TRUE     x, y            1
## 2  2    A          BB         32             FALSE                     1
## 3  3    B          CC         23              TRUE        x            1
##   address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
## 1           NA            2           NA            3           NA           NA
## 2            2            3            4           NA           NA            5
## 3           NA           NA           NA           NA           NA            3
##   address.v3_2
## 1           NA
## 2            5
## 3           NA

Or, with the vectors spread into separate columns:

Flattener(mydf, TRUE)
##   id type person.name person.age person.preference facility.V1 facility.V2
## 1  1    A          AA         20              TRUE           x           y
## 2  2    A          BB         32             FALSE        <NA>        <NA>
## 3  3    B          CC         23              TRUE           x        <NA>
##   address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
## 1            1           NA            2           NA            3           NA
## 2            1            2            3            4           NA           NA
## 3            1           NA           NA           NA           NA           NA
##   address.v3_1 address.v3_2
## 1           NA           NA
## 2            5            5
## 3            3           NA

Here is the str:

str(Flattener(mydf))
## 'data.frame':    3 obs. of  14 variables:
##  $ id               : int  1 2 3
##  $ type             : Factor w/ 2 levels "A","B": 1 1 2
##  $ person.name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
##  $ person.age       : num  20 32 23
##  $ person.preference: logi  TRUE FALSE TRUE
##  $ facility         : chr  "x, y" "" "x"
##  $ address.v1_1     : num  1 1 1
##  $ address.v1_2     : num  NA 2 NA
##  $ address.v2_1     : num  2 3 NA
##  $ address.v2_2     : num  NA 4 NA
##  $ address.v4_1     : num  3 NA NA
##  $ address.v4_2     : num  NA NA NA
##  $ address.v3_1     : num  NA 5 3
##  $ address.v3_2     : num  NA 5 NA
## NULL

On your "providers" object, this runs very quickly and consistently:

library(microbenchmark)
out <- microbenchmark(Flattener(providers), Flattener(providers, TRUE), flattenList(jsonRList))
out
# Unit: milliseconds
#                        expr        min         lq      mean    median        uq       max neval
#        Flattener(providers)  104.18939  126.59295  157.3744  138.4185  174.5222  308.5218   100
#  Flattener(providers, TRUE)   67.56471   86.37789  109.8921   96.3534  121.4443  301.4856   100
#      flattenList(jsonRList) 1780.44981 2065.50533 2485.1924 2269.4496 2694.1487 4397.4793   100

library(ggplot2)
qplot(y = time, data = out, colour = expr) ## Via @TylerRinker



I cannot find a function called jsonRList, nor any documentation for it online. - Gabriel Fair


My first step was to load the data via RCurl::getURL() and rjson::fromJSON(), as in your second code sample:

##--------------------------------------
## libraries
##--------------------------------------
library(rjson);
library(RCurl);

##--------------------------------------
## get data
##--------------------------------------
URL <- 'https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json';
jsonRList <- fromJSON(getURL(URL)); ## recursive list representing the original JSON data

Next, to gain a deep understanding of the structure and cleanliness of the data, I wrote a set of helper functions:

##--------------------------------------
## helper functions
##--------------------------------------
## apply a function to a set of nodes at the same depth level in a recursive list structure
levelApply <- function(
    nodes, ## the root node of the list (recursive calls pass deeper nodes as they drill down into the list)
    keyList, ## another list, expected to hold a sequence of keys (component names, integer indexes, or NULL for all) specifying which nodes to select at each depth level
    func=identity, ## a function to run separately on each node once keyList has been exhausted
    ..., ## further arguments passed to func()
    joinFunc=NULL ## optional function for joining the return values of func() at each successive depth, as the stack is unwound. An alternative is calling unlist() on the result, but careful not to lose the top-level index association
) {
    if (length(keyList) == 0L) {
        ret <- if (is.null(nodes)) NULL else func(nodes,...)
    } else if (is.null(keyList[[1L]]) || length(keyList[[1L]]) != 1L) {
        ret <- lapply(if (is.null(keyList[[1L]])) nodes else nodes[keyList[[1L]]],levelApply,keyList[-1L],func,...,joinFunc=joinFunc);
        if (!is.null(joinFunc))
            ret <- do.call(joinFunc,ret);
    } else {
        ret <- levelApply(nodes[[keyList[[1L]]]],keyList[-1L],func,...,joinFunc=joinFunc);
    }; ## end if
    ret;
}; ## end levelApply()
## these two wrappers automatically attempt to simplify the results of func() to a vector or matrix/data.frame, respectively
levelApplyToVec <- function(...) levelApply(...,joinFunc=c);
levelApplyToFrame <- function(...) levelApply(...,joinFunc=rbind); ## can return matrix or data.frame, depending on ret

The key to understanding the above is the keyList parameter. Suppose you have a list like this:
list(NULL,'addresses',2:3,'city')

That would select all the city strings under the second and third address elements under the addresses list under all elements of the main list. There are no built-in apply functions in R that can operate on such "parallel" node selections (rapply() is close, but no cigar), which is why I wrote my own. levelApply() finds each of the matching nodes and runs the given func() on it (default identity(), thus returning the node itself), returning the results to the caller, either joined with joinFunc(), or in the same recursive list structure in which those nodes existed in the input list. Quick demo:
unname(levelApplyToVec(jsonRList,list(4L,'addresses',1:2,c('address','city'))));
## [1] "1001 Noble St"  "Fairbanks"      "1650 Cowles St" "Fairbanks"

Here are the remaining helper functions I wrote in the course of working on this problem:
## for the given node selection key union, retrieve a data.frame of logicals representing the unique combinations of keys possessed by the selected nodes, possibly with a count
keyCombos <- function(node,keyList,allKeys) `rownames<-`(setNames(unique(as.data.frame(levelApplyToFrame(node,keyList,function(h) allKeys%in%names(h)))),allKeys),NULL);
keyCombosWithCount <- function(node,keyList,allKeys) { ks <- keyCombos(node,keyList,allKeys); ks$.count <- unname(apply(ks,1,function(combo) sum(levelApplyToVec(node,keyList,function(h) identical(sort(names(ks)[combo]),sort(names(h))))))); ks; };

## return a simple two-component list with type (list, namedlist, or atomic vector type) and len for non-namedlist types; tlStr() returns a nice stringified form of said list
tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; };
tlStr <- function(e) { if (is.null(e)) return(NA); ret <- tl(e); if (is.null(ret$len)) ret <- ret$type else ret <- paste0(ret$type,'[',ret$len,']'); ret; };

## stringification functions for display
mkcsv <- function(v) paste0(collapse=',',v);
keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key)));

## return a data.frame giving a comma-separated list of the unique types possessed by the selected nodes; useful for learning about the structure of the data
keyTypes <- function(node,keyList,allKeys) data.frame(key=allKeys,tl=sapply(allKeys,function(key) mkcsv(unique(na.omit(levelApplyToVec(node,c(keyList,key),tlStr))))),row.names=NULL);

## useful for testing; can call npiToFrame() to show the row with a specified npi value, in a nice vertical form
rowToFrame <- function(dfrow) data.frame(column=names(dfrow),value=c(as.matrix(dfrow)));
getNPIRow <- function(df,npi) which(df$npi == npi);
npiToFrame <- function(df,npi) rowToFrame(df[getNPIRow(df,npi),]);

I tried to record the sequence of commands I ran when I first examined this data. Below are the results, showing the commands I ran, the command output, and leading comments describing what my intention was and the conclusion I drew from the output:

##--------------------------------------
## data examination
##--------------------------------------
## type of object -- plain unnamed list => array, length 3256
levelApplyToVec(jsonRList,list(),tlStr);
## [1] "list[3256]"

## unique types of main array elements => all named lists => hashes
unique(levelApplyToVec(jsonRList,list(NULL),tlStr));
## [1] "namedlist"

## get the union of keys among all hashes
allKeys <- unique(levelApplyToVec(jsonRList,list(NULL),names)); allKeys;
##  [1] "npi"             "type"            "facility_name"   "facility_type"   "addresses"       "plans"           "last_updated_on" "name"            "speciality"      "accepting"       "languages"       "gender"

## get the unique pattern of keys among all hashes, and how often each occurs => shows there are inconsistent key sets among the top-level hashes
keyCombosWithCount(jsonRList,list(NULL),allKeys);
##    npi type facility_name facility_type addresses plans last_updated_on  name speciality accepting languages gender .count
## 1 TRUE TRUE          TRUE          TRUE      TRUE  TRUE            TRUE FALSE      FALSE     FALSE     FALSE  FALSE    279
## 2 TRUE TRUE         FALSE         FALSE      TRUE  TRUE            TRUE  TRUE       TRUE      TRUE      TRUE   TRUE   2973
## 3 TRUE TRUE         FALSE         FALSE      TRUE  TRUE            TRUE  TRUE       TRUE      TRUE      TRUE  FALSE      4

## for each key, get the unique set of types it takes on among all hashes, ignoring hashes where the key is omitted => some scalar strings, some multi-string, addresses is a variable-length list, plans is length-9 list, and name is a hash
keyTypes(jsonRList,list(NULL),allKeys);
##                key                                                                                        tl
## 1              npi                                                                              character[1]
## 2             type                                                                              character[1]
## 3    facility_name                                                                              character[1]
## 4    facility_type                                                    character[1],character[2],character[3]
## 5        addresses list[1],list[2],list[3],list[6],list[5],list[7],list[4],list[8],list[9],list[13],list[12]
## 6            plans                                                                                   list[9]
## 7  last_updated_on                                                                              character[1]
## 8             name                                                                                 namedlist
## 9       speciality                                       character[1],character[2],character[3],character[4]
## 10       accepting                                                                              character[1]
## 11       languages                          character[2],character[3],character[4],character[6],character[5]
## 12          gender                                                                              character[1]

## must look deeper into addresses array, plans array, and name hash; we'll have to flatten them

## ==== addresses =====
## note: the addresses key is always present under main array elements
## unique types of address elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),tlStr));
## [1] "namedlist"

## union of keys among all address element hashes
allAddressKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),names)); allAddressKeys;
## [1] "address"   "city"      "state"     "zip"       "phone"     "address_2"

## pattern of keys among address elements => only address_2 varies, similar frequency with it as without it
keyCombosWithCount(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
##   address city state  zip phone address_2 .count
## 1    TRUE TRUE  TRUE TRUE  TRUE     FALSE   1898
## 2    TRUE TRUE  TRUE TRUE  TRUE      TRUE   2575

## for each address element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only address_2 in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
##         key           tl
## 1   address character[1]
## 2      city character[1]
## 3     state character[1]
## 4       zip character[1]
## 5     phone character[1]
## 6 address_2 character[1]

## ==== plans =====
## note: the plans key is always present under main array elements
## unique types of plan elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),tlStr));
## [1] "namedlist"

## union of keys among all plan element hashes
allPlanKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),names)); allPlanKeys;
## [1] "plan_id_type" "plan_id"      "network_tier"

## pattern of keys among plan elements => good, all plan elements have all 3 keys, perfectly consistent
keyCombosWithCount(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
##   plan_id_type plan_id network_tier .count
## 1         TRUE    TRUE         TRUE  29304

## for each plan element key, get the unique set of types it takes on among all hashes (note: no plan keys are ever omitted, so don't have to worry about that) => all scalar strings
keyTypes(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
##            key           tl
## 1 plan_id_type character[1]
## 2      plan_id character[1]
## 3 network_tier character[1]

## ==== name =====
## note: the name key is *not* always present under main array elements
## union of keys among all name hashes
allNameKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'name'),names)); allNameKeys;
## [1] "first"  "middle" "last"

## pattern of keys among name elements => sometimes middle is missing, relatively infrequently
keyCombosWithCount(jsonRList,list(NULL,'name'),allNameKeys);
##   first middle last .count
## 1  TRUE   TRUE TRUE   2679
## 2  TRUE  FALSE TRUE    298

## for each name element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only middle in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'name'),allNameKeys);
##      key           tl
## 1  first character[1]
## 2 middle character[1]
## 3   last character[1]

Here is a summary of the data:

- A top-level main list, of length 3256.
- Each element is a hash with an inconsistent key set. There are 12 keys in total across all main hashes, with 3 distinct patterns of key sets.
- 6 of the hash values are scalar strings, 3 are variable-length string vectors, addresses is a variable-length list, plans is a list always of length 9, and name is a hash.
- Each addresses list element is a hash with 5 or 6 keys mapping to scalar strings, with address_2 being the inconsistent one.
- Each plans list element is a hash with 3 keys mapping to scalar strings, with no inconsistencies.
- Each name hash has first and last, but not always middle, scalar strings.

The most important observation here is that there are no type inconsistencies between parallel nodes (aside from omissions and length differences). That means we can combine all parallel nodes into vectors with no considerations of type coercion, and we can flatten all of the data into a two-dimensional structure as long as we associate columns with deep-enough nodes, such that every column corresponds to a single scalar string node in the input list.

Below is my solution. Note that it depends on the helper functions tl(), keyListToStr(), and mkcsv() that I defined earlier.

##--------------------------------------
## solution
##--------------------------------------
## recursively traverse the list structure, building up a column at each leaf node
extractLevelColumns <- function(
    nodes, ## current level node selection
    ..., ## additional arguments to data.frame()
    keyList=list(), ## current key path under main list
    sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns
    mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars
) {
    cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList)));
    if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list
    tlList <- lapply(nodes,tl);
    typeList <- do.call(c,lapply(tlList,`[[`,'type'));
    if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList)));
    type <- typeList[1L];
    if (type == 'namedlist') { ## hash; recurse
        allKeys <- unique(do.call(c,lapply(nodes,names)));
        ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname)));
    } else if (type == 'list') { ## array; recurse
        lenList <- do.call(c,lapply(tlList,`[[`,'len'));
        maxLen <- max(lenList,na.rm=T);
        allIndexes <- seq_len(maxLen);
        ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to guard out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes
    } else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column
        lenList <- do.call(c,lapply(tlList,`[[`,'len'));
        maxLen <- max(lenList,na.rm=T);
        if (is.null(sep)) {
            ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen)));
        } else {
            ## keep original type if maxLen is 1, IOW don't stringify
            ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen)));
        }; ## end if
    } else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList)));
    if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists
    ret;
}; ## end extractLevelColumns()

## simple interface function
flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...));
The extractLevelColumns() function traverses the input list and extracts all node values at each leaf node position, combining them into a vector with NA for missing values, and then converting that to a one-column data.frame. The column name is set immediately, relying on the parameterized mkname() function to define the stringification of the keyList into a string column name. Multiple columns are returned as a list of data.frames from each recursive call, and likewise from the top-level call.

It also validates that there are no type inconsistencies between parallel nodes. Although I manually verified the consistency of the data earlier, I tried to write as generic and reusable a solution as possible, because that is always a good idea, so this validation step is appropriate. flattenList() is the main interface function; it simply calls extractLevelColumns() and then do.call(cbind,...) to combine the columns into a single data.frame.

An advantage of this solution is that it is completely generic; it can handle an unlimited number of depth levels, by virtue of being fully recursive. Additionally, it has no package dependencies, it parameterizes the column-name building logic, and it forwards its variadic arguments to data.frame(), so you can, for example, pass stringsAsFactors=F to inhibit the automatic factorization of character columns normally done by data.frame(), and/or row.names={namevector} to set the row names of the resulting data.frame, or row.names=NULL to prevent the use of top-level list component names as row names, should any exist in the input list.

I also added a sep parameter which defaults to NULL. If it is NULL, multi-element leaf nodes are separated into multiple columns, one per element, with an index suffix appended to the column name for differentiation. Otherwise, it is taken as a string separator on which to join all elements into a single string, and only one column is generated for that node.
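
For example (a sketch of the pass-through behaviour described above, added here for illustration rather than taken from the original run), both options can be supplied in one call:

## join multi-element leaf nodes on a comma and keep character columns unfactored
df2 <- flattenList(jsonRList,sep=',',stringsAsFactors=F);
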
Performance-wise, it is very fast. Here is a demo:
## actually run it
system.time({ df <- flattenList(jsonRList); });
## extractLevelColumns(): /
## extractLevelColumns(): /npi
## extractLevelColumns(): /type
## extractLevelColumns(): /facility_name
## extractLevelColumns(): /facility_type
## extractLevelColumns(): /addresses
## extractLevelColumns(): /addresses/1
## extractLevelColumns(): /addresses/1/address
## extractLevelColumns(): /addresses/1/city
##
## ... snip ...
##
## extractLevelColumns(): /plans/9/network_tier
## extractLevelColumns(): /last_updated_on
## extractLevelColumns(): /name
## extractLevelColumns(): /name/first
## extractLevelColumns(): /name/middle
## extractLevelColumns(): /name/last
## extractLevelColumns(): /speciality
## extractLevelColumns(): /accepting
## extractLevelColumns(): /languages
## extractLevelColumns(): /gender
##    user  system elapsed
##   2.265   0.000   2.268

The result:

class(df); dim(df); names(df);
## [1] "data.frame"
## [1] 3256  126
##   [1] "npi"                    "type"                   "facility_name"          "facility_type.1"        "facility_type.2"        "facility_type.3"        "addresses.1.address"    "addresses.1.city"       "addresses.1.state"
##  [10] "addresses.1.zip"        "addresses.1.phone"      "addresses.1.address_2"  "addresses.2.address"    "addresses.2.city"       "addresses.2.state"      "addresses.2.zip"        "addresses.2.phone"      "addresses.2.address_2"
##  [19] "addresses.3.address"    "addresses.3.city"       "addresses.3.state"      "addresses.3.zip"        "addresses.3.phone"      "addresses.3.address_2"  "addresses.4.address"    "addresses.4.city"       "addresses.4.state"
##  [28] "addresses.4.zip"        "addresses.4.phone"      "addresses.4.address_2"  "addresses.5.address"    "addresses.5.address_2"  "addresses.5.city"       "addresses.5.state"      "addresses.5.zip"        "addresses.5.phone"
##  [37] "addresses.6.address"    "addresses.6.address_2"  "addresses.6.city"       "addresses.6.state"      "addresses.6.zip"        "addresses.6.phone"      "addresses.7.address"    "addresses.7.address_2"  "addresses.7.city"
##  [46] "addresses.7.state"      "addresses.7.zip"        "addresses.7.phone"      "addresses.8.address"    "addresses.8.address_2"  "addresses.8.city"       "addresses.8.state"      "addresses.8.zip"        "addresses.8.phone"
##  [55] "addresses.9.address"    "addresses.9.address_2"  "addresses.9.city"       "addresses.9.state"      "addresses.9.zip"        "addresses.9.phone"      "addresses.10.address"   "addresses.10.address_2" "addresses.10.city"
##  [64] "addresses.10.state"     "addresses.10.zip"       "addresses.10.phone"     "addresses.11.address"   "addresses.11.address_2" "addresses.11.city"      "addresses.11.state"     "addresses.11.zip"       "addresses.11.phone"
##  [73] "addresses.12.address"   "addresses.12.address_2" "addresses.12.city"      "addresses.12.state"     "addresses.12.zip"       "addresses.12.phone"     "addresses.13.address"   "addresses.13.city"      "addresses.13.state"
##  [82] "addresses.13.zip"       "addresses.13.phone"     "plans.1.plan_id_type"   "plans.1.plan_id"        "plans.1.network_tier"   "plans.2.plan_id_type"   "plans.2.plan_id"        "plans.2.network_tier"   "plans.3.plan_id_type"
##  [91] "plans.3.plan_id"        "plans.3.network_tier"   "plans.4.plan_id_type"   "plans.4.plan_id"        "plans.4.network_tier"   "plans.5.plan_id_type"   "plans.5.plan_id"        "plans.5.network_tier"   "plans.6.plan_id_type"
## [100] "plans.6.plan_id"        "plans.6.network_tier"   "plans.7.plan_id_type"   "plans.7.plan_id"        "plans.7.network_tier"   "plans.8.plan_id_type"   "plans.8.plan_id"        "plans.8.network_tier"   "plans.9.plan_id_type"
## [109] "plans.9.plan_id"        "plans.9.network_tier"   "last_updated_on"        "name.first"             "name.middle"            "name.last"              "speciality.1"           "speciality.2"           "speciality.3"
## [118] "speciality.4"           "accepting"              "languages.1"            "languages.2"            "languages.3"            "languages.4"            "languages.5"            "languages.6"            "gender"

The resulting data.frame is quite wide, but we can use rowToFrame() and npiToFrame() to get a nice vertical layout, one row at a time. For example, here is the first row:

rowToFrame(df[1L,]);
##                     column           value
## 1                      npi      1063645026
## 2                     type        FACILITY
## 3            facility_name EXPRESS SCRIPTS
## 4          facility_type.1      Pharmacies
## 5          facility_type.2            <NA>
## 6          facility_type.3            <NA>
## 7      addresses.1.address    4750 E 450 S
## 8         addresses.1.city      WHITESTOWN
## 9        addresses.1.state              IN
## 10         addresses.1.zip           46075
## 11       addresses.1.phone      2012695236
## 12   addresses.1.address_2            <NA>
## 13     addresses.2.address            <NA>
## 14        addresses.2.city            <NA>
## 15       addresses.2.state            <NA>
## 16         addresses.2.zip            <NA>
## 17       addresses.2.phone            <NA>
## 18   addresses.2.address_2            <NA>
## 19     addresses.3.address            <NA>
## 20        addresses.3.city            <NA>
## 21       addresses.3.state            <NA>
##
## ... snip ...
##
## 77        addresses.12.zip            <NA>
## 78      addresses.12.phone            <NA>
## 79    addresses.13.address            <NA>
## 80       addresses.13.city            <NA>
## 81      addresses.13.state            <NA>
## 82        addresses.13.zip            <NA>
## 83      addresses.13.phone            <NA>
## 84    plans.1.plan_id_type    HIOS-PLAN-ID
## 85         plans.1.plan_id  38344AK0620003
## 86    plans.1.network_tier   HERITAGE-PLUS
## 87    plans.2.plan_id_type    HIOS-PLAN-ID
## 88         plans.2.plan_id  38344AK0620004
## 89    plans.2.network_tier   HERITAGE-PLUS
## 90    plans.3.plan_id_type    HIOS-PLAN-ID
## 91         plans.3.plan_id  38344AK0620006
## 92    plans.3.network_tier   HERITAGE-PLUS
## 93    plans.4.plan_id_type    HIOS-PLAN-ID
## 94         plans.4.plan_id  38344AK0620008
## 95    plans.4.network_tier   HERITAGE-PLUS
## 96    plans.5.plan_id_type    HIOS-PLAN-ID
## 97         plans.5.plan_id  38344AK0570001
## 98    plans.5.network_tier   HERITAGE-PLUS
## 99    plans.6.plan_id_type    HIOS-PLAN-ID
## 100        plans.6.plan_id  38344AK0570002
## 101   plans.6.network_tier   HERITAGE-PLUS
## 102   plans.7.plan_id_type    HIOS-PLAN-ID
## 103        plans.7.plan_id  38344AK0980003
## 104   plans.7.network_tier   HERITAGE-PLUS
## 105   plans.8.plan_id_type    HIOS-PLAN-ID
## 106        plans.8.plan_id  38344AK0980006
## 107   plans.8.network_tier   HERITAGE-PLUS
## 108   plans.9.plan_id_type    HIOS-PLAN-ID
## 109        plans.9.plan_id  38344AK0980012
## 110   plans.9.network_tier   HERITAGE-PLUS
## 111        last_updated_on      2015-10-14
## 112             name.first            <NA>
## 113            name.middle            <NA>
## 114              name.last            <NA>
## 115           speciality.1            <NA>
## 116           speciality.2            <NA>
## 117           speciality.3            <NA>
## 118           speciality.4            <NA>
## 119              accepting            <NA>
## 120            languages.1            <NA>
## 121            languages.2            <NA>
## 122            languages.3            <NA>
## 123            languages.4            <NA>
## 124            languages.5            <NA>
## 125            languages.6            <NA>
## 126                 gender            <NA>

I have done a fair amount of testing of the result with spot-checks on individual records, and it all looks correct. Let me know if you have any questions.


Well done - from about 1 minute down to 2 seconds. +1. Honestly, though, I still find this answer hard to follow and can't quite see what is going on... - A5C1D2H2I1M1N2O1R2T1
This is awesome! You deserve more upvotes (and should get the bounty)! - dreamer
@bgoldst My problem is very similar. If you could spare some time, I would be grateful if you could take a look at how I have asked it here: https://stackoverflow.com/questions/63138416/saving-tidypmc-output-which-forms-a-list-object-and-saving-it-into-individual-fi - kcm

This answer is more of a data-organization suggestion (and much shorter than the other answers chasing the bounty ;). If you want to keep the semantics of the fields, for example keep all the plan_ids in a single column, you can normalize the data design and join the tables whenever you need the information:
library(dplyr)

# notice the simplifyVector=F
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json", simplifyVector=F) 

# pick and repeat fields for each element of array
# {field1:val, field2:val2, array:[{af1:av1, af2:av2}, {af1:av3, af2:av4}]}
# gives data.frame 
# field1, field2 array.af1 array.af2
# val     val2  av1        av2
# val     val2  av3        av4
denormalize <- function(data, fields, array) {
  data.frame(
    c(
      data[fields], 
      as.list(
        bind_rows(
          lapply(data[[array]], data.frame)))))
}

plans_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'plans'))
addresses_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'addresses'))
npis <- bind_rows(lapply(providers, function(d, fields) data.frame(d[fields]), 
                         c('npi', 'type', 'last_updated_on')))

You can then filter the data first and join in the other information:
addresses_df %>%
  filter(city == "Healy") %>%
  left_join(plans_df, by="npi") ->
  plans_in_healy
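
If you still want the wide, one-row-per-npi layout from the question, you can reshape plans_df afterwards. A sketch (my addition rather than part of the original answer, assuming the tidyr package is available and that plans_df has the npi, plan_id_type, plan_id and network_tier columns built above):

library(tidyr)

plans_wide <- plans_df %>%
  group_by(npi) %>%
  mutate(slot = row_number()) %>%   # number each npi's plans 1..9
  ungroup() %>%
  pivot_wider(id_cols = npi, names_from = slot,
              values_from = c(plan_id_type, plan_id, network_tier))
# gives columns npi, plan_id_type_1 ... plan_id_type_9, plan_id_1 ..., network_tier_1 ...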


While this is not exactly a solution, since it does not answer the question directly, here is how I would go about analyzing this data.

First, I had to understand your dataset. It appears to be information about health-care providers.

 providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=FALSE ) 
 types = sapply(providers,"[[","type")
 table(types)

 # FACILITY INDIVIDUAL 
 #    279       2977 
  • FACILITY entries have the "ID" fields facility_name and facility_type
  • INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages and gender
  • All entries have the "ID" fields npi and last_updated_on
  • All entries have two nested fields: addresses and plans. For example, addresses is a list containing the city, state, and so on.

Since there are multiple addresses per npi, I would like to convert them into a data frame with columns for city, state, etc. I would also build a similar data frame for plans. Then I would join addresses and plans into a single data frame, so that if there were 4 addresses and 8 plans, the joined data frame would have 4*8 = 32 rows. Finally, I would tack on a similarly denormalized data frame holding the "ID" information with another join.

library(dplyr)
unfurl_npi_data = function (x) {
  repeat_cols = c("plans","addresses")
  id_cols = setdiff(names(x),repeat_cols)
  repeat_data = x[repeat_cols]
  id_data  = x[id_cols]

  # Denormalized ID data
  id_data_df = Reduce(function(x,y) merge(x,y,by=NULL), id_data, "")[,-1]
  atomic_colnames = names(which(!sapply(id_data, is.list)))
  df_atomic_cols = unlist(sapply(id_data,function(x) if(is.list(x)) rep(FALSE, length(x)) else TRUE))
  colnames(id_data_df)[df_atomic_cols] = atomic_colnames

  # Join the plans and addresses (denormalized)
  repeated_data = lapply(repeat_data, rbind_all)
  repeated_data_crossed = Reduce(merge, repeated_data, repeated_data[[1]])

  merge(id_data_df, repeated_data_crossed)
}

providers2 = split(providers, types)
providers3 = lapply(providers2, function(x) rbind_all(lapply(x, unfurl_npi_data)))

Followed by some cleanup:

unique_df = function(x) {
  chr_col_names = names(which(sapply(x, class) == "character"))
  for( col in chr_col_names )
    x[[col]] = toupper(x[[col]])
  unique(x)
}
providers3 = lapply(providers3, unique_df)
facilities = providers3[["FACILITY"]]
individuals = providers3[["INDIVIDUAL"]]
rm(providers, providers2, providers3)

Now you can ask interesting questions. For example, how many addresses does each health-care provider have?

 unique_providers = individuals %>% select(first, middle, last, gender, state, city, address) %>% unique()
 num_addresses = unique_providers %>% count(first, middle, last, gender)
 table(num_addresses$n)

 #    1    2    3    4    5    6    7    8    9   12   13 
 # 2258  492  119   33   43   21    6    1    2    1    1 

At addresses with more than five providers, what percentage of the health-care providers are male?

address_pcts = unique_providers %>% 
  group_by(address, city, state) %>%
  filter(n()>5) %>%
  arrange(address) %>%
  summarise(pct_male = sum(gender=="MALE")/n())
library(ggplot2)
qplot(address_pcts$pct_male, binwidth=1/7) + xlim(0,1)


And so on, and so forth...

