Extract data from XML and pass it to a data.frame (using NA for missing values)

Question


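I have an XML file that I want to extract data from. So far I have managed everything with the tidyverse and xml2 packages, but I cannot figure out the next step of my XML problem.

Sample XML: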

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:ArchiveView>
    <Notification ID="1001">
        <persons>
            <Timestamp>07:39:25</Timestamp>
            <person type="A" name="Barney">
                <uniqueUserId>2222</uniqueUserId>
            </person>
        </persons>
        <persons>
            <Timestamp>08:40:25</Timestamp>
            <person type="B" name="John">
                <uniqueUserId>1111</uniqueUserId>
            </person>
        </persons>
    </Notification>
    <Notification ID="1002">
        <persons>
            <Timestamp>14:39:25</Timestamp>
            <person type="A" name="Barney">
                <uniqueUserId>2222</uniqueUserId>
            </person>
        </persons>
    </Notification>
    <Notification ID="1003">
    </Notification>
</ns2:ArchiveView>

Since at most 3 persons can be assigned to a notification, I would like to end up with a data frame like this:

ID    name1    time1     type1    name2    time2     type2    name3    time3     type3
1001  Barney   07:39:25  A        John     08:40:25  B        NA       NA        NA
1002  Barney   14:39:25  A        NA       NA        NA       NA       NA        NA
1003  NA       NA        NA       NA       NA        NA       NA       NA        NA

This is what I have so far:

doc <- read_xml( "./data/test.xml" )

Extract all IDs (the element in the sample is named Notification, singular)

df.ID <- data.frame( 
           ID = xml_find_all( doc, ".//Notification" ) %>% xml_attr("ID") , 
           stringsAsFactors = FALSE )

Identify the IDs of the notifications that have persons attached

ID.with.persons <- xml_find_all( doc, ".//Notification[ persons ]" ) %>% 
                   xml_attr("ID")

Create a nodeset of the notifications that have persons attached.

nodes.persons <- xml_find_all( doc, ".//Notification[ persons ]" )

I also managed to get the names of all persons (in a single vector).

persons.name <- nodes.persons %>% xml_find_all(".//person") %>% xml_attr("name")

I feel I am close to a solution, but I cannot work out how to combine all of this into a nice data frame (as shown above). All suggestions are welcome :)

- Wimpel

This is not valid XML, because the namespace prefix ns2 is never declared. Please post the complete root element or an actual XML sample. - Parfait

2 Answers
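
To illustrate the comment above, here is a minimal sketch that assumes the root element declares the `ns2` prefix. The namespace URI `http://example.com/ns2` is a made-up placeholder and not taken from the original file, and the XML is trimmed to two empty notifications. Because the child elements carry no prefix, the XPath `.//Notification` should need no namespace handling.

```r
library(xml2)

# Hypothetical, trimmed version of the sample with the ns2 prefix declared.
# The URI is a placeholder; the real document would supply its own.
xml_txt <- '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:ArchiveView xmlns:ns2="http://example.com/ns2">
    <Notification ID="1001"/>
    <Notification ID="1003"/>
</ns2:ArchiveView>'

doc <- read_xml(xml_txt)

# The unprefixed children are in no namespace, so a plain XPath matches them.
ids <- xml_attr(xml_find_all(doc, ".//Notification"), "ID")
print(ids)
```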

Here is a solution. It takes more manual coding than I expected, but it does show the technique:

library(xml2)
library(magrittr)   #provides the %>% pipe used below
doc<-read_xml("*Your xml Document goes here*")

#find the Notification nodes
Notices<-xml_find_all( doc, ".//Notification" )

#find all of the timestamps in each Notification
#lapply always returns a list, even if every Notification has the same number of entries
timestamps<-lapply(Notices, function(x){xml_text(xml_find_all(x, ".//Timestamp"))})

#extract the three timestamps in each Notification (missing ones return NA)
#sapply returns a column, need to transpose to create the row in the data frame
time.df<-data.frame(t(sapply(timestamps, function(x){c(x[1], x[2], x[3])})))
#rename the column names
names(time.df)<-paste0("time", 1:3)

#repeat for the person's name and type
persons.name <-lapply(Notices, function(x){x %>% xml_find_all(  ".//person" ) %>% xml_attr("name")})
name.df<-data.frame(t(sapply(persons.name, function(x){c(x[1], x[2], x[3])})))
names(name.df)<-paste0("name", 1:3)

persons.type <-lapply(Notices, function(x){x %>% xml_find_all(  ".//person" ) %>% xml_attr("type")})
type.df<-data.frame(t(sapply(persons.type, function(x){c(x[1], x[2], x[3])})))
names(type.df)<-paste0("type", 1:3)

#assemble the final answer and rearrange the column order
answer<-cbind(name.df, time.df, type.df)
answer<-answer[,c(1, 4, 7, 2, 5, 8, 3, 6, 9)]

df.ID <- data.frame(ID = xml_find_all( doc, ".//Notification" ) %>%  
        xml_attr("ID"), stringsAsFactors = FALSE)
answer<-cbind(df.ID, answer)

The comments in the code explain the steps the solution takes. I am sure there is room for optimization, but this is a good start.
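
The room for optimization mentioned above might look like the following. This is only a sketch: `wide_block` is a hypothetical helper name, and the inline XML is a trimmed copy of the sample with an unprefixed root so that the snippet parses on its own. It relies on the same trick as the code above, namely that indexing a vector past its end yields NA.

```r
library(xml2)

# Trimmed copy of the sample; the root is unprefixed here (an assumption).
xml_txt <- '<ArchiveView>
  <Notification ID="1001">
    <persons><Timestamp>07:39:25</Timestamp><person type="A" name="Barney"/></persons>
    <persons><Timestamp>08:40:25</Timestamp><person type="B" name="John"/></persons>
  </Notification>
  <Notification ID="1002">
    <persons><Timestamp>14:39:25</Timestamp><person type="A" name="Barney"/></persons>
  </Notification>
  <Notification ID="1003"/>
</ArchiveView>'
doc <- read_xml(xml_txt)
notices <- xml_find_all(doc, ".//Notification")

# Hypothetical helper: run `extract` on every notification, pad each result
# to `n` entries (indexing past the end gives NA), and name the columns.
wide_block <- function(nodes, extract, prefix, n = 3) {
  rows <- lapply(nodes, function(node) extract(node)[seq_len(n)])
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- paste0(prefix, seq_len(n))
  out
}

name.df <- wide_block(notices, function(node) xml_attr(xml_find_all(node, ".//person"), "name"), "name")
time.df <- wide_block(notices, function(node) xml_text(xml_find_all(node, ".//Timestamp")), "time")
type.df <- wide_block(notices, function(node) xml_attr(xml_find_all(node, ".//person"), "type"), "type")

answer <- data.frame(ID = xml_attr(notices, "ID"), name.df, time.df, type.df,
                     stringsAsFactors = FALSE)
# Interleave the columns: ID, name1, time1, type1, name2, ...
col_order <- as.vector(rbind(names(name.df), names(time.df), names(type.df)))
answer <- answer[, c("ID", col_order)]
print(answer)
```

Selecting the columns by name instead of by position avoids the hand-written index vector, so the reordering keeps working if `n` changes.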

- Dave2e


- miken32 · Accepted Answer

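This is a very pragmatic approach (I am not very familiar with R, so it may not be idiomatic). Just loop over each element and append the values you need to a vector. At the end, convert the vector to a matrix and put that into a data frame. This only works because the matrix is built with a fixed number of columns.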

library(xml2)
doc <- read_xml("test.xml")
# flat vector collecting all values, 10 per notification
rows <- c()
notifications <- xml_find_all(doc, ".//Notification")
for (i in seq_along(notifications)) {
    rows <- c(rows, xml_attr(notifications[i], "ID"))
    for (j in 1:3) {
        # j-th <persons> child; an empty nodeset if there are fewer than j
        person <- xml_find_all(notifications[i], sprintf("persons[%d]", j))
        if (length(person) > 0) {
            rows <- c(rows, xml_find_chr(person, "string(./person/@name)"))
            rows <- c(rows, xml_find_chr(person, "string(./Timestamp/text())"))
            rows <- c(rows, xml_find_chr(person, "string(./person/@type)"))
        } else {
            rows <- c(rows, NA, NA, NA)
        }
    }
}
# one row per notification: ID plus 3 x (name, time, type)
df <- data.frame(matrix(data=rows, ncol=10, byrow=TRUE))
colnames(df) <- c("ID", "name1", "time1", "type1", "name2", "time2", "type2", "name3", "time3", "type3")
df

Output:

    ID  name1    time1 type1 name2    time2 type2 name3 time3 type3
1 1001 Barney 07:39:25     A  John 08:40:25     B  <NA>  <NA>  <NA>
2 1002 Barney 14:39:25     A  <NA>     <NA>  <NA>  <NA>  <NA>  <NA>
3 1003   <NA>     <NA>  <NA>  <NA>     <NA>  <NA>  <NA>  <NA>  <NA>
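
The fixed column count noted above could be relaxed by deriving the number of person slots from the data. The following is a rough sketch, not part of the original answer: `max_persons` and `one_row` are hypothetical names, and the inline XML is a trimmed copy of the sample with an unprefixed root. With this sample the widest notification has two persons, so the result has 7 columns rather than 10.

```r
library(xml2)

# Trimmed copy of the sample; the root is unprefixed here (an assumption).
xml_txt <- '<ArchiveView>
  <Notification ID="1001">
    <persons><Timestamp>07:39:25</Timestamp><person type="A" name="Barney"/></persons>
    <persons><Timestamp>08:40:25</Timestamp><person type="B" name="John"/></persons>
  </Notification>
  <Notification ID="1002">
    <persons><Timestamp>14:39:25</Timestamp><person type="A" name="Barney"/></persons>
  </Notification>
  <Notification ID="1003"/>
</ArchiveView>'
doc <- read_xml(xml_txt)
notifications <- xml_find_all(doc, ".//Notification")

# Largest number of <persons> blocks found in any notification.
max_persons <- max(vapply(notifications,
                          function(node) length(xml_find_all(node, "persons")),
                          integer(1)))

# One character vector per notification: ID, then name/time/type per slot.
one_row <- function(node) {
  persons <- xml_find_all(node, "persons")
  fields <- vapply(seq_len(max_persons), function(j) {
    if (j > length(persons)) return(rep(NA_character_, 3))
    p <- persons[[j]]
    c(xml_attr(xml_find_first(p, "person"), "name"),
      xml_text(xml_find_first(p, "Timestamp")),
      xml_attr(xml_find_first(p, "person"), "type"))
  }, character(3))
  c(xml_attr(node, "ID"), as.vector(fields))
}

# Collect the rows in a list and bind once, instead of growing a vector.
rows <- lapply(notifications, one_row)
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
slots <- seq_len(max_persons)
colnames(df) <- c("ID", as.vector(rbind(paste0("name", slots),
                                        paste0("time", slots),
                                        paste0("type", slots))))
print(df)
```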