Converting XML data to an R data frame

4
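I am trying to convert an XML file to a data frame, but the format does not come out right. I have looked at different tutorials and had some success using for loops and navigating the parsed file to pull out the information I need, but I have been told that this solution is inefficient.
I then tried the following code: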
require(XML)
parsed<-xmlParse("SEWL.xml")
xmlToDataFrame(parsed)

But it fails with an error: Error in [<-.data.frame(*tmp*, i, names(nodes[[i]]), value = c("\"LL18179\"\"2016/08\"0.32485.43896.59801.2131\"OK\"", : duplicate subscripts for columns
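One plausible reading of this error, sketched on a cut-down document: xmlToDataFrame appears to treat each child of the root as one record and to use the names of that record's children as column names, so a sample that holds two test children yields a repeated column name. The XML string and the variable names below are hypothetical stand-ins for SEWL.xml, not part of the original code.

```r
library(XML)

# Hypothetical miniature of SEWL.xml: one sample holding two test nodes
mini_xml <- '<experiment><sample id="2323"><test name="laslum"><calc>1.2131</calc></test><test name="atr"><calc>2.0886</calc></test></sample></experiment>'
mini_doc <- xmlParse(mini_xml, asText = TRUE)

# Names of the children of the first sample node
sample_node <- getNodeSet(mini_doc, "//sample")[[1]]
child_names <- unname(vapply(xmlChildren(sample_node), xmlName, character(1)))

# A non-zero value means at least one child name is repeated
has_duplicates <- anyDuplicated(child_names) > 0
```

If has_duplicates is TRUE for the real file as well, a flat record-per-row conversion is unlikely to work without first choosing which level of the tree becomes a row.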

This code runs, but the format is not what I need:

require(XML)
require(plyr)
pldf<-ldply(xmlToList("SEWL.xml"),data.frame)

The resulting data frame:
          .id              X..i.. text  .attrs test.code test.validuntil test.meas.text test.meas..attrs test.meas.text.1
1  technician              "John" <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
2    location                "CO" <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
3        temp                <NA> 21.3 celsius      <NA>            <NA>           <NA>             <NA>             <NA>
4     runtype           "routine" <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
5      sample                <NA> <NA>    2323 "LL18179"       "2016/08"         0.3248         baseline           5.4389
6      sample                <NA> <NA>    2323 "LL18179"       "2016/08"         0.3248         baseline           5.4389
7      sample                <NA> <NA> 8979237 "AA09453"       "2016/03"         0.0117         baseline           5.6012
8      sample                <NA> <NA> 8979237 "AA09453"       "2016/03"         0.0117         baseline           5.6012
9      .attrs 2015_07_31_11_33_22 <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
10     .attrs            20150731 <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
11     .attrs              113322 <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
   test.meas..attrs.1 test.meas.text.2 test.meas..attrs.2 test.calc test.result test..attrs test.code.1 test.validuntil.1
1                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
2                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
3                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
4                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
5                 std           6.5980               data    1.2131        "OK"      laslum "ATR150607"         "2017/05"
6                 std           6.5980               data    1.2131        "OK"           3 "ATR150607"         "2017/05"
7                 std           1.1431               data    0.2041      "FAIL"       absat        <NA>              <NA>
8                 std           1.1431               data    0.2041      "FAIL"           2        <NA>              <NA>
9                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
10               <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
11               <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
   test.meas.text.3 test.meas..attrs.3 test.meas.text.4 test.meas..attrs.4 test.meas.text.5 test.meas..attrs.5
1              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
2              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
3              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
4              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
5            0.0673           baseline           4.9721                std          10.3851               data
6            0.0673           baseline           4.9721                std          10.3851               data
7              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
8              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
9              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
10             <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
11             <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
   test.calc.1 test.result.1 test..attrs.1
1         <NA>          <NA>          <NA>
2         <NA>          <NA>          <NA>
3         <NA>          <NA>          <NA>
4         <NA>          <NA>          <NA>
5       2.0886     "Warning"           atr
6       2.0886     "Warning"             1
7         <NA>          <NA>          <NA>
8         <NA>          <NA>          <NA>
9         <NA>          <NA>          <NA>
10        <NA>          <NA>          <NA>
11        <NA>          <NA>          <NA>

Here is the sample XML file I am using:
<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
    <technician>"John"</technician>
    <location>"CO"</location>
    <temp scale="celsius">21.3</temp>
    <runtype>"routine"</runtype>
    <sample id="2323">
        <test name="laslum" order="3">
            <code>"LL18179"</code>
            <validuntil>"2016/08"</validuntil>
            <meas name="baseline">0.3248</meas>
            <meas name="std">5.4389</meas>
            <meas name="data">6.5980</meas>
            <calc>1.2131</calc>
            <result>"OK"</result>
        </test>
        <test name="atr" order="1">
            <code>"ATR150607"</code>
            <validuntil>"2017/05"</validuntil>
            <meas name="baseline">0.0673</meas>
            <meas name="std">4.9721</meas>
            <meas name="data">10.3851</meas>
            <calc>2.0886</calc>
            <result>"Warning"</result>
        </test>
    </sample>
    <sample id="8979237">
        <test name="absat" order="2">
            <code>"AA09453"</code>
            <validuntil>"2016/03"</validuntil>
            <meas name="baseline">0.0117</meas>
            <meas name="std">5.6012</meas>
            <meas name="data">1.1431</meas>
            <calc>0.2041</calc>
            <result>"FAIL"</result>
        </test>
    </sample>
</experiment>

The data frame I would like to get:

  experiment technician location temp runtype  sample   test order      code validuntil baseline    std    data   calc  result     date   time
1     abc123       John       CO 21.3 routine    2323 laslum     3   LL18179    2016/08   0.3248 5.4389  6.5980 1.2131      OK 20150731 113322
2     abc123       John       CO 21.3 routine    2323    atr     1 ATR150607    2017/05   0.0673 4.9721 10.3851 2.0886 Warning 20150731 113322
3     abc123       John       CO 21.3 routine 8979237  absat     2   AA09453    2016/03   0.0117 5.6012  1.1431 0.2041    FAIL 20150731 113322

I do not need exactly this format. Something reasonably close is enough, and I can reshape it into the example from there.


There is also the xml2 package, which may be worth a look. - lmo
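For readers who want to try the xml2 route mentioned in this comment, here is a minimal sketch, assuming xml2 is installed. It loops over the test nodes and looks up each node's parent sample; the XML string is a hypothetical cut-down version of the sample file and only a few columns are extracted.

```r
library(xml2)

# Hypothetical cut-down input: one sample with two tests
mini_xml <- '<experiment name="abc123"><sample id="2323"><test name="laslum" order="3"><meas name="baseline">0.3248</meas><calc>1.2131</calc></test><test name="atr" order="1"><meas name="baseline">0.0673</meas><calc>2.0886</calc></test></sample></experiment>'
doc2 <- read_xml(mini_xml)
test_nodes <- xml_find_all(doc2, "//test")

# One single-row data frame per test node, then stack the rows
rows <- lapply(test_nodes, function(test_node) {
  sample_node <- xml_parent(test_node)
  data.frame(
    sample   = xml_attr(sample_node, "id"),
    test     = xml_attr(test_node, "name"),
    baseline = xml_text(xml_find_first(test_node, "meas[@name='baseline']")),
    calc     = xml_text(xml_find_first(test_node, "calc")),
    stringsAsFactors = FALSE
  )
})
result2 <- do.call(rbind, rows)
```

The remaining columns of the desired data frame could be added in the same way with further xml_attr and xml_text calls.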
1 Answer

6
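We show two ways of parsing the XML. The first (three nested iterations over experiment/sample/test) may run faster, but the second (a single loop over the test nodes, walking up the tree from each one to reach its ancestors) has simpler code.
1) Using the Lines object defined in the Note at the end, we run three nested xpathApply/xpathSApply iterations over the experiment/sample/test nodes. We use e, s and t respectively for the current node at each level.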
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

do.call("rbind", xpathApply(doc, "//experiment", function(e) {
  data.frame(experiment = xmlAttrs(e)[["name"]],
       technician = xmlValue(e[["technician"]]),
       location = xmlValue(e[["location"]]),
       temp = xmlValue(e[["temp"]]),
       runtype = xmlValue(e[["runtype"]]),
       t(do.call(cbind, xpathApply(e, "sample", function(s) {
            sample <- xmlAttrs(s)[["id"]]
            xpathSApply(s, "test", function(t) {
                   c(sample = sample,
                        test = xmlAttrs(t)[["name"]],
                        order = xmlAttrs(t)[["order"]],
                        code = xmlValue(t[["code"]]),
                        validuntil = xmlValue(t[["validuntil"]]),
                        baseline = xmlValue(t["meas"][[1]]),
                        std = xmlValue(t["meas"][[2]]),
                        data = xmlValue(t["meas"][[3]]),
                        calc = xmlValue(t[["calc"]]),
                        result = xmlValue(t[["result"]])
             )})}))),
       date = xmlAttrs(e)[["date"]],
       time = xmlAttrs(e)[["time"]]
)}))

giving:

  experiment technician location temp   runtype  sample   test order
1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
         code validuntil baseline    std    data   calc    result     date
1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
    time
1 113322
2 113322
3 113322
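The code above picks the meas values by position, which relies on baseline, std and data always appearing in that order. A hedged alternative is to select each meas node by its name attribute. In the sketch below, meas_by_name is a hypothetical helper and the XML string is a made-up test node with the meas elements deliberately out of order.

```r
library(XML)

# Hypothetical test node with the meas elements in a different order
mini_xml <- '<test name="laslum"><meas name="std">5.4389</meas><meas name="baseline">0.3248</meas><meas name="data">6.5980</meas></test>'
mini_doc <- xmlParse(mini_xml, asText = TRUE)
test_node <- getNodeSet(mini_doc, "//test")[[1]]

# Hypothetical helper: value of the meas child whose name attribute matches
meas_by_name <- function(node, name) {
  xpathSApply(node, sprintf("meas[@name='%s']", name), xmlValue)
}

baseline <- meas_by_name(test_node, "baseline")
std      <- meas_by_name(test_node, "std")
```

Inside either solution, a call such as meas_by_name(t, "baseline") could stand in for xmlValue(t["meas"][[1]]), assuming every test node carries exactly one meas with that name.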

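2) As an alternative, we loop over the test nodes only and move up to the parent and grandparent of each one to get the corresponding sample and experiment information.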

library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

do.call("rbind", xpathApply(doc, "//test", function(t) { # t is test node
        s <- xmlParent(t) # s is sample node
        e <- xmlParent(s) # e is experiment node
        data.frame(experiment = xmlAttrs(e)[["name"]],
          technician = xmlValue(e[["technician"]]),
          location = xmlValue(e[["location"]]),
          temp = xmlValue(e[["temp"]]),
          runtype = xmlValue(e[["runtype"]]),
          sample = xmlAttrs(s)[["id"]],
          test = xmlAttrs(t)[["name"]],
          order = xmlAttrs(t)[["order"]],
          code = xmlValue(t[["code"]]),
          validuntil = xmlValue(t[["validuntil"]]),
          baseline = xmlValue(t["meas"][[1]]),
          std = xmlValue(t["meas"][[2]]),
          data = xmlValue(t["meas"][[3]]),
          calc = xmlValue(t[["calc"]]),
          result = xmlValue(t[["result"]]),
          date = xmlAttrs(e)[["date"]],
          time = xmlAttrs(e)[["time"]]
       )
}))

giving:

  experiment technician location temp   runtype  sample   test order
1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
         code validuntil baseline    std    data   calc    result     date
1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
    time
1 113322
2 113322
3 113322
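The result still carries the literal double quotes from the XML text, and every column is character (or factor, depending on the R version). A minimal post-processing sketch, assuming a result shaped like the one above: strip the quotes, then let type.convert guess numeric columns. The data frame raw below is a hypothetical two-row stand-in for the real output.

```r
# Hypothetical stand-in for the data frame produced above
raw <- data.frame(
  technician = c('"John"', '"John"'),
  temp       = c("21.3", "21.3"),
  code       = c('"LL18179"', '"ATR150607"'),
  calc       = c("1.2131", "2.0886"),
  stringsAsFactors = FALSE
)

clean <- raw
# Remove the embedded double quotes from every column
clean[] <- lapply(clean, function(col) gsub('"', "", as.character(col), fixed = TRUE))
# Convert columns that look numeric; as.is = TRUE keeps the rest as character
clean[] <- lapply(clean, type.convert, as.is = TRUE)
```

Note that type.convert would also turn identifier-like columns such as sample, date and time into integers, which may or may not be what is wanted.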

Note 1:

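Also, if you read the input XML file SEWL.xml into Excel, it puts the data into a reasonable tabular format, although further processing is needed to get exactly the form asked for in the question.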

Note 2:

The input R object Lines is:

Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
    <technician>"John"</technician>
    <location>"CO"</location>
    <temp scale="celsius">21.3</temp>
    <runtype>"routine"</runtype>
    <sample id="2323">
        <test name="laslum" order="3">
            <code>"LL18179"</code>
            <validuntil>"2016/08"</validuntil>
            <meas name="baseline">0.3248</meas>
            <meas name="std">5.4389</meas>
            <meas name="data">6.5980</meas>
            <calc>1.2131</calc>
            <result>"OK"</result>
        </test>
        <test name="atr" order="1">
            <code>"ATR150607"</code>
            <validuntil>"2017/05"</validuntil>
            <meas name="baseline">0.0673</meas>
            <meas name="std">4.9721</meas>
            <meas name="data">10.3851</meas>
            <calc>2.0886</calc>
            <result>"Warning"</result>
        </test>
    </sample>
    <sample id="8979237">
        <test name="absat" order="2">
            <code>"AA09453"</code>
            <validuntil>"2016/03"</validuntil>
            <meas name="baseline">0.0117</meas>
            <meas name="std">5.6012</meas>
            <meas name="data">1.1431</meas>
            <calc>0.2041</calc>
            <result>"FAIL"</result>
        </test>
    </sample>
</experiment>'

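This looks like the right direction. How do I replace the Lines object with a call to the actual XML file? - Variax
Remove asText=TRUE and use the file name in place of Lines. We used string input here only to keep the demo self-contained for display on SO. - G. Grothendieck
Have added a second approach. The first may be faster, but the second has simpler code. - G. Grothendieck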
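A minimal sketch of the file-based call described in the comments. To stay self-contained it writes a hypothetical one-test document to a temporary file; with the real data, the path would simply be "SEWL.xml".

```r
library(XML)

# Hypothetical stand-in for SEWL.xml, written to a temporary file
path <- tempfile(fileext = ".xml")
writeLines('<experiment name="abc123"><sample id="2323"><test name="laslum" order="3"/></sample></experiment>', path)

# Same call as in the answer, minus asText = TRUE, with a file name instead of Lines
doc <- xmlTreeParse(path, useInternalNodes = TRUE)

# Quick check that the parsed document can be queried
test_names <- xpathSApply(doc, "//test", xmlGetAttr, "name")
```

The rest of either solution can then be run on doc unchanged.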
