library(XML)
fileName <- system.file("pensions", "pensions_funds.xml", package = "XML")
parsed <- xmlTreeParse("pension_funds.xml", getDTD = FALSE)
r <- xmlRoot(parsed)
tmp <- xmlSApply(r, function(x) xmlSApply(x, xmlValue))
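The lines above basically follow the example at http://www.omegahat.org/RSXML/gettingStarted.html, but I think I first need to skip the header (I have pasted the first few pages of the file I am trying to read). So I think the approach above may work, but for my purposes it starts at the wrong node. I want to index the obs_values by time period and reference area.
The first thing to do is find the right node and start from there, but I suspect I may be on a fool's errand, since my knowledge of the data format is limited and I am not sure whether the XML package can be used for SDMX-XML files. Smarter people seem to have attempted this already: http://opensdmxdevelopers.wikispaces.com/RSDMX. I cannot find a download link for that package at https://r-forge.r-project.org/projects/rsdmx/ (I see no link or download section, but perhaps I am blind), and it appears to be at an early stage. The existence of rsdmx suggests that reading SDMX with the XML package may not be easy, so I am prepared to give up at this point unless someone has already done this successfully. What I mainly want to read is http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml, but that is a 10 MB file, so I am starting with a smaller one.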
Edit 3: I tried running sgibb's answer on the large file, with the changes from Mischa's comment:
library("XML")
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
sdmxHandler <- function() {
  ## data.frame which stores results
  data <- data.frame(stringsAsFactors=FALSE)
  ## counter to store current row
  i <- 1
  ## temp values to store the attributes of the current Series
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA
  ## handler subroutine for Obs tag
  Obs <- function(name, attr) {
    ## found an Obs tag and now fill data.frame
    data[i, "refArea"] <<- refArea
    data[i, "timePeriod"] <<- as.numeric(attr["TIME_PERIOD"])
    data[i, "obsValue"] <<- as.numeric(attr["OBS_VALUE"])
    data[i, "bsItem"] <<- bsItem
    data[i, "bsCountSector"] <<- bsCountSector
    i <<- i + 1
  }
  ## handler subroutine for Series tag
  Series <- function(name, attr) {
    refArea <<- attr["REF_AREA"]
    bsItem <<- as.character(attr["BS_ITEM"])
    ## read BS_COUNT_SECTOR here (the original line read BS_ITEM a second time)
    bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
  }
  return(list(getData=function() {return(data)},
              Obs=Obs, Series=Series))
}
## run parser
df <- xmlEventParse(file(url), handlers=sdmxHandler())$getData()
Specification mandate value for attribute OBS_VALUE
attributes construct error
Couldn't find end of Start Tag Obs line 15108
Premature end of data in tag Series line 15041
Premature end of data in tag DataSet line 91
Premature end of data in tag CompactData line 2
Error: 1: Specification mandate value for attribute OBS_VALUE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 15108
4: Premature end of data in tag Series line 15041
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)
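One possible reading of these messages is that the parser received an incomplete document: "Premature end of data" is what libxml2 reports when the input stops before the open tags are closed, and an attribute name cut off mid-word would fit the same picture. The sketch below only illustrates that failure mode on a tiny inline string; it does not diagnose the ECB download. The names `parse_ok` and `fetch_and_parse` are hypothetical helpers, and `fetch_and_parse` (download to disk first, then parse the local copy) is an untested suggestion that is defined here but not run.

```r
library(XML)

# A complete one-observation document, modelled on the snippet in this post
complete <- paste0(
  '<CompactData><DataSet><Series REF_AREA="AT">',
  '<Obs TIME_PERIOD="1997-09" OBS_VALUE="275.3"/>',
  '</Series></DataSet></CompactData>'
)

# Cut the text in the middle of an attribute name, as an interrupted
# transfer might do
cut_at <- regexpr("TIME_PE", complete, fixed = TRUE)[1] + 6
truncated <- substr(complete, 1, cut_at)

# Hypothetical helper: TRUE if the text parses, FALSE if the parser errors
parse_ok <- function(txt) {
  tryCatch({
    xmlParse(txt, asText = TRUE)
    TRUE
  }, error = function(e) FALSE)
}

parse_ok(complete)
parse_ok(truncated)

# Hypothetical workaround (assumption, not run here): save the file to disk
# in binary mode, then hand the local copy to the event parser
fetch_and_parse <- function(url, handlers) {
  tmp <- tempfile(fileext = ".xml")
  on.exit(unlink(tmp))
  download.file(url, tmp, mode = "wb", quiet = TRUE)
  xmlEventParse(tmp, handlers = handlers)
}
```

If the truncated string fails while the complete one parses, that at least matches the shape of the errors above; whether the real file was cut short during transfer would still need checking, for example by comparing the size of a saved copy against the size reported by the server.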
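Edit 2: sgibb's answer looks ideal and runs fine on the smaller file. I tried to run it on the larger one and ran into problems:
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
(the 10 MB file; the original link has been corrected). The only modification was to add two lines: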
data[i, "bsItem"] <<- as.character(attr["BS_ITEM"])
data[i, "bsCountSector"] <<- as.numeric(attr["BS_COUNT_SECTOR"])
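These are extra ID variables needed to identify rows in the larger data set. It ran for a few minutes and then failed with the following error:
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Error: 1: Specification mandate value for attribute TIME_PE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 20743
4: Premature end of data in tag Series line 20689
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2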
The basic format of the data looks very similar, so I thought this might work. The basic format of the 10 MB file is:
<Series FREQ="M" REF_AREA="AT" ADJUSTMENT="N" BS_REP_SECTOR="A" BS_ITEM="A20" MATURITY_ORIG="A" DATA_TYPE="1" COUNT_AREA="U2" BS_COUNT_SECTOR="0000" CURRENCY_TRANS="Z01" BS_SUFFIX="E" TIME_FORMAT="P1M" COLLECTION="E">
<Obs TIME_PERIOD="1997-09" OBS_VALUE="275.3" OBS_STATUS="A" OBS_CONF="F"/>
<Obs TIME_PERIOD="1997-10" OBS_VALUE="275.9" OBS_STATUS="A" OBS_CONF="F"/>
<Obs TIME_PERIOD="1997-11" OBS_VALUE="276.6" OBS_STATUS="A" OBS_CONF="F"/>
Edit 1:
Desired data format:
Ref_area time_period obs_value
At 2006 118
At 2007 119
…
Be 2006 101
…
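One way to reach this layout without SAX handlers is to parse the document into a tree and pull the attributes out with XPath. The sketch below is a minimal illustration under stated assumptions, not a tested solution for the ECB files: the inline XML is a trimmed stand-in for the structure shown in this post, the namespace URI and the BE observation are made up, and the variable names are illustrative.

```r
library(XML)

# Trimmed stand-in for the compact SDMX layout (most attributes dropped;
# the namespace URI and the BE series are invented for the example)
sdmx_text <- '<CompactData>
  <DataSet xmlns="urn:example:sdmx">
    <Series REF_AREA="AT" FREQ="A">
      <Obs TIME_PERIOD="2008" OBS_VALUE="112"/>
      <Obs TIME_PERIOD="2009" OBS_VALUE="119"/>
    </Series>
    <Series REF_AREA="BE" FREQ="A">
      <Obs TIME_PERIOD="2008" OBS_VALUE="101"/>
    </Series>
  </DataSet>
</CompactData>'

doc <- xmlParse(sdmx_text, asText = TRUE)

# local-name() matches elements regardless of the default namespace
# declared on <DataSet>, so the header is skipped without extra setup
series_nodes <- getNodeSet(doc, "//*[local-name()='Series']")

# One small data.frame per Series, with REF_AREA repeated for each Obs.
# This assumes every Series has at least one Obs child.
rows <- lapply(series_nodes, function(s) {
  obs <- getNodeSet(s, "./*[local-name()='Obs']")
  data.frame(
    ref_area    = xmlGetAttr(s, "REF_AREA"),
    time_period = vapply(obs, xmlGetAttr, character(1), "TIME_PERIOD"),
    obs_value   = as.numeric(vapply(obs, xmlGetAttr, character(1), "OBS_VALUE")),
    stringsAsFactors = FALSE
  )
})

result <- do.call(rbind, rows)
print(result)
```

TIME_PERIOD is kept as character here because monthly periods such as "1997-09" cannot be converted with as.numeric() (the result is NA with a warning). A tree-based parse holds the whole document in memory, so for the 10 MB file an event-based parser may still be the better fit.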
This is the first part of the data.
</Header>
<DataSet xsi:schemaLocation="https://www.ecb.europa.eu/vocabulary/stats/icpf/1 https://www.ecb.europa.eu/stats/sdmx/icpf/1/structure/2011-08-11/sdmx-compact.xsd" xmlns="https://www.ecb.europa.eu/vocabulary/stats/icpf/1">
<Group DECIMALS="0" TITLE_COMPL="Austria, reporting institutional sector Insurance corporations and pension funds - Closing balance sheet - All financial assets and liabilities - counterpart area World (all entities), counterpart institutional sector Total economy including Rest of the World (all sectors) - Credit (resources/liabilities) - Non-consolidated, Current prices - Euro, Neither seasonally nor working day adjusted - ESA95 TP table Not applicable" UNIT_MULT="9" UNIT="EUR" ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT"/><Series ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT" COLLECTION="E" TIME_FORMAT="P1Y" FREQ="A"><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="112" TIME_PERIOD="2008"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="119" TIME_PERIOD="2009"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="125" TIME_PERIOD="2010"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="127" TIME_PERIOD="2011"/></Series><Group D