使用XML包解析RSS源

Question

使用XML包解析RSS源

5

我正在尝试抓取和解析以下RSS源：http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml。我已经查看了其他关于R和XML的查询，但无法解决我的问题。每个条目的XML代码如下：

        <item>
     <title><![CDATA[Five Rockets Intercepted By Iron Drone Systems Over Be'er Sheva]]></title>
     <link>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</link>
     <description><![CDATA[<a href="http://www.haaretz.com/news/diplomacy-defense/live-blog-rockets-strike-tel-aviv-area-three-israelis-killed-in-attack-on-south-1.477960" target="_hplink">Haaretz reports</a> that five more rockets intercepted by Iron Dome systems over Be'er Sheva. In total, there have been 274 rockets fired and 105 intercepted. The IDF has attacked 250 targets in Gaza.]]></description>
     <guid>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</guid>
     <pubDate>2012-11-15T12:56:09-05:00</pubDate>
     <source url="http://huffingtonpost.com/rss/liveblog/liveblog-1213.xml">Huffingtonpost.com</source>
  </item>

对于每一篇文章/帖子，我想记录“日期”（pubDate），“标题”（title），“描述”（已清理的完整文本）。我尝试使用R中的xml软件包，但承认我是一个新手（几乎没有使用XML的经验，但有一些R的经验）。我正在使用以下代码进行工作，但却毫无进展：

 library(XML)

 xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml"

 # Use the xmlTreePares-function to parse xml file directly from the web

 xmlfile <- xmlTreeParse(xml.url)

# Use the xmlRoot-function to access the top node

xmltop = xmlRoot(xmlfile)

xmlName(xmltop)

names( xmltop[[ 1 ]] )

  title          link   description      language     copyright 
  "title"        "link" "description"    "language"   "copyright" 
 category     generator          docs          item          item 
  "category"   "generator"        "docs"        "item"        "item"

然而，每当我试图操纵“标题”或“描述”信息时，都会不断出现错误。如何解决这段代码的问题，任何帮助都将不胜感激。

谢谢， Thomas

- Thomas

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- agstudy · Accepted Answer

我正在使用优秀的Rcurl库和xpathSApply函数

这个脚本会给你三个列表（标题、发布日期和描述）

library(RCurl)
library(XML)
xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml"
script  <- getURL(xml.url)
doc     <- xmlParse(script)
titles    <- xpathSApply(doc,'//item/title',xmlValue)
descriptions    <- xpathSApply(doc,'//item/description',xmlValue)
pubdates <- xpathSApply(doc,'//item/pubDate',xmlValue)