Parsing XML with Nokogiri

3

I'm having some trouble getting set up with Nokogiri, and its documentation is a little hard to get started with.

I'm trying to parse this XML file: http://www.kongregate.com/games_for_your_site.xml

It returns multiple games, each with a title, a description, and so on:

<gameset>
  <game>
    <id>160342</id>
    <title>Tricky Rick</title>
    <thumbnail>
      http://cdn3.kongregate.com/game_icons/0042/7180/KONG_icon250x200_site.png?21656-op
    </thumbnail>
    <launch_date>2012-12-12</launch_date>
    <category>Puzzle</category>
    <flash_file>
      http://external.kongregate-games.com/gamez/0016/0342/live/embeddable_160342.swf
    </flash_file>
    <width>640</width>
    <height>480</height>
    <url>
      http://www.kongregate.com/games/tAMAS_Games/tricky-rick
    </url>
    <description>
      Help Rick to collect all the stolen fuel to refuel his spaceship and fly away from the planet. Use hammer, bombs, jetpack and other useful stuff to solve puzzles!
    </description>
    <instructions>
      WASD \ Arrow Keys &#8211; move; S \ Down Arrow &#8211; take\release an object; CNTRL &#8211; interaction with objects: throw, hammer strike, invisibility mode; SPACE &#8211; interaction with elevators and fuel stations; Esc \ P &#8211; pause;
    </instructions>
    <developer_name>tAMAS_Games</developer_name>
    <gameplays>24999</gameplays>
    <rating>3.43</rating>
  </game>
  <game>
    <id>160758</id>
    <title>Flying Cookie Quest</title>
    <thumbnail>
      http://cdn2.kongregate.com/game_icons/0042/8428/icon_cookiequest_kong_250x200_site.png?16578-op
    </thumbnail>
    <launch_date>2012-12-07</launch_date>
    <category>Action</category>
    <flash_file>
      http://external.kongregate-games.com/gamez/0016/0758/live/embeddable_160758.swf
    </flash_file>
    <width>640</width>
    <height>480</height>
    <url>
      http://www.kongregate.com/games/LongAnimals/flying-cookie-quest
    </url>
    <description>
      Launch Rocket Panda into the land of Cookies. With the help of low-flying sharks, hang-gliding sheep and Rocket Badger, can you defeat the all powerful Biscuit Head? Defeat All enemies of cookies in this launcher game.
    </description>
    <instructions>Use the mouse button!</instructions>
    <developer_name>LongAnimals</developer_name>
    <gameplays>168672</gameplays>
    <rating>3.67</rating>
  </game>

Following the documentation, I'm using something like this:

require 'nokogiri'
require 'open-uri'

url = "http://www.kongregate.com/games_for_your_site.xml"
xml = Nokogiri::XML(open(url))
xml.xpath("//game").each do |node|
    puts node.xpath("//id")
    puts node.xpath("//title")
    puts node.xpath("//thumbnail")
    puts node.xpath("//category")
    puts node.xpath("//flash_file")
    puts node.xpath("//width")
    puts node.xpath("//height")
    puts node.xpath("//description")
    puts node.xpath("//instructions")
end

However, it just returns an endless stream of data that isn't grouped by game. Any help would be appreciated.

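What do you find lacking in the Nokogiri documentation? Is something missing from the tutorials at http://nokogiri.org/? Is there anything missing from the rdoc that would have helped you? - Mike Dalessio
It really comes down to what the Tin Man said below. - thebusiness11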
2 Answers

20
Here's how I'd rewrite your code:
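require 'nokogiri'
require 'open-uri'

# On Ruby 3.0+, open-uri fetches URLs through URI.open; Kernel#open no longer does.
xml = Nokogiri::XML(URI.open("http://www.kongregate.com/games_for_your_site.xml"))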
xml.xpath("//game").each do |game|
  %w[id title thumbnail category flash_file width height description instructions].each do |n|
    puts game.at(n)
  end
end

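The problem with your code is that every child tag is prefixed with //, which in XPath means "start at the root node and search downward for all tags with this name." So instead of searching only inside each //game node, it searches the entire document for each of the listed tags, once for every //game node.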
I'd recommend CSS accessors over XPath because they are (usually) simpler and easier to read. So, instead of xpath('//game'), I use search('game'). (search accepts either CSS or XPath accessors, as does at.)
If you want the text contained in the tag, change puts game.at(n) to:
puts game.at(n).text

To make the output more useful, I'd do this:
require 'nokogiri'
require 'open-uri'

# URI.open is required on Ruby 3.0+.
xml = Nokogiri::XML(URI.open('http://www.kongregate.com/games_for_your_site.xml'))
games = xml.search('game').map do |game|
  %w[
    id title thumbnail category flash_file width height description instructions
  ].each_with_object({}) do |n, o|
    o[n] = game.at(n).text
  end
end

require 'awesome_print'
puts games.size
ap games.first
ap games.last

Which results in:
395
{
              "id" => "160342",
          "title"  => "Tricky Rick",
      "thumbnail"  => "http://cdn3.kongregate.com/game_icons/0042/7180/KONG_icon250x200_site.png?21656-op",
        "category" => "Puzzle",
      "flash_file" => "http://external.kongregate-games.com/gamez/0016/0342/live/embeddable_160342.swf",
          "width"  => "640",
          "height" => "480",
    "description"  => "Help Rick to collect all the stolen fuel to refuel his spaceship and fly away from the planet. Use hammer, bombs, jetpack and other useful stuff to solve puzzles!\n",
    "instructions" => "WASD \\ Arrow Keys &#8211; move;\nS \\ Down Arrow &#8211; take\\release an object;\nCNTRL &#8211; interaction with objects: throw, hammer strike, invisibility mode;\nSPACE &#8211; interaction with elevators and fuel stations;\nEsc \\ P &#8211; pause;\n"
}
{
              "id" => "78",
          "title"  => "rotaZion",
      "thumbnail"  => "http://cdn2.kongregate.com/game_icons/0000/0115/pixtiz.rotazion_icon.jpg?8217-op",
        "category" => "Action",
      "flash_file" => "http://external.kongregate-games.com/gamez/0000/0078/live/embeddable_78.swf",
          "width"  => "350",
          "height" => "350",
    "description"  => "In rotaZion, you play with a bubble bar that you can&#8217;t stop rotating !\nCollect the bubbles and try to avoid the mines !\nCollect the different bonus to protect your bubble bar, makes the mines go slower or destroy all the mines !\nTry to beat 100.000 points ;)\n",
    "instructions" => "Move the bubble bar with the arrow keys !\nBubble = 500 Points !\nPixtiz sign = 5000 Points !\n"
}

1
Great answer. +1 for all the explanation and code. - nikhil
2
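XPath's // confuses everyone when they first start using it. - the Tin Man
This is great, but the end goal is to store this in a database, with one row for each game in the gameset. Will this array work for that? - thebusiness11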
1
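Easily. We do it all the time, but exactly how is left for you to work out. One hint: each embedded hash is a separate row. If the keys don't map directly to field names, you can create an array of the appropriate field names, zip it with each hash's values, and turn the result back into a hash with something like Hash[['foo', 'bar'].zip(hash.values)]. Also, some DBMs can import XML directly, so parsing it might not be necessary. Import into a temporary table, drop the unneeded fields, then copy the resulting table into your production table. - the Tin Man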
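The zip hint can be sketched in plain Ruby. The column names below (kongregate_id, name, swf_url) are hypothetical, chosen only for illustration; the game hash is a trimmed copy of the first result above:

```ruby
# One parsed game, shaped like the hashes built from the feed.
game = {
  'id'         => '160342',
  'title'      => 'Tricky Rick',
  'flash_file' => 'http://external.kongregate-games.com/gamez/0016/0342/live/embeddable_160342.swf'
}

# Hypothetical database column names, in the same order as the hash's values.
columns = %w[kongregate_id name swf_url]

# Pair each column with the value at the same position, then rebuild a hash.
row = Hash[columns.zip(game.values)]

p row
```

This relies on the hash keeping insertion order, so the columns array has to list fields in the same order in which the hash was built. Each resulting row hash can then be handed to whatever insert call the database library provides.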

1
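You could try something like this. I'd suggest building an array of the elements inside each game and then iterating over them. I'm sure Nokogiri has a way to get all the elements inside a given element, but this works: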
xml = Nokogiri::XML(result)
xml.css("game").each do |inv|
  inv.css("title").each do |f|  # title or whatever else you want
    puts f.inner_html
  end
end

2
inner_html is rarely useful. What you really want here is f.text, and since each game has only one title, there isn't much need for each. - pguardiario
