Parsing large XML files with Ruby and Nokogiri

Question

9

I need to parse a large XML file (about 10K lines) on a regular basis. It is formatted like this:

<summarysection>
    <totalcount>10000</totalcount>
</summarysection>
<items>
     <item>
         <cat>Category</cat>
         <name>Name 1</name>
         <value>Val 1</value>
     </item>
     ...... 10,000 more times
</items>

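What I would like to do is use Nokogiri to parse each individual node and count the items in one category. Then I would like to subtract that number from totalcount to get output that reads "Count of Interest_Category: n, Count of All Else: z".

Here is my code: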

#!/usr/bin/ruby

require 'rubygems'
require 'nokogiri'
require 'open-uri'

icount = 0 
xmlfeed = Nokogiri::XML(open("/path/to/file/all.xml"))
all_items = xmlfeed.xpath("//items")

  all_items.each do |adv|
            if (adv.children.filter("cat").first.child.inner_text.include? "partofcatname")
                icount = icount + 1
            end
  end

othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount 

puts icount
puts othercount

This seems to work, but it is very slow! I'm talking about more than 10 minutes for 10,000 items. Is there a better way? Is my approach less than optimal?

- DNadel

1

https://dev59.com/xlHTa4cB1Zd3GeqPRnMd - Ismael

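@ismaelga I know Nokogiri is usually very fast at this kind of thing. What I'm really wondering is whether my syntax makes the most of the gem, or whether my code could be optimized. - DNadel

The first answer at that link mentions that using something other than XPath can improve performance. - Ismael

5 Answers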

4

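I'd recommend using a SAX parser rather than a DOM parser for a file this large. Nokogiri has a nice SAX parser built in: http://nokogiri.org/Nokogiri/XML/SAX.html.

The SAX approach works well for large files because it doesn't build a giant DOM tree, which in your case is unnecessary; you can build up your own structures as events fire (counting nodes, for example).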

- Eric Wood

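For comparison, see my answer: although the memory savings of SAX are nice (and sometimes critical), its performance is worse, even for something as trivial as this. - Phrogz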

3

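You can dramatically decrease your execution time by changing your code to the following. Just change the "99" to whatever category you want to check.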

require 'rubygems'
require 'nokogiri'
require 'open-uri'

icount = 0 
xmlfeed = Nokogiri::XML(open("test.xml"))
items = xmlfeed.xpath("//item")
items.each do |item|
  text = item.children.children.first.text  
  if ( text =~ /99/ )
    icount += 1
  end
end

othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount 

puts icount
puts othercount

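On my machine this takes about three seconds. I think a key mistake you made was iterating over "items" instead of creating a collection of the "item" nodes. That made your iteration code awkward and slow.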

- vlasits

1

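I won't claim that my approach is better than using a SAX parser, but it does do what you said you wanted: it cuts the execution time down to a manageable range. - vlasits

Thanks! The change you suggested did the trick. I was actually iterating over item rather than items (I garbled the code while editing it for posting), but the matching method you used (=~) is so much faster than "include?". Why is it so much faster than "include?"? Either way, it now runs in under 5 minutes. Thanks again! - DNadel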

1

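Just curious: how long does it take now? You say "under 5 minutes", but mine returned in only 3 seconds. - vlasits

Also, the speed-up is not due to the difference between include? and =~. They benchmark quite similarly. My guess is that "adv.children.filter("cat").first.child.inner_text" was the poorly performing part. - vlasits

See my answer for a speed comparison of SAX versus DOM, and for why yours may be running slowly. - Phrogz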
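One way to check the claim about include? and =~ on your own machine is a small benchmark with Ruby's standard Benchmark module. This is only a sketch: the sample strings, the repetition count, and the helper names are assumptions, and timings will vary by Ruby version and hardware.

```ruby
require 'benchmark'

# Count strings containing the needle using String#include?.
def count_with_include(strings, needle)
  strings.count { |s| s.include?(needle) }
end

# Count strings matching the pattern using =~.
def count_with_regex(strings, pattern)
  strings.count { |s| s =~ pattern }
end

# Hypothetical data: 10,000 category strings, as many as the question has items.
CATEGORY_STRINGS = Array.new(10_000) { |i| "category #{i % 100}" }

Benchmark.bm(10) do |bm|
  bm.report('include?') { 100.times { count_with_include(CATEGORY_STRINGS, '99') } }
  bm.report('=~')       { 100.times { count_with_regex(CATEGORY_STRINGS, /99/) } }
end
```

Both helpers return the same count, so any large difference in the asker's total run time has to come from somewhere other than the string match.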

0

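Take a look at Greg Weber's version of Paul Dix's sax-machine gem: http://blog.gregweber.info/posts/2011-06-03-high-performance-rb-part1

Parsing a large file with SaxMachine seems to load the whole file into memory

sax-machine makes the code much simpler; Greg's variant makes it stream.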

- Martin Cleaver

0

You may want to try this: https://github.com/amolpujari/reading-huge-xml.

HugeXML.read xml, elements_lookup do |element|
  # => element{ :name, :value, :attributes}
end

I also tried using ox.

- Amol Pujari

2

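While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - Tisho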

- Phrogz · Accepted Answer

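Here is an example that compares a SAX parser count with a DOM-based count, counting 500,000 <item>s, each with one of seven categories. First, the output:

Create XML file: 1.7s
Count via SAX: 12.9s
Create DOM: 1.6s
Count via DOM: 2.5s

Both techniques produce the same hash, counting the number of items seen in each category: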

{"Cats"=>71423, "Llamas"=>71290, "Pigs"=>71730, "Sheep"=>71491, "Dogs"=>71331, "Cows"=>71536, "Hogs"=>71199}

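The SAX version takes 12.9s to count and categorize, while the DOM version takes only 1.6s to create the DOM elements and a further 2.5s to find and categorize all the <cat> values. The DOM version is about 3x as fast! ...but that's not the entire story. We also have to look at RAM usage.

- For 500,000 items, SAX (12.9s) peaks at 238MB of RAM; DOM (4.1s) peaks at 1.0GB.
- For 1,000,000 items, SAX (25.5s) peaks at 243MB of RAM; DOM (8.1s) peaks at 2.0GB.
- For 2,000,000 items, SAX (55.1s) peaks at 250MB of RAM; DOM (???) peaks at 3.2GB.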
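As a rough back-of-the-envelope check on these figures, the sketch below derives throughput and approximate DOM memory per item for the 500,000-item run. It assumes 1GB means 1024^3 bytes and ignores the interpreter's baseline memory, so the per-item figure is only an upper-bound estimate; the helper names are made up for illustration.

```ruby
# Figures reported above for the 500,000-item run.
REPORTED_ITEMS = 500_000
SAX_SECONDS    = 12.9
DOM_SECONDS    = 1.6 + 2.5 # create DOM + count via DOM
DOM_PEAK_GB    = 1.0

# Items processed per second.
def items_per_second(items, seconds)
  items / seconds
end

# Peak memory divided by item count (assumes 1GB = 1024**3 bytes).
def bytes_per_item(peak_gb, items)
  peak_gb * 1024**3 / items
end

puts format('SAX: %.0f items/s', items_per_second(REPORTED_ITEMS, SAX_SECONDS))
puts format('DOM: %.0f items/s', items_per_second(REPORTED_ITEMS, DOM_SECONDS))
puts format('DOM: about %.0f bytes of peak RAM per item', bytes_per_item(DOM_PEAK_GB, REPORTED_ITEMS))
```

On these numbers the DOM costs on the order of a couple of kilobytes of RAM per item, while the SAX peak barely moves as the item count doubles.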

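I had enough memory on my machine to handle 1,000,000 items, but at 2,000,000 I ran out of RAM and had to start using virtual memory. Even with an SSD and a fast machine, I let the DOM code run for almost ten minutes before finally killing it.

The long times you are reporting are very likely because you are running out of RAM and continually hitting the disk as part of virtual memory. If you can fit the DOM into memory, use it, because it is fast. If you cannot, however, you really have to use the SAX version.

Here is the test code: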

require 'nokogiri'

CATEGORIES = %w[ Cats Dogs Hogs Cows Sheep Pigs Llamas ]
ITEM_COUNT = 500_000

def test!
  create_xml
  sleep 2; GC.start # Time to read memory before cleaning the slate
  test_sax
  sleep 2; GC.start # Time to read memory before cleaning the slate
  test_dom
end

def time(label)
  t1 = Time.now
  yield.tap{ puts "%s: %.1fs" % [ label, Time.now-t1 ] }
end

def test_sax
  item_counts = time("Count via SAX") do
    counter = CategoryCounter.new
    # Use parse_file so we can stream data from disk instead of flooding RAM
    Nokogiri::XML::SAX::Parser.new(counter).parse_file('tmp.xml')
    counter.category_counts
  end
  # p item_counts
end

def test_dom
  doc = time("Create DOM"){ File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) } }
  counts = time("Count via DOM") do
    counts = Hash.new(0)
    doc.xpath('//cat').each do |cat|
      counts[cat.children[0].content] += 1
    end
    counts
  end
  # p counts
end

class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
  end
  def start_element(name,att=nil)
    @count = name=='cat'
  end
  def characters(str)
    if @count
      @category_counts[str] += 1
      @count = false
    end
  end
end

def create_xml
  time("Create XML file") do
    File.open('tmp.xml','w') do |f|
      f << "<root>
      <summarysection><totalcount>10000</totalcount></summarysection>
      <items>
      #{
        ITEM_COUNT.times.map{ |i|
          "<item>
            <cat>#{CATEGORIES.sample}</cat>
            <name>Name #{i}</name>
            <value>Value #{i}</value>
          </item>"
        }.join("\n")
      }
      </items>
      </root>"
    end
  end
end

test! if __FILE__ == $0

How does the DOM counting work?

If we strip away some of the test structure, the DOM-based counter looks like this:

# Open the file on disk and pass it to Nokogiri so that it can stream read;
# Better than  doc = Nokogiri.XML(IO.read('tmp.xml'))
# which requires us to load a huge string into memory just to parse it
doc = File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) }

# Create a hash with default '0' values for any 'missing' keys
counts = Hash.new(0) 

# Find every `<cat>` element in the document (assumes one per <item>)
doc.xpath('//cat').each do |cat|
  # Get the child text node's content and use it as the key to the hash
  counts[cat.children[0].content] += 1
end

How does the SAX counting work?

First, let's focus on this code:

class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
  end
  def start_element(name,att=nil)
    @count = name=='cat'
  end
  def characters(str)
    if @count
      @category_counts[str] += 1
      @count = false
    end
  end
end

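When we create a new instance of this class, we get an object with a hash that defaults to 0 for all values and a couple of methods that can be called on it. The SAX parser will call these methods as it runs through the document.

Each time the SAX parser sees a new element, it calls the start_element method on this class. When that happens, we set a flag based on whether this element is named "cat" (so that we can read its name later).
Each time the SAX parser reads a chunk of text, it calls the characters method of our object. When that happens, we check whether the last element we saw was a category (i.e. whether @count was set to true); if so, we use the value of this text node as the category name and add one to our counter.

To use our custom object with Nokogiri's SAX parser, we do this: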

# Create a new instance, with its empty hash
counter = CategoryCounter.new

# Create a new parser that will call methods on our object, and then
# use `parse_file` so that it streams data from disk instead of flooding RAM
Nokogiri::XML::SAX::Parser.new(counter).parse_file('tmp.xml')

# Once that's done, we can get the hash of category counts back from our object
counts = counter.category_counts
p counts["Pigs"]