How do I use SAX with Nokogiri?


I want to parse a very large file (240 MB), and I have to use SAX so that the whole file is not loaded into memory.

My XML looks like this:

<?xml version="1.0" encoding="utf-8"?>
<hotels>
  <hotel>
    <hotelId>1568054</hotelId>
    <hotelFileName>Der_Obere_Wirt_zum_Queri</hotelFileName>
    <hotelName>"Der Obere Wirt" zum Queri</hotelName>
    <rating>3</rating>
    <cityId>34633</cityId>
    <cityFileName>Andechs</cityFileName>
    <cityName>Andechs</cityName>
    <stateId>212</stateId>
    <stateFileName>Bavaria</stateFileName>
    <stateName>Bavaria</stateName>
    <countryCode>DE</countryCode>
    <countryFileName>Germany</countryFileName>
    <countryName>Germany</countryName>
    <imageId>51498149</imageId>
    <Address>Georg Queri Ring 9</Address>
    <minRate>85.9800</minRate>
    <currencyCode>EUR</currencyCode>
    <Latitude>48.009423000000</Latitude>
    <Longitude>11.214504000000</Longitude>
    <NumberOfReviews>16</NumberOfReviews>
    <ConsumerRating>4.25</ConsumerRating>
    <PropertyType>0</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>1|3|5|8|22|27|45|49|53|56|64|66|67|139|202|209|213|256|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1658359</hotelId>
    <hotelFileName>Seclusions_of_Yallingup</hotelFileName>
    <hotelName>"Seclusions" of Yallingup</hotelName>
    <rating>4</rating>
    <cityId>72257</cityId>
    <cityFileName>Yallingup</cityFileName>
    <cityName>Yallingup</cityName>
    <stateId>172</stateId>
    <stateFileName>Western_Australia</stateFileName>
    <stateName>Western Australia</stateName>
    <countryCode>AU</countryCode>
    <countryFileName>Australia</countryFileName>
    <countryName>Australia</countryName>
    <imageId>53234107</imageId>
    <Address>58 Zamia Grove</Address>
    <minRate>218.1825</minRate>
    <currencyCode>AUD</currencyCode>
    <Latitude>-33.691192000000</Latitude>
    <Longitude>115.061938999999</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>3</PropertyType>
    <ChainID>0</ChainID>
     <Facilities>3|6|13|14|21|22|28|39|40|41|51|53|54|56|57|58|65|66|141|191|202|204|209|210|211|292|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1491947</hotelId>
    <hotelFileName>1_Melrose_Blvd</hotelFileName>
    <hotelName>#1 Melrose Blvd</hotelName>
    <rating>5</rating>
    <cityId>964</cityId>
    <cityFileName>Johannesburg</cityFileName>
    <cityName>Johannesburg</cityName>
    <stateId/>
    <stateFileName/>
    <stateName/>
    <countryCode>ZA</countryCode>
    <countryFileName>South_Africa</countryFileName>
    <countryName>South Africa</countryName>
    <imageId>46777171</imageId>
    <Address>1 Melrose Boulevard Melrose Arch</Address>
    <minRate/>
    <currencyCode>ZAR</currencyCode>
    <Latitude>-26.135656000000</Latitude>
    <Longitude>28.067751000000</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>9</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>6|7|9|11|12|15|17|18|21|32|34|39|41|42|50|51|56|58|60|140|173|202|293|296|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1726938</hotelId>
    <hotelFileName>1_Value_Inn_Clovis</hotelFileName>
    <hotelName>#1 Value Inn Clovis</hotelName>
    <rating>2</rating>
    <cityId>28538</cityId>
    <cityFileName>Clovis_New_Mexico</cityFileName>
    <cityName>Clovis (New Mexico)</cityName>
    <stateId>32</stateId>
    <stateFileName>New_Mexico</stateFileName>
    <stateName>New Mexico</stateName>
    <countryCode>US</countryCode>
    <countryFileName>United_States</countryFileName>
    <countryName>United States</countryName>
    <imageId/>
    <Address>1720 Mabry</Address>
    <minRate/>
    <currencyCode>USD</currencyCode>
    <Latitude>34.396549224853</Latitude>
    <Longitude>-103.182769775390</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>2</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>6|7|8|18|21|22|27|41|50|52|56|222|281|292|</Facilities>
  </hotel>
</hotels>

I tried the following code:
class Wikihandler  < Nokogiri::XML::SAX::Document

  def initialize
    # do one-time setup here, called as part of Class.new
  end

  def start_element(name, attributes = [])
  # check the element name here and create an active record object if appropriate
   if name == 'hotel'
    a = Hash[*attributes]
    puts attributes
    # more business...
   end
  end

  def characters(s)
     # save the characters that appear here and possibly use them in the current tag object
  end

  def end_element(name)
     # check the tag name and possibly use the characters you've collected
     # and save your activerecord object now
  end

end

parser = Nokogiri::XML::SAX::Parser.new(Wikihandler.new)
parser.parse_file('HotelCombinedXml/Hotels_All.xml')

I can access a tag's name, but how do I get its content?
1 Answer


Wikihandler#characters will give you the content. You can do something like this:

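require "nokogiri"

class MyDocument < Nokogiri::XML::SAX::Document
  attr_accessor :is_name

  def initialize
    @is_name = false # true only while the parser is inside <hotelName>
  end

  def end_document
    puts "the document has ended"
  end

  def start_element(name, attributes = [])
    @is_name = name.eql?("hotelName")
  end

  def end_element(name)
    @is_name = false if name.eql?("hotelName")
  end

  # Called with the text inside an element; long text may arrive in several calls.
  def characters(string)
    text = string.strip
    puts "Name: #{text}" if @is_name && !text.empty?
  end
end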

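However, if you want to make your life easier, I suggest taking a look at sax-machine. It adds some nice features on top of Nokogiri's SAX parser and (in my opinion) a friendlier interface. Here is some sample code along with specs: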

require "sax-machine"
require "rspec"

XML = <<XML
<?xml version="1.0" encoding="utf-8"?>
<hotels>
  <hotel>
    <hotelId>1568054</hotelId>
    <hotelFileName>Der_Obere_Wirt_zum_Queri</hotelFileName>
    <hotelName>"Der Obere Wirt" zum Queri</hotelName>
    <rating>3</rating>
    <cityId>34633</cityId>
    <cityFileName>Andechs</cityFileName>
    <cityName>Andechs</cityName>
    <stateId>212</stateId>
    <stateFileName>Bavaria</stateFileName>
    <stateName>Bavaria</stateName>
    <countryCode>DE</countryCode>
    <countryFileName>Germany</countryFileName>
    <countryName>Germany</countryName>
    <imageId>51498149</imageId>
    <Address>Georg Queri Ring 9</Address>
    <minRate>85.9800</minRate>
    <currencyCode>EUR</currencyCode>
    <Latitude>48.009423000000</Latitude>
    <Longitude>11.214504000000</Longitude>
    <NumberOfReviews>16</NumberOfReviews>
    <ConsumerRating>4.25</ConsumerRating>
    <PropertyType>0</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>1|3|5|8|22|27|45|49|53|56|64|66|67|139|202|209|213|256|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1658359</hotelId>
    <hotelFileName>Seclusions_of_Yallingup</hotelFileName>
    <hotelName>"Seclusions" of Yallingup</hotelName>
    <rating>4</rating>
    <cityId>72257</cityId>
    <cityFileName>Yallingup</cityFileName>
    <cityName>Yallingup</cityName>
    <stateId>172</stateId>
    <stateFileName>Western_Australia</stateFileName>
    <stateName>Western Australia</stateName>
    <countryCode>AU</countryCode>
    <countryFileName>Australia</countryFileName>
    <countryName>Australia</countryName>
    <imageId>53234107</imageId>
    <Address>58 Zamia Grove</Address>
    <minRate>218.1825</minRate>
    <currencyCode>AUD</currencyCode>
    <Latitude>-33.691192000000</Latitude>
    <Longitude>115.061938999999</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>3</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>3|6|13|14|21|22|28|39|40|41|51|53|54|56|57|58|65|66|141|191|202|204|209|210|211|292|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1491947</hotelId>
    <hotelFileName>1_Melrose_Blvd</hotelFileName>
    <hotelName>#1 Melrose Blvd</hotelName>
    <rating>5</rating>
    <cityId>964</cityId>
    <cityFileName>Johannesburg</cityFileName>
    <cityName>Johannesburg</cityName>
    <stateId/>
    <stateFileName/>
    <stateName/>
    <countryCode>ZA</countryCode>
    <countryFileName>South_Africa</countryFileName>
    <countryName>South Africa</countryName>
    <imageId>46777171</imageId>
    <Address>1 Melrose Boulevard Melrose Arch</Address>
    <minRate/>
    <currencyCode>ZAR</currencyCode>
    <Latitude>-26.135656000000</Latitude>
    <Longitude>28.067751000000</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>9</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>6|7|9|11|12|15|17|18|21|32|34|39|41|42|50|51|56|58|60|140|173|202|293|296|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1726938</hotelId>
    <hotelFileName>1_Value_Inn_Clovis</hotelFileName>
    <hotelName>#1 Value Inn Clovis</hotelName>
    <rating>2</rating>
    <cityId>28538</cityId>
    <cityFileName>Clovis_New_Mexico</cityFileName>
    <cityName>Clovis (New Mexico)</cityName>
    <stateId>32</stateId>
    <stateFileName>New_Mexico</stateFileName>
    <stateName>New Mexico</stateName>
    <countryCode>US</countryCode>
    <countryFileName>United_States</countryFileName>
    <countryName>United States</countryName>
    <imageId/>
    <Address>1720 Mabry</Address>
    <minRate/>
    <currencyCode>USD</currencyCode>
    <Latitude>34.396549224853</Latitude>
    <Longitude>-103.182769775390</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>2</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>6|7|8|18|21|22|27|41|50|52|56|222|281|292|</Facilities>
  </hotel>
</hotels>
XML

class Hotel
  include SAXMachine
  element :hotelId, :as => :id
  element :hotelName, :as => :name
end

class Wikihandler
  include SAXMachine
  elements :hotel, :as => :hotels, :class => Hotel
end

describe Wikihandler do
  before(:all) do
    @parser = Wikihandler.new
    @parser.parse XML
  end

  it "parses the proper number of hotels" do
    expect(@parser.hotels.count).to eq 4
  end

  it "parses the hotel id of each entry" do
    expect(@parser.hotels[0].id).to eq "1568054"
  end

  it "parses the hotel name of each entry" do
    expect(@parser.hotels[0].name).to eq '"Der Obere Wirt" zum Queri'
  end
end

Sax Machine still tries to read the entire document first, which won't work for larger files. :( - unflores
You saved my day! Thank you very much! - sadfuzzy
