Parsing XML with Java

Question


3

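I have written a PHP script that parses an XML file. The script is not very convenient to use, so I would like to implement it in Java.

Inside the first element there is a varying number of wfs:member elements that I need to loop through: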

foreach ($data->children("wfs", true)->member as $member) { }

This part is easy to do in Java:

NodeList wfsMember = doc.getElementsByTagName("wfs:member");
for(int i = 0; i < wfsMember.getLength(); i++) { }

I open the XML file as follows.

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document doc = documentBuilder.parse(WeatherDatabaseUpdater.class.getResourceAsStream("wfs.xml"));

Next I need to get an attribute from an element named observedProperty. In PHP this is simple:

$member->
    children("omso", true)->PointTimeSeriesObservation->
    children("om", true)->observedProperty->
    attributes("xlink", true)->href

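But how do I do this in Java? If I want to go deeper into the structure, do I need to call getElementsByTagName at every level and loop over the results?

In PHP, the whole script looks like this.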

foreach ($data->children("wfs", true)->member as $member) {
    $dataType = $dataTypes[(string) $member->
                    children("omso", true)->PointTimeSeriesObservation->
                    children("om", true)->observedProperty->
                    attributes("xlink", true)->href];

    foreach ($member->
            children("omso", true)->PointTimeSeriesObservation->
            children("om", true)->result->
            children("wml2", true)->MeasurementTimeseries->
            children("wml2", true)->point as $point) {

        $time = $point->children("wml2", true)->MeasurementTVP->children("wml2", true)->time;
        $value = $point->children("wml2", true)->MeasurementTVP->children("wml2", true)->value;

        $data[$dataType][] = array($time, $value);
    }
}

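In the second foreach I loop over the observation elements, read the time and value from each one, and store them in an array. If I had to loop over the elements in Java the way I described, this would be very tedious. I doubt that is really necessary, so can someone advise me on how to implement something similar in Java?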

- MikkoP

Why not use a DOM parser? - iordanis

6 Answers

4

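You have several options for parsing XML in Java.

The most common are DOM, SAX, and StAX.

Each has its pros and cons. With DOM and SAX you can validate your XML against an XSD schema. StAX works without XSD validation and is faster.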
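As a rough illustration of XSD validation, here is a minimal sketch that uses the JDK's javax.xml.validation API on a plain stream. The inline schema, the class name XsdValidationSketch, and the helper isValid are assumptions made up for this example; they are not taken from the employee files used below.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class XsdValidationSketch {

    // Hypothetical schema for illustration only: a <staff> root holding
    // one or more <employee> elements, each with a single <name> child.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='staff'><xs:complexType><xs:sequence>"
        + "<xs:element name='employee' maxOccurs='unbounded'><xs:complexType><xs:sequence>"
        + "<xs:element name='name' type='xs:string'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Returns true if the XML text validates against XSD, false if it is rejected. */
    static boolean isValid(String xml) throws IOException, SAXException {
        SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = schemaFactory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler installed, validation errors are thrown as SAXException
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<staff><employee><name>Carl Cracker</name></employee></staff>"));
        System.out.println(isValid("<staff><employee><salary>1</salary></employee></staff>"));
    }
}
```

The same Validator also accepts a DOMSource or SAXSource, so the check can be run on a document that has already been parsed.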

For example, take this XML file:

<?xml version="1.0" encoding="UTF-8"?>
<staff xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="oldEmployee.xsd">
    <employee>
        <name>Carl Cracker</name>
        <salary>75000</salary>
        <hiredate year="1987" month="12" day="15" />
    </employee>
    <employee>
        <name>Harry Hacker</name>
        <salary>50000</salary>
        <hiredate year="1989" month="10" day="1" />
    </employee>
    <employee>
        <name>Tony Tester</name>
        <salary>40000</salary>
        <hiredate year="1990" month="3" day="15" />
    </employee>
</staff>

In my impression, the slowest implementation is the DOM parser:

class DomXmlParser {    
    private Document document;
    List<Employee> empList = new ArrayList<>();

    public SchemaFactory schemaFactory;
    public final String JAXP_SCHEMA_LANGUAGE = "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    public final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";    

    public DomXmlParser() {  
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            DocumentBuilder builder = factory.newDocumentBuilder();
            document = builder.parse(new File(EMPLOYEE_XML.getFilename()));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }    

    public List<Employee> parseFromXmlToEmployee() {
        NodeList nodeList = document.getDocumentElement().getChildNodes();
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);

            if (node instanceof Element) {
                Employee emp = new Employee();

                NodeList childNodes = node.getChildNodes();
                for (int j = 0; j < childNodes.getLength(); j++) {
                    Node cNode = childNodes.item(j);

                    // identify the child tag of employees
                    if (cNode instanceof Element) {
                        switch (cNode.getNodeName()) {
                            case "name":
                                emp.setName(text(cNode));
                                break;
                            case "salary":
                                emp.setSalary(Double.parseDouble(text(cNode)));
                                break;
                            case "hiredate":
                                int yearAttr = Integer.parseInt(cNode.getAttributes().getNamedItem("year").getNodeValue());
                                int monthAttr =  Integer.parseInt(cNode.getAttributes().getNamedItem("month").getNodeValue());
                                int dayAttr =  Integer.parseInt(cNode.getAttributes().getNamedItem("day").getNodeValue());

                                emp.setHireDay(yearAttr, monthAttr - 1, dayAttr);
                                break;
                        }
                    }
                }
                empList.add(emp);
            }
        }
        return empList;
    }
    private String text(Node cNode) {
        return cNode.getTextContent().trim();
    }
}

SAX parser:

class SaxHandler extends DefaultHandler {

    private Stack<String> elementStack = new Stack<>();
    private Stack<Object> objectStack = new Stack<>();

    public List<Employee> employees = new ArrayList<>();
    Employee employee = null;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        this.elementStack.push(qName);

        if ("employee".equals(qName)) {
            employee = new Employee();
            this.objectStack.push(employee);
            this.employees.add(employee);
        }
        if("hiredate".equals(qName))
        {
            int yearatt = Integer.parseInt(attributes.getValue("year"));
            int monthatt = Integer.parseInt(attributes.getValue("month"));
            int dayatt = Integer.parseInt(attributes.getValue("day"));

            if (employee != null) {
                employee.setHireDay(yearatt,  monthatt - 1,  dayatt) ;
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        this.elementStack.pop();

        if ("employee".equals(qName)) {
            this.objectStack.pop();   // done with this employee
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        String value = new String(ch, start, length).trim();
        if (value.length() == 0) return;        // skip white space

        if ("name".equals(currentElement())) {
            employee = (Employee) this.objectStack.peek();
            employee.setName(value);
        } else if ("salary".equals(currentElement()) && "employee".equals(currentParentElement())) {
            employee.setSalary(Double.parseDouble(value));
        }
    }

    private String currentElement() {
        return this.elementStack.peek();
    }

    private String currentParentElement() {
        if (this.elementStack.size() < 2) return null;
        return this.elementStack.get(this.elementStack.size() - 2);
    }
}
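The handler above is not shown being invoked. As a minimal sketch of how a SAX handler is typically driven, the example below feeds an in-memory document to a SAXParser. NameCollector is a hypothetical stand-in for SaxHandler (it only collects name text), and SaxDriverSketch and parseNames are names invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDriverSketch {

    // Hypothetical stand-in for the SaxHandler above: collects the text of every <name> element.
    static class NameCollector extends DefaultHandler {
        final List<String> names = new ArrayList<>();
        private StringBuilder text;   // non-null only while inside a <name> element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if ("name".equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // characters() may be called several times for one text node, so append
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName)) {
                names.add(text.toString().trim());
                text = null;
            }
        }
    }

    /** Parses the XML text and returns the collected names in document order. */
    static List<String> parseNames(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        NameCollector handler = new NameCollector();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<staff><employee><name>Carl Cracker</name></employee>"
                   + "<employee><name>Harry Hacker</name></employee></staff>";
        System.out.println(parseNames(xml));
    }
}
```

Buffering the text in a StringBuilder and reading it in endElement avoids relying on characters() delivering a whole text node in one call.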

StAX parser:

class StaxXmlParser {
    private List<Employee> employeeList;
    private Employee currentEmployee;
    private String tagContent;
    private String attrContent;
    private XMLStreamReader reader;
    public StaxXmlParser(String filename) {
        employeeList = null;
        currentEmployee = null;
        tagContent = null;

        try {
            XMLInputFactory factory = XMLInputFactory.newFactory();
            reader = factory.createXMLStreamReader(new FileInputStream(new File(filename)));
            parseEmployee();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public List<Employee> parseEmployee() throws XMLStreamException {
        while (reader.hasNext()) {
            int event = reader.next();
            switch (event) {
                case XMLStreamConstants.START_ELEMENT:
                    if ("employee".equals(reader.getLocalName())) {
                        currentEmployee = new Employee();
                    }
                    if ("staff".equals(reader.getLocalName())) {
                        employeeList = new ArrayList<>();
                    }
                    if ("hiredate".equals(reader.getLocalName())) {
                        int yearAttr = Integer.parseInt(reader.getAttributeValue(null, "year"));
                        int monthAttr = Integer.parseInt(reader.getAttributeValue(null, "month"));
                        int dayAttr = Integer.parseInt(reader.getAttributeValue(null, "day"));

                        currentEmployee.setHireDay(yearAttr, monthAttr - 1, dayAttr);
                    }
                    break;

                case XMLStreamConstants.CHARACTERS:
                    tagContent = reader.getText().trim();
                    break;

                case XMLStreamConstants.ATTRIBUTE:
                    int count = reader.getAttributeCount();
                    for (int i = 0; i < count; i++) {
                        System.out.printf("count is: %d%n", count);
                    }
                    break;

                case XMLStreamConstants.END_ELEMENT:
                    switch (reader.getLocalName()) {
                        case "employee":
                            employeeList.add(currentEmployee);
                            break;
                        case "name":
                            currentEmployee.setName(tagContent);
                            break;
                        case "salary":
                            currentEmployee.setSalary(Double.parseDouble(tagContent));
                            break;
                    }
            }
        }
        return employeeList;
    }    
}

Here is a main() that tests them:

 public static void main(String[] args) {
    long startTime, elapsedTime;
    Main main = new Main();

    startTime = System.currentTimeMillis();
    main.testSaxParser();   // test
    elapsedTime = System.currentTimeMillis() - startTime;
    System.out.println(String.format("Parsing time is: %d ms%n", elapsedTime));

    startTime = System.currentTimeMillis();
    main.testStaxParser();  // test
    elapsedTime = System.currentTimeMillis() - startTime;
    System.out.println(String.format("Parsing time is: %d ms%n", elapsedTime));

    startTime = System.currentTimeMillis();
    main.testDomParser();  // test
    elapsedTime = System.currentTimeMillis() - startTime;
    System.out.println(String.format("Parsing time is: %d ms%n", elapsedTime));
}

Output:

Using SAX Parser:
-----------------
Employee { name=Carl Cracker, salary=75000.0, hireDay=Tue Dec 15 00:00:00 EET 1987 }
Employee { name=Harry Hacker, salary=50000.0, hireDay=Sun Oct 01 00:00:00 EET 1989 }
Employee { name=Tony Tester, salary=40000.0, hireDay=Thu Mar 15 00:00:00 EET 1990 }
Parsing time is: 106 ms

Using StAX Parser:
------------------
Employee { name=Carl Cracker, salary=75000.0, hireDay=Tue Dec 15 00:00:00 EET 1987 }
Employee { name=Harry Hacker, salary=50000.0, hireDay=Sun Oct 01 00:00:00 EET 1989 }
Employee { name=Tony Tester, salary=40000.0, hireDay=Thu Mar 15 00:00:00 EET 1990 }
Parsing time is: 5 ms

Using DOM Parser:
-----------------
Employee { name=Carl Cracker, salary=75000.0, hireDay=Tue Dec 15 00:00:00 EET 1987 }
Employee { name=Harry Hacker, salary=50000.0, hireDay=Sun Oct 01 00:00:00 EET 1989 }
Employee { name=Tony Tester, salary=40000.0, hireDay=Thu Mar 15 00:00:00 EET 1990 }
Parsing time is: 13 ms

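That gives you a sample of each of these variants.

There are other approaches in Java as well, for example JAXB: you need an XSD schema and generate classes from it. After that you can use unmarshal() to read the content of the XML file: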

public class JaxbDemo {
    public static void main(String[] args) {
        try {
            long startTime = System.currentTimeMillis();
            // create the JAXB context (the unmarshaller is created from it below)
            JAXBContext context = JAXBContext.newInstance(Staff.class.getPackage().getName());
            FileInputStream in = new FileInputStream(new File(Files.EMPLOYEE_XML.getFilename()));

            System.out.println("Output from employee XML file");
            Unmarshaller um = context.createUnmarshaller();
            Staff staff = (Staff) um.unmarshal(in);

            // print employee list
            for (Staff.Employee emp : staff.getEmployee()) {
                System.out.println(emp);
            }

            long elapsedTime = System.currentTimeMillis() - startTime;
            System.out.println(String.format("Parsing time is: %d ms%n", elapsedTime));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

I tried this approach, and the result was:

Employee { name='Carl Cracker', salary=75000, hiredate=1987-12-15 } }
Employee { name='Harry Hacker', salary=50000, hiredate=1989-10-1 } }
Employee { name='Tony Tester', salary=40000, hiredate=1990-3-15 } }
Parsing time is: 320 ms

I added another toString() method here, which prints the hire date in a different format.

Here are some links that may be of interest to you:

- catch23

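Thanks for the long, detailed answer! It's great that you compared the options, but I decided to award the bounty to the simplest solution here. Don't worry, I upvoted yours as well! - MikkoP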

3

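Parsing the DOM with recursion

With a DOM parser you can easily end up in nested for loops, as you have already pointed out. However, a DOM structure is represented by a Node that holds its children in a NodeList, and every item in that list is again a Node. That makes it a perfect candidate for recursive traversal.

Sample XML

To demonstrate what the DOM parser can do, I picked a hosted sample OpenWeatherMap XML, regardless of its size.

Search by city name in XML format

This XML contains the weather forecast for London at 3-hour intervals. It is a good case for reading a relatively large data set and extracting specific information from attributes of child elements.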

[Screenshot of the sample XML]

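In this snapshot, the elements we want to collect are the ones marked with arrows.

Code

We start by creating a custom class to hold the temperature and clouds values. We also override toString() in this class so that the records can be printed conveniently.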

ForeCast.java

public class ForeCast {

    /**
     * Overridden toString() to conveniently print the results
     */
    @Override
    public String toString() {
        return "The minimum temperature is: " + getTemperature()
                + " and the weather overall: " + getClouds();
    }

    public String getTemperature() {
        return temperature;
    }

    public void setTemperature(String temperature) {
        this.temperature = temperature;
    }

    public String getClouds() {
        return clouds;
    }

    public void setClouds(String clouds) {
        this.clouds = clouds;
    }

    private String temperature;
    private String clouds;
}

Now for the main class. In the main class, where the recursion happens, we traverse the whole XML to build a List of ForeCast objects, each holding one temperature and clouds record.

// List that holds all the data parsed from the XML,
// in the format defined by the custom type 'ForeCast'
private static List<ForeCast> forecastList = new ArrayList<>();

In the XML, both temperature and clouds have time as their parent element, so logically we check for the time element.

/**
 * Logical block
 */
// As per the XML syntax our 2 fields temperature and clouds come
// directly under the Node/Element time
if (node.getNodeName().equals("time")
        && node.getNodeType() == Node.ELEMENT_NODE) {
    // Instantiate our custom forecast object
    forecastObj = new ForeCast();
    Element timeElement = (Element) node;

Next, we get hold of the temperature and clouds elements, whose values can be set on the ForeCast object.

    // Get the temperature element by its tag name within the XML (0th
    // index known)
    Element tempElement = (Element) timeElement.getElementsByTagName("temperature").item(0);
    // Minimum temperature value is selectively picked (for proof of concept)
    forecastObj.setTemperature(tempElement.getAttribute("min"));

    // Similarly get the clouds element
    Element cloudElement = (Element) timeElement.getElementsByTagName("clouds").item(0);
    forecastObj.setClouds(cloudElement.getAttribute("value"));

Here is the complete class:

CustomDomXmlParser.java

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class CustomDomXmlParser {

    // List that holds all the data parsed from the XML,
    // in the format defined by the custom type 'ForeCast'
    private static List<ForeCast> forecastList = new ArrayList<>();

    public static void main(String[] args) throws ParserConfigurationException,
            SAXException, IOException {
        // Read XML through a URL (a FileInputStream can be used to pick up an
        // XML file from the file system)
        InputStream path = new URL(
                "http://api.openweathermap.org/data/2.5/forecast?q=London,us&mode=xml")
                .openStream();

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(path);

        // Call to the recursive method with the parent node
        traverse(document.getDocumentElement());

        // Print the List values collected within the recursive method
        for (ForeCast forecastObj : forecastList)
            System.out.println(forecastObj);

    }

    /**
     * 
     * @param node
     */
    public static void traverse(Node node) {
        // Get the list of Child Nodes immediate to the current node
        NodeList list = node.getChildNodes();

        // Declare our local instance of forecast object
        ForeCast forecastObj = null;

        /**
         * Logical block
         */
        // As per the XML syntax our 2 fields temperature and clouds come
        // directly under the Node/Element time
        if (node.getNodeName().equals("time")
                && node.getNodeType() == Node.ELEMENT_NODE) {

            // Instantiate our custom forecast object
            forecastObj = new ForeCast();
            Element timeElement = (Element) node;

            // Get the temperature element by its tag name within the XML (0th
            // index known)
            Element tempElement = (Element) timeElement.getElementsByTagName(
                    "temperature").item(0);
            // Minimum temperature value is selectively picked (for proof of
            // concept)
            forecastObj.setTemperature(tempElement.getAttribute("min"));

            // Similarly get the clouds element
            Element cloudElement = (Element) timeElement.getElementsByTagName(
                    "clouds").item(0);
            forecastObj.setClouds(cloudElement.getAttribute("value"));
        }

        // Add our foreCastObj if initialized within this recursion, that is if
        // it traverses the time node within the XML, and not in any other case
        if (forecastObj != null)
            forecastList.add(forecastObj);

        /**
         * Recursion block
         */
        // Iterate over the next child nodes
        for (int i = 0; i < list.getLength(); i++) {
            Node currentNode = list.item(i);
            // Recursively invoke the method for the current node
            traverse(currentNode);

        }

    }
}

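Output

As the screenshot below shows, we were able to group these two specific elements and assign their values to a Java collection instance. We delegated the complex parsing of the XML to a generic recursive solution and customized mainly the logical block. As mentioned, this is a generic solution that needs minimal customization and works on any valid XML.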

[Screenshot of the program output]

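Alternatives

Many other alternatives are available; here is a list of open-source XML parsers for Java.

However, your PHP approach and your initial attempt with a Java parser line up with the DOM parser solution, simplified by the use of recursion.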
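One alternative that stays within the JDK is to query the DOM with XPath instead of walking it. The following is only a sketch under assumptions: the inline XML in main is a made-up, cut-down imitation of the forecast structure (time elements with temperature and clouds children), and the class and method names are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathForecastSketch {

    /** Returns one "min / clouds" string per <time> element, in document order. */
    static List<String> readForecasts(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every <time> element anywhere in the document
        NodeList times = (NodeList) xpath.evaluate("//time", document, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < times.getLength(); i++) {
            Element time = (Element) times.item(i);
            // Relative expressions are evaluated against the current <time> element
            String min = xpath.evaluate("temperature/@min", time);
            String clouds = xpath.evaluate("clouds/@value", time);
            result.add(min + " / " + clouds);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data shaped like the forecast XML
        String xml = "<weatherdata><forecast>"
                   + "<time><temperature min='280.1'/><clouds value='few clouds'/></time>"
                   + "<time><temperature min='281.5'/><clouds value='overcast clouds'/></time>"
                   + "</forecast></weatherdata>";
        for (String line : readForecasts(xml)) {
            System.out.println(line);
        }
    }
}
```

Compared with the recursive traversal, the path expressions state which elements are wanted, so the only explicit loop left is the one over the matching time elements.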

- StoopidDonut

Thanks for the suggestion. I decided not to use recursion, though, since there seem to be simpler alternatives. - MikkoP

0

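I would not recommend implementing the XML parsing functions yourself, since there are already plenty of options available. My suggestion is a DOM parser. You can find some examples at the following link. (You can also choose from the other available options.)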

http://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html

You can read an attribute with a call such as

eElement.getAttribute("id");

Source: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
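Applied to the structure in the question, a namespace-aware version might look like the sketch below. It rests on several assumptions: the wfs and om namespace URIs are placeholders (the real values must be copied from the xmlns declarations in the actual file), the root element name and sample data in main are invented, and only the xlink URI is the standard one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceAttributeSketch {

    // Placeholder namespace URIs, NOT the real ones: replace them with the
    // values of the xmlns:wfs and xmlns:om declarations in the actual document.
    static final String WFS = "urn:example:wfs";
    static final String OM = "urn:example:om";
    // Standard XLink namespace
    static final String XLINK = "http://www.w3.org/1999/xlink";

    /** Returns the xlink:href of the first om:observedProperty inside each wfs:member. */
    static List<String> observedProperties(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);   // required for the *NS methods to match
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        List<String> hrefs = new ArrayList<>();
        NodeList members = doc.getElementsByTagNameNS(WFS, "member");
        for (int i = 0; i < members.getLength(); i++) {
            Element member = (Element) members.item(i);
            // Searches all descendants of this member, so no loop per nesting level is needed
            Element property = (Element) member.getElementsByTagNameNS(OM, "observedProperty").item(0);
            if (property != null) {
                hrefs.add(property.getAttributeNS(XLINK, "href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample shaped like the structure described in the question
        String xml = "<wfs:FeatureCollection xmlns:wfs='urn:example:wfs' xmlns:omso='urn:example:omso'"
                   + " xmlns:om='urn:example:om' xmlns:xlink='http://www.w3.org/1999/xlink'>"
                   + "<wfs:member><omso:PointTimeSeriesObservation>"
                   + "<om:observedProperty xlink:href='http://example.org/temperature'/>"
                   + "</omso:PointTimeSeriesObservation></wfs:member>"
                   + "</wfs:FeatureCollection>";
        System.out.println(observedProperties(xml));
    }
}
```

Because getElementsByTagNameNS searches the whole subtree of the element it is called on, the intermediate omso:PointTimeSeriesObservation level does not need its own loop.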

- iordanis

0

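I agree with what has already been posted about not implementing the parsing functions yourself.

However, instead of a DOM/SAX/StAX parser, I would suggest using an external library such as JDOM or XOM.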