Java: Removing CDATA tags from XML

Question

8

XPath is very convenient for parsing XML files, but it does not parse the data inside CDATA tags:

<![CDATA[ Some Text <p>more text and tags</p>... ]]>

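My solution: first read the XML content, then remove

"<![CDATA["  and  "]]>".

After that I would run XPath over the XML file "to reach everything". Is there a better solution? If not, how can this be done with a regular expression?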

- SandyBr

1

Removing the CDATA markers may make your XML invalid (and possibly useless for processing). - Amol Katdare

1

Regular expressions and XML do not mix. Please read https://dev59.com/X3I-5IYBdhLWcg3wq6do. - Jim Garrison

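So what is the solution for getting the title, description and publication time from an RSS XML file while also getting the CDATA content? What I actually need is the image links inside the CDATA. - SandyBr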

5 Answers

2

To get rid of the CDATA markers and keep the tags, you can use XSLT.

Given this XML input:

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
    <child>Here is some text.</child>
    <child><![CDATA[Here is more text <p>with tags</p>.]]></child>
</root>

Using this XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*" />
            <xsl:value-of select="text()" disable-output-escaping="yes"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

will return the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <child>Here is some text.</child>
   <child>Here is more text <p>with tags</p>.</child>
</root>

(Tested with Saxon HE 9.3.0.5 in oXygen 12.2.)

You can then use XPath to extract the content of the p element:

/root/child/p
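
From Java, this takes two steps, because disable-output-escaping only takes effect when the result is serialized: transform to a string, then parse that string again before running XPath. Below is a minimal sketch using the JDK's built-in XSLT 1.0 processor. The class name `CdataUnwrap`, its method names, and the simplified stylesheet are assumptions made for illustration (it is not the XSLT 2.0 stylesheet shown above), and it only works when the CDATA content is well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataUnwrap {
    // Illustrative XSLT 1.0 stylesheet (an assumption, not the one above):
    // copies elements and attributes, writes text nodes without escaping.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='*'><xsl:copy>"
      + "<xsl:copy-of select='@*'/><xsl:apply-templates select='node()'/>"
      + "</xsl:copy></xsl:template>"
      + "<xsl:template match='text()'>"
      + "<xsl:value-of select='.' disable-output-escaping='yes'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Step 1: run the stylesheet and serialize the result to a string.
    static String unwrap(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    // Step 2: parse the serialized result again and evaluate XPath on it.
    static String evaluate(String xml, String xpath) {
        String unwrapped = unwrap(xml);
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(unwrapped)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalStateException("transformed result could not be queried", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><child>Here is some text.</child>"
            + "<child><![CDATA[Here is more text <p>with tags</p>.]]></child></root>";
        System.out.println(evaluate(xml, "/root/child/p"));
    }
}
```

Serializing to a string between the two steps is deliberate: if the result were written to a DOM tree instead, the unescaped text would most likely stay a single text node and the p element would never appear.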

- james.garriss

1

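I needed to accomplish the same task. I solved it with two XSLTs.

Let me stress that this only works if the CDATA content is well-formed XML.

For completeness, let me add a root element to your example XML: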

<root>
   <well-formed-content><![CDATA[ Some Text <p>more text and tags</p>]]>
   </well-formed-content>
</root>

Figure 1. - Starting XML

Step 1

In the first transformation step, I wrap all text nodes in a newly introduced XML element, old_text:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
    encoding="UTF-8" standalone="yes" />

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()|processing-instruction()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!-- Text-nodes: Wrap them in a new node without escaping it. -->
    <!-- (note precondition: CDATA should be valid xml.           -->
    <xsl:template match="text()">
        <xsl:element name="old_text">
            <xsl:value-of select="." disable-output-escaping="yes" />
        </xsl:element>
    </xsl:template>

</xsl:stylesheet>

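Figure 2. - First XSLT (wraps the CDATA content in an "old_text" element)

If you apply this transformation to the starting XML, you get the following result (I am not reformatting it, to avoid confusion about which step does what):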

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text>
    </old_text><well-formed-content><old_text> Some Text <p>more text and tags</p>
    </old_text></well-formed-content><old_text>
</old_text></root>

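Figure 3. - Transformed XML (first step)

Step 2

Now you need to clean up the introduced old_text elements and re-escape the text that did not produce new nodes: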

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
    encoding="UTF-8" standalone="yes" />

    <!-- Element-nodes: Process nodes and their children -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!--
        'Wrapper'-node: remove the wrapper element but process its children.
        With this matcher, the "old_text" is cleaned, but the originally CDATA
        well-formed nodes surface in the resulting xml.
    -->
    <xsl:template match="old_text">
        <xsl:apply-templates select="*|text()" />
    </xsl:template>

    <!--
        Text-nodes: Text here comes from original CDATA and must be now
        escaped. Note that the previous rule has extracted all the existing
        nodes in the CDATA. -->
    <xsl:template match="text()">
        <xsl:value-of select="." disable-output-escaping="no" />
    </xsl:template>

</xsl:stylesheet>

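Figure 4. - Second XSLT (cleans up the artificially introduced elements)

Result

This is the final result, with the nodes that were originally inside CDATA expanded in your new XML file: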

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root>
    <well-formed-content> Some Text <p>more text and tags</p>
    </well-formed-content>
</root>

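Figure 5. - Final XML

Caveats

If your CDATA contains HTML character entities that XML does not predefine (see, for example, the Wikipedia article on character entities), you need to declare those entities in your intermediate XML. Let me give an example: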

<root>
    <well-formed-content>
        <![CDATA[ Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.]]>
    </well-formed-content>
</root>

Figure 6. - The XML from Figure 1 with a character entity added

The original XSLT from Figure 2 will transform this XML into the following:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text>
    </old_text><well-formed-content><old_text>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.
    </old_text></well-formed-content><old_text>
</old_text></root>

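Figure 7. - Result of attempting to transform the XML in Figure 6 (not well-formed!)

The problem with this file is that it is not well-formed, so it cannot be processed any further by an XSLT processor:

The entity "nbsp" was referenced, but not declared.
XML checking finished.

Figure 8. - Result of the well-formedness check on the XML in Figure 7

The workaround is simple (the match="/" template adds the declaration of the nbsp entity):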

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
                encoding="UTF-8" standalone="yes" />

    <!-- Add an html entity to the xml character entities declaration. -->
    <xsl:template match="/">
        <xsl:text disable-output-escaping="yes"><![CDATA[<!DOCTYPE root
[
    <!ENTITY nbsp "&#160;">
]>
]]>
        </xsl:text>
        <xsl:apply-templates select="*" />
    </xsl:template>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()|processing-instruction()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!-- Text-nodes: Wrap them in a new node without escaping it. -->
    <!-- (note precondition: CDATA should be valid xml.           -->
    <xsl:template match="text()">
        <xsl:element name="old_text">
            <xsl:value-of select="." disable-output-escaping="yes" />
        </xsl:element>
    </xsl:template>

</xsl:stylesheet>

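Figure 9. - XSLT that creates the entity declaration

Now, after applying this XSLT to the source XML from Figure 6, this is the intermediate XML: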

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><!DOCTYPE root
[
    <!ENTITY nbsp "&#160;">
]>

        <root><old_text>
    </old_text><well-formed-content><old_text>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.
    </old_text></well-formed-content><old_text>
</old_text></root>

Figure 10. - Intermediate XML (the XML from Figure 3 plus the entity declaration)

You can use the XSLT from Figure 4 to produce the final XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root>
    <well-formed-content>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop: .
    </well-formed-content>
</root>

Figure 11. - Final XML, with the HTML entity converted to UTF-8
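
The effect of the entity declaration can also be checked directly from Java. The following is a small sketch, assuming the JDK's default DOM parser with its default settings (which accept a DOCTYPE); the class name `EntityCheck` and the sample strings are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class EntityCheck {
    // Returns the text of the root element, or null when the input is not
    // well-formed (for example because it references an undeclared entity).
    static String rootTextOrNull(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null; // the parser rejected the document
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String undeclared = "<root>stop:&nbsp;.</root>";
        String declared = "<!DOCTYPE root [<!ENTITY nbsp \"&#160;\">]>"
            + "<root>stop:&nbsp;.</root>";
        System.out.println(rootTextOrNull(undeclared) == null);            // rejected
        System.out.println("stop:\u00A0.".equals(rootTextOrNull(declared))); // expanded
    }
}
```

The default error handler may also print a "[Fatal Error]" line to standard error for the first input; that is expected here.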

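Notes

In these examples I used the XSLT processor built into NetBeans 7.1.2 (com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl, the default JRE XSLT processor).

Disclaimer: I am not an XML expert. I have the feeling this should be easier...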

- Alberto

1

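You can remove the CDATA markers from the XML with regular expressions, by deleting just the delimiters.

For example:

String s = "<sn><![CDATA[poctest]]></sn>";
// Match the complete opening and closing delimiters, brackets escaped
s = s.replaceAll("<!\\[CDATA\\[", "");
s = s.replaceAll("\\]\\]>", "");

The result will be:

<sn>poctest</sn>

Please check whether this solves your problem.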

- mukesh.stackOverflow

0

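Try this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String removeCDATA(String text) {
    StringBuilder result = new StringBuilder();
    // First alternative: empty match wherever "<![CDATA[" does not
    // immediately precede. Second alternative: greedy match up to the
    // last word character that is followed by a non-word character.
    Pattern regex = Pattern.compile("(?<!(<!\\[CDATA\\[))|((.*)\\w+\\W)");
    Matcher regexMatcher = regex.matcher(text);
    while (regexMatcher.find()) {
        result.append(regexMatcher.group());
    }
    return result.toString();
}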

When I call this method with your test input <![CDATA[ Some Text <p>more text and tags</p>... ]]>, it returns Some Text <p>more text and tags</p>

But I think an approach without regular expressions would be more reliable. Something like this:

public static String removeCDATA(String s) {
    s = s.trim();
    if (s.startsWith("<![CDATA[")) {
        s = s.substring(9); // 9 is the length of "<![CDATA["
        int i = s.indexOf("]]>");
        if (i == -1) throw new IllegalStateException("argument starts with <![CDATA[ but cannot find pairing ]]>");
        s = s.substring(0, i);
    }
    return s;
}

- thomas.adamjak

- Paŭlo Ebermann · Accepted Answer

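The point of a CDATA section is that everything inside it is plain text, with nothing that should be interpreted directly as XML. The fragment from the question could equally be written as: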

 Some Text &lt;p&gt;more text and tags&lt;/p&gt;...

(with a space before and after).
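
This equivalence can be checked with the JDK's DOM parser. A small sketch follows; the class name `CdataEquivalence` and the wrapper element `d` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

final class CdataEquivalence {
    // Parses a small XML string and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String cdata   = "<d><![CDATA[ Some Text <p>more text and tags</p>... ]]></d>";
        String escaped = "<d> Some Text &lt;p&gt;more text and tags&lt;/p&gt;... </d>";
        // Both spellings yield the same character data.
        System.out.println(rootText(cdata).equals(rootText(escaped))); // prints true
    }
}
```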

If you really do want to interpret it as XML, extract the text from your document and feed it to an XML parser again.
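
A minimal sketch of that idea in Java, applied to the image-link use case from the comments. The class name `CdataReparse`, the `wrapper` element and the img/src structure inside the CDATA are assumptions for illustration, and the second parse only succeeds if the CDATA content is well-formed XML (typical HTML with unclosed tags or entities such as &nbsp; would be rejected).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CdataReparse {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("not well-formed XML", e);
        }
    }

    // Returns the src attribute of every img element found in the text
    // selected by textXPath, after parsing that text as XML.
    static List<String> imageLinks(String feedXml, String textXPath) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // XPath sees the CDATA content as ordinary text.
            String text = xpath.evaluate(textXPath, parse(feedXml));
            // The text may hold several top-level nodes, so wrap it in a
            // synthetic root element before parsing it a second time.
            Document inner = parse("<wrapper>" + text + "</wrapper>");
            NodeList srcs = (NodeList) xpath.evaluate(
                "//img/@src", inner, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < srcs.getLength(); i++) {
                links.add(srcs.item(i).getNodeValue());
            }
            return links;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath expression", e);
        }
    }
}
```

Calling `CdataReparse.imageLinks(feed, "/rss/channel/item/description")` reads the first matching description; title and publication time can be read from the outer document with ordinary XPath, since only the CDATA content needs the second parse.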