Dealing with invalid XML hexadecimal characters

22

I am trying to send an XML document over the wire, but I receive the following exception:

"MY LONG EMAIL STRING" was specified for the 'Body' element. ---> System.ArgumentException: '', hexadecimal value 0x02, is an invalid character.
   at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
   at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
   at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
   at System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
   at System.Xml.XmlRawWriter.WriteValue(String value)
   at System.Xml.XmlWellFormedWriter.WriteValue(String value)
   at Microsoft.Exchange.WebServices.Data.EwsServiceXmlWriter.WriteValue(String value, String name)
   --- End of inner exception stack trace ---

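The string comes from an email, so I have no control over what I am asked to send. How can I encode my string so that it is valid XML while keeping the illegal characters?

I would like to preserve the original characters one way or another.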


2
It depends on whether the illegal characters are ones like x0 that XML cannot represent at all, or ones like < that merely need escaping. - Michael Kay
8 Answers

25
The following code removes characters that are invalid in XML from a string and returns a new string without them:
public static string CleanInvalidXmlChars(string text) 
{ 
     // From xml spec valid chars: 
     // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]     
     // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. 
     string re = @"[^\x09\x0A\x0D\x20-\xD7FF\xE000-\xFFFD\x10000-x10FFFF]"; 
     return Regex.Replace(text, re, ""); 
}

6
This code does not work correctly, because the final x10FFFF is not escaped. See this answer for a better regular expression: https://dev59.com/6HRC5IYBdhLWcg3wJNif#961504 - Sean Clifford
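
To make the objection in the comment above concrete: in a .NET regular expression, \x consumes exactly two hex digits, and strings are UTF-16, so code points above U+FFFF only ever appear as surrogate pairs. The following is a minimal sketch of a pattern that accounts for both points. It is not the regex from the linked answer; the class name XmlRegexCleaner is made up for illustration, and it assumes that well-formed surrogate pairs should be kept while lone surrogates are dropped.

```csharp
using System.Text.RegularExpressions;

public static class XmlRegexCleaner
{
    // Matches every UTF-16 code unit that cannot appear in an XML 1.0 document.
    private static readonly Regex InvalidXmlChars = new Regex(
        @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]" +   // low surrogate with no high surrogate before it
        @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])" +   // high surrogate with no low surrogate after it
        @"|[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]", // forbidden controls, U+FFFE, U+FFFF
        RegexOptions.Compiled);

    // Returns the input unchanged when it is null or empty; otherwise strips the invalid code units.
    public static string CleanInvalidXmlChars(string text)
    {
        return string.IsNullOrEmpty(text) ? text : InvalidXmlChars.Replace(text, "");
    }
}
```

Tab, line feed, and carriage return are left alone because the XML specification allows them.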

16
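// Encode as UTF-8 rather than ASCII: ASCIIEncoding replaces every non-ASCII character with '?'.
byte[] toEncodeAsBytes = System.Text.Encoding.UTF8.GetBytes(toEncode);
string returnValue = System.Convert.ToBase64String(toEncodeAsBytes);

This is one way to do it.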
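
Base64 only preserves the original characters if the receiving side decodes the payload again. A minimal round-trip sketch, assuming both ends agree on UTF-8 as the byte encoding (the class and method names are hypothetical):

```csharp
using System;
using System.Text;

public static class Base64Payload
{
    // Turns arbitrary text, control characters included, into XML-safe Base64.
    public static string Encode(string text)
    {
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
    }

    // Reverses Encode on the receiving side.
    public static string Decode(string base64)
    {
        return Encoding.UTF8.GetString(Convert.FromBase64String(base64));
    }
}
```

The trade-off is that the element content is no longer human-readable, and the receiver has to know that the field is Base64. Unpaired surrogates would not survive the UTF-8 step.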


10

Another way to remove invalid XML characters in C# is to use the XmlConvert.IsXmlChar method (available since .NET Framework 4.0).

public static string RemoveInvalidXmlChars(string content)
{
   return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}

.Net Fiddle - https://dotnetfiddle.net/v1TNus

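For example, the vertical tab character (\v) is not valid in XML: it is valid UTF-8 but not valid XML 1.0, and many libraries (including libxml2) miss it and silently output invalid XML.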
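
One caveat with filtering char by char: as far as I know, XmlConvert.IsXmlChar rejects every surrogate code unit, so characters outside the Basic Multilingual Plane (emoji, for instance) are removed along with the truly invalid ones. The sketch below is a variant that keeps well-formed surrogate pairs. The class and method names are hypothetical and not part of the answer above.

```csharp
using System.Text;
using System.Xml;

public static class XmlCharFilter
{
    // Removes invalid XML 1.0 characters but keeps valid surrogate pairs intact.
    public static string RemoveInvalidXmlCharsKeepingPairs(string content)
    {
        if (string.IsNullOrEmpty(content)) return content;

        var sb = new StringBuilder(content.Length);
        for (int i = 0; i < content.Length; i++)
        {
            char c = content[i];
            if (char.IsHighSurrogate(c) && i + 1 < content.Length && char.IsLowSurrogate(content[i + 1]))
            {
                // A high surrogate followed by a low surrogate encodes U+10000..U+10FFFF, which XML allows.
                sb.Append(c).Append(content[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            else if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```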


7

What worked for me:

XmlWriterSettings xmlWriterSettings = new XmlWriterSettings { Encoding = Encoding.UTF8, CheckCharacters = false };

1
Setting "CheckCharacters = true" in the settings helped me a lot. Thanks! - RobSiklos
Where do I put this? [https://stackoverflow.com/questions/57287428/hexadecimal-value-0x02-is-an-invalid-character-while-exporting-an-xml-file] - user10888875
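
For readers wondering where the setting goes: it is passed to XmlWriter.Create, and the reading side needs the matching XmlReaderSettings flag. With checking turned off, the writer is expected to emit an otherwise invalid character as a numeric character reference instead of throwing. The following is a minimal sketch of that round trip; the class name and the "Body" element are chosen for illustration, and the output is strictly speaking not well-formed XML 1.0, so other parsers may reject it.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class UncheckedXmlRoundTrip
{
    // Writes <Body>...</Body> without validating the characters in the text.
    public static string Write(string body)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true, CheckCharacters = false };
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteElementString("Body", body);
        }
        return sb.ToString();
    }

    // Reads the element text back; the reader must also have character checking disabled.
    public static string Read(string xml)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }
}
```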

6
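The following solution removes any invalid XML characters, and I believe it does so about as efficiently as possible. In particular, it does not allocate a new StringBuilder or a new string until it has determined that the string actually contains an invalid character. The hot path is therefore a single for loop over the characters, and the check usually amounts to no more than two greater-than/less-than comparisons per char. If no invalid character is found, the original string is returned as-is. This helps a great deal when the vast majority of strings are fine to begin with: they pass through quickly, with no wasted allocations.

- Update -

See below for how to write an XElement that contains these invalid characters directly, although that approach also uses this code -

Parts of this code were influenced by Mr. Tom Bogle's solution. See also the helpful information in superlogical's post on the same thread. All of those, however, always instantiate a new StringBuilder and string.

Usage: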

    string xmlStrBack = XML.ToValidXmlCharactersString("any string");

Test:

    public static void TestXmlCleanser()
    {
        string badString = "My name is Inigo Mont\u001Eoya"; // the bad char U+001E sits between 'Mont' and 'oya'; written as an escape so it is visible
        string goodString = "My name is Inigo Montoya!";

        string back1 = XML.ToValidXmlCharactersString(badString); // fixes it
        string back2 = XML.ToValidXmlCharactersString(goodString); // returns same string

        XElement x1 = new XElement("test", back1);
        XElement x2 = new XElement("test", back2);
        XElement x3WithBadString = new XElement("test", badString);

        string xml1 = x1.ToString();
        string xml2 = x2.ToString();

        string xmlShouldFail = x3WithBadString.ToString(); // expected to throw: the default writer rejects the invalid char
    }

// --- Code --- (I keep these methods in a static utility class named XML)

    /// <summary>
    /// Determines if any invalid XML 1.0 characters exist within the string,
    /// and if so it returns a new string with the invalid chars removed, else 
    /// the same string is returned (with no wasted StringBuilder allocated, etc).
    /// </summary>
    /// <param name="s">Xml string.</param>
    /// <param name="startIndex">The index to begin checking at.</param>
    public static string ToValidXmlCharactersString(string s, int startIndex = 0)
    {
        int firstInvalidChar = IndexOfFirstInvalidXMLChar(s, startIndex);
        if (firstInvalidChar < 0)
            return s;

        startIndex = firstInvalidChar;

        int len = s.Length;
        var sb = new StringBuilder(len);

        if (startIndex > 0)
            sb.Append(s, 0, startIndex);

        for (int i = startIndex; i < len; i++)
            if (IsLegalXmlChar(s[i]))
                sb.Append(s[i]);

        return sb.ToString();
    }

    /// <summary>
    /// Gets the index of the first invalid XML 1.0 character in this string, else returns -1.
    /// </summary>
    /// <param name="s">Xml string.</param>
    /// <param name="startIndex">Start index.</param>
    public static int IndexOfFirstInvalidXMLChar(string s, int startIndex = 0)
    {
        if (s != null && s.Length > 0 && startIndex < s.Length) {

            if (startIndex < 0) startIndex = 0;
            int len = s.Length;

            for (int i = startIndex; i < len; i++)
                if (!IsLegalXmlChar(s[i]))
                    return i;
        }
        return -1;
    }

    /// <summary>
    /// Indicates whether a given character is valid according to the XML 1.0 spec.
    /// This code represents an optimized version of Tom Bogle's on SO: 
    /// https://stackoverflow.com/a/13039301/264031.
    /// </summary>
    public static bool IsLegalXmlChar(char c)
    {
        if (c > 31 && c <= 55295)
            return true;
        if (c < 32)
            return c == 9 || c == 10 || c == 13;
        return (c >= 57344 && c <= 65533) || c > 65535;
        // final comparison is useful only for integral comparison, if char c -> int c, useful for utf-32 I suppose
        //c <= 1114111 */ // impossible to get a code point bigger than 1114111 because Char.ConvertToUtf32 would have thrown an exception
    }

======== ======== ========

Using XElement.ToString directly

======== ======== ========

First, here is how the extension method is used:

string result = xelem.ToStringIgnoreInvalidChars();

-- Fuller test --
    public static void TestXmlCleanser()
    {
        string badString = "My name is Inigo Mont\u001Eoya"; // the bad char U+001E sits between 'Mont' and 'oya'; written as an escape so it is visible

        XElement x = new XElement("test", badString);

        string xml1 = x.ToStringIgnoreInvalidChars();                               
        //result: <test>My name is Inigo Montoya</test>

        string xml2 = x.ToStringIgnoreInvalidChars(deleteInvalidChars: false);
        //result: <test>My name is Inigo Mont&#x1E;oya</test>
    }

--- Code ---

    /// <summary>
    /// Writes this XML to string while allowing invalid XML chars to either be
    /// simply removed during the write process, or else encoded into entities, 
    /// instead of having an exception occur, as the standard XmlWriter.Create 
    /// XmlWriter does (which is the default writer used by XElement).
    /// </summary>
    /// <param name="xml">XElement.</param>
    /// <param name="deleteInvalidChars">True to have any invalid chars deleted, else they will be entity encoded.</param>
    /// <param name="indent">Indent setting.</param>
    /// <param name="indentChar">Indent char (leave null to use default)</param>
    public static string ToStringIgnoreInvalidChars(this XElement xml, bool deleteInvalidChars = true, bool indent = true, char? indentChar = null)
    {
        if (xml == null) return null;

        StringWriter swriter = new StringWriter();
        using (XmlTextWriterIgnoreInvalidChars writer = new XmlTextWriterIgnoreInvalidChars(swriter, deleteInvalidChars)) {

            // -- settings --
            // unfortunately writer.Settings cannot be set, is null, so we can't specify: bool newLineOnAttributes, bool omitXmlDeclaration
            writer.Formatting = indent ? Formatting.Indented : Formatting.None;

            if (indentChar != null)
                writer.IndentChar = (char)indentChar;

            // -- write --
            xml.WriteTo(writer); 
        }

        return swriter.ToString();
    }

-- This uses the following custom XmlTextWriter --

public class XmlTextWriterIgnoreInvalidChars : XmlTextWriter
{
    public bool DeleteInvalidChars { get; set; }

    public XmlTextWriterIgnoreInvalidChars(TextWriter w, bool deleteInvalidChars = true) : base(w)
    {
        DeleteInvalidChars = deleteInvalidChars;
    }

    public override void WriteString(string text)
    {
        if (text != null && DeleteInvalidChars)
            text = XML.ToValidXmlCharactersString(text);
        base.WriteString(text);
    }
}

4
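I am on the receiving end of @parapurarajkumar's solution, where the illegal characters load into XmlDocument without complaint but break XmlWriter when I try to save the output.

My context: I am looking at exception/error logs from a website using Elmah. Elmah returns the state of the server at the time of the exception as one large XML document. For our reporting engine, I pretty-print the XML with XmlWriter.

During a website attack, I noticed that some of the XML was not parsing, and I received this exception: '.', hexadecimal value 0x00, is an invalid character.

Non-resolution: I converted the document to a byte[] and scanned it for 0x00, but found none.

When I scanned the XML document itself, I found the following: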
...
<form>
...
<item name="SomeField">
   <value
     string="C:\boot.ini&#x0;.htm" />
 </item>
...

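There was the nul byte, encoded as the HTML entity &#x0; !!!

Resolution: To fix the encoding, I replaced the &#x0; value before loading the text into my XmlDocument, because loading it creates the nul byte, which is then difficult to sanitize out of the object. Here is my entire process: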

XmlDocument xml = new XmlDocument();
details.Xml = details.Xml.Replace("&#x0;", "[0x00]");  // in my case I wanted to see it, otherwise just replace with ""
xml.LoadXml(details.Xml);

string formattedXml = null;

// I stuff this all in a helper function, but put it in-line for this example
StringBuilder sb = new StringBuilder();
XmlWriterSettings settings = new XmlWriterSettings {
    OmitXmlDeclaration = true,
    Indent = true,
    IndentChars = "\t",
    NewLineHandling = NewLineHandling.None,
};
using (XmlWriter writer = XmlWriter.Create(sb, settings)) {
    xml.Save(writer);
    formattedXml = sb.ToString();
}

Lesson learned: if your incoming data is already HTML-encoded, sanitize illegal bytes by targeting the corresponding HTML entity.
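
The same idea can be generalized beyond &#x0;. The sketch below scans the raw XML text for decimal and hexadecimal numeric character references and replaces only those whose code point falls outside the XML 1.0 Char range. The class name, the method name, and the [0x..] marker format are assumptions made for illustration; replace the marker with an empty string if the reference should simply disappear.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class XmlEntitySanitizer
{
    // Matches decimal (&#0;) and hexadecimal (&#x0;) numeric character references.
    private static readonly Regex NumericRef = new Regex(
        @"&#(?:[xX](?<hex>[0-9A-Fa-f]+)|(?<dec>[0-9]+));", RegexOptions.Compiled);

    // Replaces references to code points XML 1.0 forbids with a visible marker such as [0x00].
    public static string ReplaceInvalidNumericRefs(string xml)
    {
        return NumericRef.Replace(xml, m =>
        {
            long codePoint;
            bool parsed;
            if (m.Groups["hex"].Success)
                parsed = long.TryParse(m.Groups["hex"].Value, NumberStyles.AllowHexSpecifier,
                                       CultureInfo.InvariantCulture, out codePoint);
            else
                parsed = long.TryParse(m.Groups["dec"].Value, NumberStyles.None,
                                       CultureInfo.InvariantCulture, out codePoint);

            if (parsed && IsXmlCodePoint(codePoint))
                return m.Value; // legal reference: leave it untouched

            // Unparseable (overflowing) references are dropped; invalid ones become a marker.
            return parsed ? "[0x" + codePoint.ToString("X2") + "]" : "";
        });
    }

    // The Char production from the XML 1.0 specification.
    private static bool IsXmlCodePoint(long cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

This runs on the text before XmlDocument.LoadXml is called, for the same reason given above: once the document is loaded, the reference has already become a raw character.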


1
There is a generic solution that works nicely:

public class XmlTextTransformWriter : System.Xml.XmlTextWriter
{
    public XmlTextTransformWriter(System.IO.TextWriter w) : base(w) { }
    public XmlTextTransformWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { }
    public XmlTextTransformWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { }

    public Func<string, string> TextTransform = s => s;

    public override void WriteString(string text)
    {
        base.WriteString(TextTransform(text));
    }

    public override void WriteCData(string text)
    {
        base.WriteCData(TextTransform(text));
    }

    public override void WriteComment(string text)
    {
        base.WriteComment(TextTransform(text));
    }

    public override void WriteRaw(string data)
    {
        base.WriteRaw(TextTransform(data));
    }

    public override void WriteValue(string value)
    {
        base.WriteValue(TextTransform(value));
    }
}

Once this is in place, you can create your override of THIS as follows:
public class XmlRemoveInvalidCharacterWriter : XmlTextTransformWriter
{
    public XmlRemoveInvalidCharacterWriter(System.IO.TextWriter w) : base(w) { SetTransform(); }
    public XmlRemoveInvalidCharacterWriter(string filename, System.Text.Encoding encoding) : base(filename, encoding) { SetTransform(); }
    public XmlRemoveInvalidCharacterWriter(System.IO.Stream w, System.Text.Encoding encoding) : base(w, encoding) { SetTransform(); }

    void SetTransform()
    {
        TextTransform = XmlUtil.RemoveInvalidXmlChars;
    }
}

where XmlUtil.RemoveInvalidXmlChars is defined as follows:

    public static string RemoveInvalidXmlChars(string content)
    {
        if (content.Any(ch => !System.Xml.XmlConvert.IsXmlChar(ch)))
            return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
        else
            return content;
    }
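
The answer does not show the writer in use. The sketch below is a condensed, self-contained stand-in (StrippingXmlTextWriter and WriterDemo are hypothetical names, not the classes above) that overrides only WriteString and serializes an XElement through it, on the assumption that XElement.WriteTo passes element text through WriteString.

```csharp
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Condensed stand-in for the transform writer: strips invalid chars from text nodes only.
public class StrippingXmlTextWriter : XmlTextWriter
{
    public StrippingXmlTextWriter(TextWriter w) : base(w) { }

    public override void WriteString(string text)
    {
        if (text != null)
            text = new string(text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray());
        base.WriteString(text);
    }
}

public static class WriterDemo
{
    // Serializes the element through the stripping writer and returns the XML text.
    public static string Serialize(XElement element)
    {
        using (var sw = new StringWriter())
        {
            using (var writer = new StrippingXmlTextWriter(sw))
            {
                element.WriteTo(writer);
            }
            return sw.ToString();
        }
    }
}
```

Usage would look like WriterDemo.Serialize(new XElement("Body", emailText)), where emailText is whatever string came out of the email.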

0

Couldn't this string be cleaned with:

System.Net.WebUtility.HtmlDecode()

?

