将varchar()中的重音字符转换为XML时出现“非法的XML字符”错误

4

我有一个由应用程序写入的表格。该字段是varchar(max)类型。数据看起来像是xml。

DECLARE @poit VARCHAR(100)
SET @poit = '<?xml version="1.0" encoding="utf-8"?><test>VÍA</test>'
SELECT CONVERT(XML,@poit)

但是(似乎是因为UTF8的原因;移除它可以解决),我收到了这个错误:
XML parsing: line 1, character 46, illegal xml character

有没有一种干净的方法可以进行转换?

我发现这个线程,它讨论了varchar不支持“非ASCII字符”,尽管显然I是非Unicode的。 是的,我可以这样做:

SELECT CONVERT(XML,REPLACE(@poit,'encoding="utf-8"',''))

但这是最好的方式吗?

为什么将UTF-8 VARCHAR列转换为XML需要转换为NVARCHAR和编码更改?


这篇回答解释了为什么你应该始终使用nvarchar而不是varchar,对于我来说只是几分钱的事情。 - gofr1
是的和不。实际上,对我来说通常不值得花费(存储空间、备份空间和IOPS)。是的,这会使生活更简单,但我已经以每天40GB的速度增长了。 - mbourgon
@gofr1 那个回答过于简单化,基于一个非常错误的神话,即磁盘的唯一成本是金钱。我写了一篇关于这个主题的文章,详细介绍了所有“隐藏”的成本:Disk Is Cheap! ORLY?(该网站需要免费注册才能查看其内容)。此外,我更新了下面的答案,包括有关 XML、类型和在使用它时实际上比使用 VARCHAR(MAX) 占用更少空间的内部优化信息。 - Solomon Rutzky
2个回答

4
<TL;DR> 如�你�想得到答案而�需�完整的解释,请直�滚动到“结论�。但是,你真的应该花时间阅读解释 😸 </TL;DR>

这里有几件事情需�注�:

  1. The encoding= attribute of the <xml> element is used to denote how the underlying bytes of the XML document are to be interpreted. If the document within the string literal is correct, then there is no need to have the encoding attribute. If there are characters that are incorrect, then the encoding attribute can remain as it will inform the XML conversion on what those characters were originally.

  2. UTF-8 is a Unicode encoding, yet you have the variable and literal as VARCHAR data, not NVARCHAR (which also requires prefixing the string literal with a capital-N). By using VARCHAR and no N-prefix, if there were any characters in the XML document that couldn't fit into the Code Page denoted by the default Collation of whatever database you are in when executing this query, you would have already lost those characters (even if you can see them on screen, they wouldn't be correct in the VARCHAR variable, or if you did a simple SELECT of that literal).

  3. Windows (and .NET, SQL Server, etc) use UTF-16 Little Endian. The Í character, Latin Capital Letter I with Acute, exists in both Code Page 1252 and UTF-16LE as value 205 (e.g. SELECT ASCII('Í'), CHAR(205); ), which is why it works when you remove the encoding="utf-8" and why you didn't "lose" that character by placing it in a VARCHAR literal and variable. HOWEVER, as shown on that linked page, the byte sequence in the UTF-8 encoding is 195, 141 (yes, two bytes). Meaning, that character, if it truly was UTF-8 encoded, would not appear to be that character when placed into a UTF-16LE environment.

    The XML conversion looks at that character's byte value of 205 (single byte since it is currently VARCHAR data) and tries to provide the UTF-16LE equivalent of what that sequence is in UTF-8. Except 205 by itself doesn't exist in UTF-8. So you need to add in the next character which is a capital-"A" which has a value of 65. While there are two-byte sequences in UTF-8, none of them are 205, 65. This is why you get the illegal xml character error.

  4. Since the text on screen has to be UTF-16LE, if the source really was UTF-8, then the underlying UTF-8 byte sequence would have to be converted into UTF-16LE. The underlying byte sequence of Í is 195, 141. So we can create that sequence out of regular ASCII characters of Code Page 1252 (since this is, again, currently VARCHAR data) by doing the following:

    DECLARE @poit VARCHAR(100);
    SET @poit = '<?xml version="1.0" encoding="UTF-8"?><test>V'
                  + CHAR(195) + CHAR(141) + 'A</test>';
    SELECT CONVERT(XML, @poit);
    

    Returns:

    <test>VÍA</test>
    

    Data is still VARCHAR and encoding="utf-8" is still in the <xml> element!

  5. If keeping the data as VARCHAR, then the following change of just the encoding= value works:

    DECLARE @poit VARCHAR(100);
    SET @poit = '<?xml version="1.0" encoding="Windows-1252"?><test>VÍA</test>';
    SELECT CONVERT(XML, @poit);
    

    This assumes that the source encoding really was "Windows-1252", which is Microsoft's version of Latin1_General, which is the basis of the Latin1_General collations.

    BUT, there is again no need to even specify the "encoding" if it is the same as the Code Page of the current databases's default collation as that is assumed for any VARCHAR data.

  6. Finally, XML data in SQL Server is UTF-16LE, same as NCHAR and NVARCHAR (and NTEXT, but nobody should be using that anymore).

结论

  1. 在处理XML字符串时,请使用NVARCHAR(MAX)作为数据类型(而不是VARCHAR)。

  2. 对于没有任何更改字符的字符串(即屏幕上看起来完美无缺的字符串),只需删除encoding="utf-8"即可。无需将其替换为UTF-16,因为这已经被假定为一个NVARCHAR变量或文字(即一个以大写字母N为前缀的字符串)的值所具有的特性。


关于使用VARCHAR(MAX)代替XML甚至NVARCHAR(MAX)以节省空间,请记住,XML数据类型在内部进行了优化,使元素和属性名称仅存储一次,存储在字典中,因此与完全书写出的XML字符串版本相比,它们没有那么多的开销。因此,虽然XML类型将字符串存储为UTF-16LE,但如果 XML文档具有大量重复的元素和/或属性名称,则使用XML类型实际上可能会导致比使用VARCHAR(MAX)更小的占用空间:

DECLARE @ElementBased XML;
SET @ElementBased = (
                     SELECT * FROM master.sys.all_columns FOR XML PATH('Row')
                    );

DECLARE @AttributeBased XML;
SET @AttributeBased = (
                       SELECT * FROM master.sys.all_columns FOR XML RAW('Row')
                      );

SELECT @ElementBased AS [ElementBasedXML],
       @AttributeBased AS [AttributeBasedXML],

       DATALENGTH(@ElementBased) AS [ElementBasedXmlBytes],
       DATALENGTH(CONVERT(VARCHAR(MAX), @ElementBased)) AS [ElementBasedVarCharBytes],
       ((DATALENGTH(@ElementBased) * 1.0) / DATALENGTH(CONVERT(VARCHAR(MAX), @ElementBased))
               ) * 100 AS [XmlElementSizeRelativeToVarcharElementSize],

       DATALENGTH(@AttributeBased) AS [AttributeBasedXmlBytes],
       DATALENGTH(CONVERT(VARCHAR(MAX), @AttributeBased)) AS [AttributeBasedVarCharBytes],
       ((DATALENGTH(@AttributeBased) * 1.0) /
         DATALENGTH(CONVERT(VARCHAR(MAX), @AttributeBased))) * 100
               AS [XmlAttributeSizeRelativeToVarCharAttributeSize];

返回值(至少在我的系统上)为:
ElementBasedXmlBytes                              1717896
ElementBasedVarCharBytes                          5889081
XmlElementSizeRelativeToVarcharElementSize        29.170867237180130482100

AttributeBasedXmlBytes                            1544661
AttributeBasedVarCharBytes                        3461864
XmlAttributeSizeRelativeToVarCharAttributeSize    44.619343798600984902900

如您所见,对于基于元素的 XML ,XML 数据类型的大小为 VARCHAR(MAX) 版本的 29%,而对于基于属性的 XML ,XML 数据类型的大小为 VARCHAR(MAX) 版本的 44%。


1
谢谢您加入有关重复字符串的内部优化的最后一点,以及超级完整的答案! - mbourgon
这是一个很好的答案,但是它太长了,我几乎错过了结论并滑到了其他答案。 - nurettin
鉴于Unicode /编码/排序等方面存在大量错误和误解,而且大多数关于此主题的帖子和答案都是不正确的(程度不同),因此我认为这个答案绝对不会太长,因为这些信息非常必要。然而,我也意识到一天中只有那么多时间,所以我在顶部添加了一个注释,告诉人们如果他们不想要解释就可以直接滚动到结论部分。 - Solomon Rutzky
@SolomonRutzky,那很棒。 - nurettin

1
我建议尝试将您的@poit变量的数据类型从VARCHAR(100)更改为NVARCHAR(100)。然后用utf-16替换utf-8编码,这样您的代码会类似于:
    DECLARE @poit NVARCHAR(100)
    SET @poit = '<?xml version="1.0" encoding="utf-8"?><test>VÍA</test>'
    SELECT CONVERT(XML,REPLACE(@poit, 'utf-8', 'utf-16'))

只要您不在返回大量结果的SELECT语句中使用带有替换的转换,性能就应该很好,并且它将完成工作。
参考: http://xml.silmaril.ie/characters.html <- 滚动下去,您会看到一些关于utf-8和utf-16之间差异的信息。希望这可以帮助你!

2
使用NVARCHAR而不是VARCHAR的观点不错。但是,您需要一直保持这种方式,并在文字前加上大写的“N”,否则可能会丢失数据。另外,也没有必要将utf-8替换为utf-16,因为UTF-16是NVARCHAR数据唯一的编码方式 ;-)。如果您有兴趣,我的回答中有完整的解释 :)。 - Solomon Rutzky

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接