SQL Server SQL_Latin1_General_CP1_CI_AS能否安全地转换为Latin1_General_CI_AS?

34
我们有一个遗留的数据库,其中一些(较旧)列使用"SQL_Latin1_General_CP1_CI_AS",而最近的更改使用了"Latin1_General_CI_AS"。这很麻烦,因为连接需要额外的COLLATE语句才能工作。
我想把所有东西都升级到"Latin1_General_CI_AS"。据我所知,它们几乎是相同的排序规则,在这个过程中不会丢失数据...
有人知道这是否正确吗?
5个回答

47

以下是更完整的答案:

https://www.olcot.co.uk/revised-difference-between-collation-sql_latin1_general_cp1_ci_as-and-latin1_general_ci_as/

这些排序规则之间的主要区别在于它们如何应用字符扩展规则。某些拉丁字符可能会扩展为多个字符。当使用非unicode文本时,SQL_xxxx排序规则可能忽略这些字符扩展,但对于unicode文本则应用它们。因此:在使用一种排序规则与另一种排序规则时,连接、排序和比较可能返回不同的结果。

例如:

Latin1_General_CI_AS下,这两个语句将返回相同的记录集,因为 ß 被扩展为 ss

SELECT * FROM MyTable3 WHERE Comments = 'strasse'
SELECT * FROM MyTable3 WHERE Comments = 'straße'

使用SQL_Latin1_General_CP1_CI_AS时,上述语句返回不同的记录,因为ß被视为不同于ss的字符。


2
在我的答案中,我包括了字符扩展的信息(以及Unicode文档的链接),并添加了几个需要注意的项目 :-). - Solomon Rutzky
1
如果“Comments”是“NVARCHAR”,那该怎么办? “SQL_Latin1_General_CP1_CI_AS”不支持Unicode,因此会回退到“Latin1_General_CI_AS”,它处理“ass”=“aß”。如何强制“NVARCHAR”比较“ss”<>“ß”? - djk
@Hannobo 在定义表、临时表或表变量中的任何文本字段时,您可以指定排序规则。对于许多基于文本的操作,您还可以通过添加 COLLATE {desired_collation} 来指定排序规则。我不确定,但您可能可以在定义变量时指定排序规则。由于我是凭记忆回答这个问题的,建议您查阅 SQL Server T-SQL 文档以获取语法和更多详细信息。 - Zarepheth
@Zarepheth 我试过了,但是文档和测试都表明任何排序规则(无论是在字段上还是在查询中)都会回退到 NVARCHAR 的标准 Unicode。我找到的唯一方法是动态地转换为 VARCHAR。 - djk
3
@Hannobo(和Zarepheth):不要转换为VARCHAR,因为那会导致数据丢失。你需要使用匈牙利语或二进制排序规则,因为它们是唯一不将这些特定字符等同的排序规则。所以尝试使用COLLATE Hungarian_Technical_100_CI_AS_SCCOLLATE Latin1_General_100_BIN2。有关更多详细信息,请查看我的答案在这里在这里。此外,变量(和字符串字面值)始终具有当前数据库默认排序规则的排序规则,并且无法赋予不同的排序规则。 - Solomon Rutzky

34
如果您要更改数据库的排序规则,则一定有些事情您需要了解,以便您可以做好相应的计划:
  • Regarding data-loss potential:

    • NVARCHAR fields are all Unicode, which is a single character set, so there can't be any data loss for these fields (this also covers XML fields which are also stored as UTF-16 Little Endian). Meta-data fields that store the object / column / index / etc names are all NVARCHAR so no need to worry about those.
    • VARCHAR fields having different Collations but the same Code Page between the differing Collations will not be a problem since the Code Page is the character set.
    • VARCHAR fields having different Collations and moving to a different Code Page (when changing Collations) can have data loss if any of the characters being used are not represented in the new Code Page. HOWEVER, this is only an issue when physically changing the Collation of a particular field (described below) and would not happen upon changing the default Collation of a database.
  • Local variables and string literals get their Collation from the Database default. Changing the database default will change the Collation used for both local variables and string literals. But changing the Database's default Collation does not change the Collation used for existing string columns in the tables in that Database. This generally should not cause any problems when comparing or concatenating a column with a literal and/or variable since the literals and variables will take on the Collation of the column due to Collation Precedence. The only potential problem would be Code Page conversions that might occur for characters of values between 128 - 255 that are not available in the Code Page used by the Collation of the column.

  • If you are expecting a predicate / comparison / sort / concatenation / etc for a column to behave differently upon changing the Database's default Collation, then you will need to explicitly change that column's Collation using the following command:

    ALTER TABLE [{table_name}]
       ALTER COLUMN [{column_name}]
       {same_datatype}
       {same_NULL_or_NOT NULL_setting}
       COLLATE {name_of_Database_default_Collation};
    

    Be sure to specify the exact same datatype and NULL / NOT NULL setting that are currently being used, else they can revert to the default if not already being the default value. After that, if there are any indexes on any of the string columns that just had their Collation changed, then you need to rebuild those indexes.

  • Changing the Database's default Collation will change the Collation of certain database-specific meta-data, such as the name field in both sys.objects, sys.columns, sys.indexes, etc. Filtering these system Views against local variables or string literals won't be a problem since the Collation will be changing on both sides. But, if you JOIN any of the local system Views to temporary tables on string fields, and the Database-level Collation between the local database and tempdb doesn't match, then you will get the "Collation mismatch" error. This is discussed below along with the remedy.

  • One difference between these two Collations is in how they sort certain characters for VARCHAR data (this does not affect NVARCHAR data). The non-EBCDIC SQL_ Collations use what is called "String Sort" for VARCHAR data, while all other Collations, and even NVARCHAR data for the non-EBCDIC SQL_ Collations, use what is called "Word Sort". The difference is that in "Word Sort", the dash - and apostrophe ' (and maybe a few other characters?) are given a very low weight and are essentially ignored unless there are no other differences in the strings. To see this behavior in action, run the following:

    DECLARE @Test TABLE (Col1 VARCHAR(10) NOT NULL);
    INSERT INTO @Test VALUES ('aa');
    INSERT INTO @Test VALUES ('ac');
    INSERT INTO @Test VALUES ('ah');
    INSERT INTO @Test VALUES ('am');
    INSERT INTO @Test VALUES ('aka');
    INSERT INTO @Test VALUES ('akc');
    INSERT INTO @Test VALUES ('ar');
    INSERT INTO @Test VALUES ('a-f');
    INSERT INTO @Test VALUES ('a_e');
    INSERT INTO @Test VALUES ('a''kb');
    
    SELECT * FROM @Test ORDER BY [Col1] COLLATE SQL_Latin1_General_CP1_CI_AS;
    -- "String Sort" puts all punctuation ahead of letters
    
    SELECT * FROM @Test ORDER BY [Col1] COLLATE Latin1_General_100_CI_AS;
    -- "Word Sort" mostly ignores dash and apostrophe
    

    Returns:

    String Sort
    -----------
    a'kb
    a-f
    a_e
    aa
    ac
    ah
    aka
    akc
    am
    ar
    

    and:

    Word Sort
    ---------
    a_e
    aa
    ac
    a-f
    ah
    aka
    a'kb
    akc
    am
    ar
    

    While you will "lose" the "String Sort" behavior, I'm not sure that I would call that a "feature". It is a behavior that has been deemed undesirable (as evidenced by the fact that it wasn't brought forward into any of the Windows collations). However, it is a definite difference of behavior between the two collations (again, just for non-EBCDIC VARCHAR data), and you might have code and/or customer expectations based upon the "String Sort" behavior. This requires testing your code and possibly researching to see if this change in behavior might have any negative impact on users.

  • Another difference between SQL_Latin1_General_CP1_CI_AS and Latin1_General_100_CI_AS is the ability to do Expansions on VARCHAR data (NVARCHAR data can already do these for most SQL_ Collations), such as handling æ as if it were ae:

    IF ('æ' COLLATE SQL_Latin1_General_CP1_CI_AS =
        'ae' COLLATE SQL_Latin1_General_CP1_CI_AS)
    BEGIN
      PRINT 'SQL_Latin1_General_CP1_CI_AS';
    END;
    
    IF ('æ' COLLATE Latin1_General_100_CI_AS =
        'ae' COLLATE Latin1_General_100_CI_AS)
    BEGIN
      PRINT 'Latin1_General_100_CI_AS';
    END;
    

    Returns:

    Latin1_General_100_CI_AS
    

    The only thing you are "losing" here is not being able to do these expansions. Generally speaking, this is another benefit of moving to a Windows Collation. However, just like with the "String Sort" to "Word Sort" move, the same caution applies: it is a definite difference of behavior between the two collations (again, just for VARCHAR data), and you might have code and/or customer expectations based upon not having these mappings. This requires testing your code and possibly researching to see if this change in behavior might have any negative impact on users.

    (first noted in @Zarepheth's answer and expanded on here)

  • Another difference (that is also a benefit of moving to a Windows Collation) is that filtering a VARCHAR column that is indexed on NVARCHAR literal / variable / column you will no longer invalidate the index on the VARCHAR column. This is due to the Windows Collations using the same Unicode sorting and comparison rules for both VARCHAR and NVARCHAR data. Because the sort order is the same between the two types, when the VARCHAR data gets converted into NVARCHAR (explicitly or implicitly due to datatype precedence), the order of items in the index is still valid. For more details on this behavior, please see my post: Impact on Indexes When Mixing VARCHAR and NVARCHAR Types.

  • The server-level Collation is used to set the Collation of the system databases, which includes [model]. The [model] database is used as a template to create new databases, which includes [tempdb] upon each server startup. So, if the Database's default collation does not match the instance's default Collation and you join local tables to temporary tables on string fields, then you will get the Collation-mismatch error. Fortunately there is a somewhat easy way to correct for collation differences between the database that is "current" when CREATE #TempTable is executed and [tempdb]. When creating temporary tables, declare a collation (on string columns) using the COLLATE clause and use either a specific collation (if you know that the DB will always be using that collation), or DATABASE_DEFAULT (if you don't always know the collation of the DB where this code will execute):

    CREATE TABLE #Temp (Col1 NVARCHAR(40) COLLATE DATABASE_DEFAULT);
    

    This is not necessary for table variables since they get their default Collation from the "current" database. However, if you have both table variables and temporary tables and join them on string fields, then you will need to use COLLATE {specific_collation} or COLLATE DATABASE_DEFAULT as shown directly above.

  • The server-level collation also controls local variable names, CURSOR variable names, and GOTO labels. While none of these would be impacted by the specific change being dealt with in this Question, it is at least something to be aware of.

  • It is best to use the most recent version of the desired collation, if multiple versions are available. Starting in SQL Server 2005, a "90" series of collations was introduced, and SQL Server 2008 introduced a "100" series of collations. You can find these collations by using the following queries:

    SELECT * FROM sys.fn_helpcollations() WHERE [name] LIKE N'%[_]90[_]%'; -- 476
    
    SELECT * FROM sys.fn_helpcollations() WHERE [name] LIKE N'%[_]100[_]%'; -- 2686
    
  • ALSO, while the question asks about case-insensitive Collations, it should be noted that if someone else is looking to make a similar change but is using case-sensitive Collations, then another difference between SQL Server Collations and Windows Collations, for VARCHAR data only, is which case sorts first. Meaning, if you have both A and a, the SQL_ Collations will sort A before a, while the non-SQL_ Collations (and the SQL_ Collations when dealing with NVARCHAR data) will sort a before A.

如需有关更改数据库或整个实例的排序规则的更多信息和详细信息,请参阅我的文章:
更改实例、数据库和所有用户数据库中所有列的排序规则:可能会发生什么问题?

有关使用字符串和排序规则的更多信息,请访问:排序规则信息


1
@Jay 谢谢!请重新阅读,因为我做了一些调整。关于文字/变量和索引的一些信息并不完全正确。现在是正确的。我还添加了两个链接到我的帖子,这些帖子在我提交答案时不存在。 - Solomon Rutzky

16

这个MSDN论坛上有更多信息:

http://social.msdn.microsoft.com/Forums/en-US/sqlgetstarted/thread/196b4586-1338-434d-ba8c-49fa3c9bdeeb/

其中提到:

如果排序规则是SQL_Latin1_General_CP1_CI_AS或Latin1_General_CI_AS,你应该不会看到太大的差异,但两者都有快慢之分。

Latin1_General_CI_AS:Latin1-General,不区分大小写,区分重音符号,不区分假名类型,不区分宽度

SQL_Latin1_General_CP1_CI_AS:Latin1-General,不区分大小写,区分重音符号,不区分假名类型,不区分宽度(对于Unicode数据),非Unicode数据在代码页1252上使用SQL Server排序顺序52

因此,在我看来,如果您的数据只包含a-z0-9,则不应该看到任何差异。


2
除了当您想要将 INFORMATION_SCHEMA 的结果与临时表连接或比较,而 AzureSQL 表示它无法执行此操作时,您应该看不出太大差别。 - StingyJack

6
SELECT * FROM ::fn_helpcollations()
WHERE name IN (
'SQL_Latin1_General_CP1_CI_AS',
'Latin1_General_CI_AS'
)

Latin1_General_CI_AS: 这是一种Latin1-General的排序规则,它不区分大小写、重音符号、假名类型和宽度。

SQL_Latin1_General_CP1_CI_AS: 这也是一种Latin1-General的排序规则,它不区分大小写、重音符号、假名类型和宽度,但是对于Unicode数据,使用的是SQL Server Sort Order 52,非Unicode数据使用Code Page 1252进行排序

因此,从中可以推断出所使用的代码页是相同的(Latin1-General => 1252),因此您应该不会遇到任何数据损失 - 如果在转换后发生任何变化,可能是排序顺序 - 这可能是无关紧要的。


-1

要做到这一点,请转到您的数据库属性并选择选项。

然后将集合类型更改为SQL_Latin1_General_CP1_CS_AS。

enter image description here


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接