Troubleshooting "Illegal mix of collations" error in MySQL

285
I am getting the error below when trying to run a SELECT through a stored procedure in MySQL:

Illegal mix of collations (latin1_general_cs,IMPLICIT) and (latin1_general_ci,IMPLICIT) for operation '='

Any idea what might be going wrong here?
The collation of the table is latin1_general_ci, and that of the column in the WHERE clause is latin1_general_cs.

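You can try this script, which converts all of your databases and tables to utf8. - Mirat Can Bayrak
3
I have been using various databases for a long time (since 1990), and MySQL's use of collations and coercion looks "crazy" to me. Databases solve this problem by imposing "ONE" character set on the whole database, and it is then up to the import/export procedures to convert to and from the single character set the database uses. The solution chosen by MySQL is a disruptive one, because it mixes "application problems" (character set conversion) with database problems (use of collations). Why not "remove" these tedious and silly features from the database so that it becomes easier to use and control? - Maurizio Pievaioli
18 Answers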

269

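This is generally caused by comparing two strings of incompatible collations, or by attempting to select data of different collations into a combined column.

The COLLATE clause allows you to specify the collation used in the query.

For example, the following WHERE clause will always give the error you posted: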

WHERE 'A' COLLATE latin1_general_ci = 'A' COLLATE latin1_general_cs

Your solution is to specify a shared collation for the two columns within the query. Here is an example that uses the COLLATE clause:

SELECT * FROM `table` ORDER BY `key` COLLATE latin1_general_ci;
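
For the '=' comparison in the question, a minimal sketch of the same idea might look like the following. The table and column names (t_ci, t_cs, name) are made up for illustration and do not come from the original post.

```sql
-- Hypothetical tables: same character set, different collations.
CREATE TABLE t_ci (name VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_general_ci);
CREATE TABLE t_cs (name VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_general_cs);

-- Comparing the two columns directly is expected to raise
-- "Illegal mix of collations ... for operation '='":
-- SELECT * FROM t_ci JOIN t_cs ON t_ci.name = t_cs.name;

-- An explicit COLLATE on one operand settles which collation is used.
SELECT *
FROM t_ci
JOIN t_cs ON t_ci.name = t_cs.name COLLATE latin1_general_ci;
```

Because COLLATE is applied to t_cs.name here, an index on that column may not be usable for the comparison.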

Another option is to use the BINARY operator:

BINARY str is a shorthand for CAST(str AS BINARY).

Your solution might look something like this:

SELECT * FROM `table` WHERE BINARY a = BINARY b;

or,
SELECT * FROM `table` ORDER BY BINARY a;

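Please keep in mind that, as Jacob Stamm pointed out in the comments, "casting columns to compare them will cause any indexing on that column to be ignored".
For much greater detail about this collation business, I highly recommend eggyal's excellent answer to this same question.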

4
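Thanks. Actually, it behaves quite weirdly in my case. When I run the query as-is through the query browser, it fetches the results. But running it through a stored procedure throws the error. - user355562
8
Binary seemed to be the best solution for me. It might be the best for you as well if you aren't using any tricky filters. - Adam Fowler
1
I had the same issue, and the way I solved it was to recreate everything from scratch. I tried changing the collation, but I still got the error when joining, so I went that way. Please correct me if this is wrong. - Bobby Z
1
Note that there is a bug in MariaDB when using COLLATE latin1_general_ci that causes a different error: COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1' - even if you don't have a column with CHARACTER SET 'latin1'! The workaround is to use the BINARY cast. See also this question - Mel_T
2
Note that casting columns to compare them will cause any indexing on that column to be ignored. - Jacob Stamm
Excellent point, Jacob! - defines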

216

TL;DR

Either change the collation of one (or both) of the strings so that they match, or else add a COLLATE clause to your expression.


  1. What is this "collation" stuff anyway?

    As documented under Character Sets and Collations in General:

    A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

    Suppose that we have an alphabet with four letters: “A”, “B”, “a”, “b”. We give each letter a number: “A” = 0, “B” = 1, “a” = 2, “b” = 3. The letter “A” is a symbol, the number 0 is the encoding for “A”, and the combination of all four letters and their encodings is a character set.

    Suppose that we want to compare two string values, “A” and “B”. The simplest way to do this is to look at the encodings: 0 for “A” and 1 for “B”. Because 0 is less than 1, we say “A” is less than “B”. What we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): “compare the encodings.” We call this simplest of all possible collations a binary collation.

    But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters “a” and “b” as equivalent to “A” and “B”; (2) then compare the encodings. We call this a case-insensitive collation. It is a little more complex than a binary collation.

    In real life, most character sets have many characters: not just “A” and “B” but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules, not just for whether to distinguish lettercase, but also for whether to distinguish accents (an “accent” is a mark attached to a character as in German “Ö”), and for multiple-character mappings (such as the rule that “Ö” = “OE” in one of the two German collations).

    Further examples are given under Examples of the Effect of Collation.

  2. Okay, but how does MySQL decide which collation to use for a given expression?

    As documented under Collation of Expressions:

    In the great majority of statements, it is obvious what collation MySQL uses to resolve a comparison operation. For example, in the following cases, it should be clear that the collation is the collation of column x:

    SELECT x FROM T ORDER BY x;
    SELECT x FROM T WHERE x = x;
    SELECT DISTINCT x FROM T;
    

    However, with multiple operands, there can be ambiguity. For example:

    SELECT x FROM T WHERE x = 'Y';
    

    Should the comparison use the collation of the column x, or of the string literal 'Y'? Both x and 'Y' have collations, so which collation takes precedence?

    Standard SQL resolves such questions using what used to be called “coercibility” rules.

    [ deletia ]

    MySQL uses coercibility values with the following rules to resolve ambiguities:

    • Use the collation with the lowest coercibility value.

    • If both sides have the same coercibility, then:

      • If both sides are Unicode, or both sides are not Unicode, it is an error.

      • If one of the sides has a Unicode character set, and another side has a non-Unicode character set, the side with Unicode character set wins, and automatic character set conversion is applied to the non-Unicode side. For example, the following statement does not return an error:

        SELECT CONCAT(utf8_column, latin1_column) FROM t1;
        

        It returns a result that has a character set of utf8 and the same collation as utf8_column. Values of latin1_column are automatically converted to utf8 before concatenating.

      • For an operation with operands from the same character set but that mix a _bin collation and a _ci or _cs collation, the _bin collation is used. This is similar to how operations that mix nonbinary and binary strings evaluate the operands as binary strings, except that it is for collations rather than data types.

  3. So what is an "illegal mix of collations"?

    An "illegal mix of collations" occurs when an expression compares two strings of different collations but of equal coercibility and the coercibility rules cannot help to resolve the conflict. It is the situation described under the third bullet-point in the above quotation.

    The particular error given in the question, Illegal mix of collations (latin1_general_cs,IMPLICIT) and (latin1_general_ci,IMPLICIT) for operation '=', tells us that there was an equality comparison between two non-Unicode strings of equal coercibility. It furthermore tells us that the collations were not given explicitly in the statement but rather were implied from the strings' sources (such as column metadata).

  4. That's all very well, but how does one resolve such errors?

    As the manual extracts quoted above suggest, this problem can be resolved in a number of ways, of which two are sensible and to be recommended:

    • Change the collation of one (or both) of the strings so that they match and there is no longer any ambiguity.

      How this can be done depends upon from where the string has come: Literal expressions take the collation specified in the collation_connection system variable; values from tables take the collation specified in their column metadata.

    • Force one string to not be coercible.

      I omitted the following quote from the above:

      MySQL assigns coercibility values as follows:

      • An explicit COLLATE clause has a coercibility of 0. (Not coercible at all.)

      • The concatenation of two strings with different collations has a coercibility of 1.

      • The collation of a column or a stored routine parameter or local variable has a coercibility of 2.

      • A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.

      • The collation of a literal has a coercibility of 4.

      • NULL or an expression that is derived from NULL has a coercibility of 5.

      Thus simply adding a COLLATE clause to one of the strings used in the comparison will force use of that collation.

    Whilst the others would be terribly bad practice if they were deployed merely to resolve this error:

    • Force one (or both) of the strings to have some other coercibility value so that one takes precedence.

      Use of CONCAT() or CONCAT_WS() would result in a string with a coercibility of 1; and (if in a stored routine) use of parameters/local variables would result in strings with a coercibility of 2.

    • Change the encodings of one (or both) of the strings so that one is Unicode and the other is not.

      This could be done via transcoding with CONVERT(expr USING transcoding_name); or via changing the underlying character set of the data (e.g. modifying the column, changing character_set_connection for literal values, or sending them from the client in a different encoding and changing character_set_client / adding a character set introducer). Note that changing encoding will lead to other problems if some desired characters cannot be encoded in the new character set.

    • Change the encodings of one (or both) of the strings so that they are both the same and change one string to use the relevant _bin collation.

      Methods for changing encodings and collations have been detailed above. This approach would be of little use if one actually needs to apply more advanced collation rules than are offered by the _bin collation.
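
    The coercibility values quoted above can be inspected directly with MySQL's built-in COERCIBILITY() and COLLATION() functions. This is only a small sketch. The _latin1 introducer is an assumption added so that the example does not depend on the connection character set.

```sql
-- Explicit COLLATE clause: coercibility 0.
SELECT COERCIBILITY(_latin1'abc' COLLATE latin1_general_cs);

-- System constant: coercibility 3.
SELECT COERCIBILITY(USER());

-- Plain string literal: coercibility 4.
SELECT COERCIBILITY('abc');

-- The collation an expression ends up with can be checked as well.
SELECT COLLATION(_latin1'abc' COLLATE latin1_general_cs);
```

    Running the same functions on both operands of a failing comparison shows which collations and coercibility values are colliding.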


6
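Note that an "illegal mix of collations" can also arise when there is no ambiguity over which collation should be used, but the string to be coerced must be transcoded to an encoding in which some of its characters cannot be represented. I've discussed this case in a previous answer. - eggyal
5
Great answer. This one should be further up, because it dives into what developers really need to know: not just how to fix it, but why things happen the way they do. - mark
1
Thanks dude, you taught me something today. - briankip
Very good answer, really informative. Sometimes SO (Stack Overflow) is too little about "why does it not work" and mostly about "copy and paste this to make it work". - Florian Loch
Noticed that I upvoted this long ago but never commented. This is a great answer! - defines
Now THAT is an answer! Why, how, cases... - Arno Teigseth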

92

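To help future searchers, I'd like to add what I found.

I ran into a similar problem and got the following error when using a custom function that received a varchar parameter: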

Illegal mix of collations (utf8_unicode_ci,IMPLICIT) and 
(utf8_general_ci,IMPLICIT) for operation '='

Using the following query:

mysql> show variables like "collation_database";
    +--------------------+-----------------+
    | Variable_name      | Value           |
    +--------------------+-----------------+
    | collation_database | utf8_general_ci |
    +--------------------+-----------------+

I was able to determine that the database was using utf8_general_ci, while the tables were defined with utf8_unicode_ci:

mysql> show table status;
    +--------------+-----------------+
    | Name         | Collation       |
    +--------------+-----------------+
    | my_view      | NULL            |
    | my_table     | utf8_unicode_ci |
    ...

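Notice that the views have a NULL collation. Although this query shows NULL for a view, it appears that views and functions do have collation definitions. The collation used is the database collation that was in effect when the view/function was created.

The sad solution was to change the database collation and then recreate the views/functions to force them to use the current collation.

  • Changing the database's collation:

    ALTER DATABASE mydb DEFAULT COLLATE utf8_unicode_ci;

  • Changing the table's collation:

    ALTER TABLE my_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

I hope this helps someone.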
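
To find where a mismatch lives, the information_schema tables can be queried. The sketch below assumes a schema named mydb and a target collation of utf8_unicode_ci, matching the statements above. Check the column names against your server version, and note that newer servers may report the collation as utf8mb3_unicode_ci.

```sql
-- Columns whose collation differs from the target collation.
SELECT TABLE_NAME, COLUMN_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'mydb'
  AND COLLATION_NAME IS NOT NULL
  AND COLLATION_NAME <> 'utf8_unicode_ci';

-- Collations recorded when each stored function/procedure was created.
SELECT ROUTINE_NAME, COLLATION_CONNECTION, DATABASE_COLLATION
FROM information_schema.ROUTINES
WHERE ROUTINE_SCHEMA = 'mydb';

-- Connection collation recorded when each view was created.
SELECT TABLE_NAME, COLLATION_CONNECTION
FROM information_schema.VIEWS
WHERE TABLE_SCHEMA = 'mydb';
```

Any routine or view that reports a stale collation is a candidate for being dropped and recreated.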

18
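The collation can also be set at the column level. You can view it with: show full columns from my_table; - Jonathan Tran
1
Thanks. I just dropped the schema, recreated it with the right default collation, and re-imported everything. - JRun
2
@JonathanTran Thank you! I had the character set and collation set on all the tables, the database, and the connection, but it was still giving an error! It turned out the collation was not set on one column! I fixed it with alter table <TABLE> modify column <COL> varchar(255) collate utf8_general_ci; - Chloe
5
Side note for future googlers: even if your database, tables, and fields all have the same collation, you must also make sure that your connection uses the same collation. Everything is "utf8mb4_unicode_ci", but SHOW session variables like '%collation%'; tells you that "collation_connection" is "utf8mb4_general_ci"? Then run SET collation_connection = utf8mb4_unicode_ci beforehand. - pixelbrackets
1
@pixelbrackets Thank you. I spent a whole day on this because I was creating views with derived columns, and this was the only way I could resolve the mixed-collation problem. It has been a long day. - Matt D.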

20

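Converting character sets can sometimes be dangerous, especially on databases with huge amounts of data. I think the best option is to use the "binary" operator:

e.g.: WHERE binary table1.column1 = binary table2.column1

Is this safe? I don't know how data is converted between string and binary, but two different strings in different encodings could have the same binary representation. - MakotoE
Works for my purposes, since the collations are utf8mb4_unicode_520_ci and utf8mb4_unicode_ci respectively, and there are only a few thousand rows. - Bret Weinraub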

19

I had a similar problem when trying to use the FIND_IN_SET function with a string variable.

SET @my_var = 'string1,string2';
SELECT * from my_table WHERE FIND_IN_SET(column_name,@my_var);

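and was receiving the error

Error Code: 1267. Illegal mix of collations (utf8_unicode_ci,IMPLICIT) and (utf8_general_ci,IMPLICIT) for operation 'find_in_set'

Short answer:

There is no need to change any collation_YYYY variables. Just add the correct collation next to your variable declaration, i.e.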

SET @my_var = 'string1,string2' COLLATE utf8_unicode_ci;
SELECT * from my_table WHERE FIND_IN_SET(column_name,@my_var);

Long answer:

I first checked the collation variables:

mysql> SHOW VARIABLES LIKE 'collation%';
    +----------------------+-----------------+
    | Variable_name        | Value           |
    +----------------------+-----------------+
    | collation_connection | utf8_general_ci |
    +----------------------+-----------------+
    | collation_database   | utf8_general_ci |
    +----------------------+-----------------+
    | collation_server     | utf8_general_ci |
    +----------------------+-----------------+

Then I checked the table's collation:

mysql> SHOW CREATE TABLE my_table;

CREATE TABLE `my_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `column_name` varchar(40) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=125 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

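This means that my variable was given the default collation of utf8_general_ci, while my table was configured as utf8_unicode_ci.

By adding the COLLATE clause next to the variable declaration, the variable's collation matched the collation configured for the table.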
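
If the SET statement cannot be changed, the same effect can probably be had by applying COLLATE at the point of use instead. This is a sketch under the assumption that @my_var already holds utf8 data, as in the setup above. COLLATION() is used only to confirm what the variable carries.

```sql
SET @my_var = 'string1,string2';

-- Shows the collation the variable picked up from the connection.
SELECT COLLATION(@my_var);

-- An explicit COLLATE on the variable overrides its implicit collation.
SELECT * FROM my_table
WHERE FIND_IN_SET(column_name, @my_var COLLATE utf8_unicode_ci);
```

Here the column is left untouched and only the variable is given an explicit collation.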


9

The solution below worked for me.

CONVERT(Table1.FromColumn USING utf8) = CONVERT(Table2.ToColumn USING utf8)

Not sure about the performance of this, but it certainly works! - Tom Nijs

4

Solution if literals are involved.

I am using Pentaho Data Integration and don't get to specify the SQL syntax. A very simple DB lookup gave the error "Illegal mix of collations (cp850_general_ci,COERCIBLE) and (latin1_swedish_ci,COERCIBLE) for operation '='".

The generated code was "SELECT DATA_DATE AS latest_DATA_DATE FROM hr_cc_normalised_data_date_v WHERE PSEUDO_KEY = ?"

To cut the story short, the lookup was against a view, and when I issued

mysql> show full columns from hr_cc_normalised_data_date_v;
+------------+------------+-------------------+------+-----+
| Field      | Type       | Collation         | Null | Key |
+------------+------------+-------------------+------+-----+
| PSEUDO_KEY | varchar(1) | cp850_general_ci  | NO   |     |
| DATA_DATE  | varchar(8) | latin1_general_cs | YES  |     |
+------------+------------+-------------------+------+-----+

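this explained where the 'cp850_general_ci' came from.

The view was simply created with 'SELECT 'X',......'. According to the manual, literals like this should inherit their character set and collation from the server settings, which were correctly defined as 'latin1' and 'latin1_general_cs'. As this clearly did not happen, I forced it in the creation of the view: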

CREATE OR REPLACE VIEW hr_cc_normalised_data_date_v AS
SELECT convert('X' using latin1) COLLATE latin1_general_cs        AS PSEUDO_KEY
    ,  DATA_DATE
FROM HR_COSTCENTRE_NORMALISED_mV
LIMIT 1;

Now both columns show latin1_general_cs and the error has gone away. :)

3
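If the trouble you are having is with "hashes", think about these points...
If the "hash" is a binary string, you should really use the BINARY(...) datatype.
If the "hash" is a hex string, you do not need utf8, and should avoid it because of character checks and the like. For example, MySQL's MD5(...) yields a fixed-length 32-byte hex string. SHA1(...) gives a 40-byte hex string. This could be stored in CHAR(32) CHARACTER SET ascii (or 40 for SHA1).
Or, better yet, store UNHEX(MD5(...)) into BINARY(16). This cuts the size of the column in half. (It does, however, make it rather unprintable.) Use SELECT HEX(hash) ... if you want it readable.
Comparing two BINARY columns has no collation issues.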
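
A minimal sketch of the BINARY(16) approach, using a hypothetical table named file_hash and a sample input string (neither is from the original answer):

```sql
-- Hypothetical table that stores raw 16-byte MD5 digests.
CREATE TABLE file_hash (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  md5_raw BINARY(16) NOT NULL,
  KEY (md5_raw)
);

-- UNHEX turns the 32-character hex string into 16 raw bytes.
INSERT INTO file_hash (md5_raw) VALUES (UNHEX(MD5('hello')));

-- The comparison is byte-wise, so no collation is involved;
-- HEX makes the stored value readable again.
SELECT id, HEX(md5_raw) AS md5_hex
FROM file_hash
WHERE md5_raw = UNHEX(MD5('hello'));
```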

3
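Very interesting... Now, be ready. I looked at all of the "add collate" solutions, and to me those are band-aid fixes. The reality is that the database design was "bad". Yes, standards change, new things get added, and so on, but that does not change the fact of a bad database design. I refuse to go the route of adding "collate" all over my SQL statements just to get my query to work. The only solution that works for me, and that will virtually eliminate the need to tweak my code in the future, is to redesign the database/tables to match the character set that I will live with and embrace for the long term. In this case, I chose the character set "utf8mb4".
So the solution when you encounter that "illegal" error message is to redesign your database and tables. It is much easier and quicker than it sounds. Exporting your data and re-importing it from a CSV file may not even be required. Change the character set of the database and make sure that the character sets of all your tables match.
Use these commands to guide you: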
SHOW VARIABLES LIKE "collation_database";
SHOW TABLE STATUS;

Now, if you enjoy adding "collate" here and there and beefing up your code with forced "overrides", be my guest.
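
As a rough sketch of what that redesign can look like, assuming a schema named mydb (a placeholder) and utf8mb4_unicode_ci as the chosen collation, the following changes the database default and generates one conversion statement per base table. The generated statements are meant to be reviewed before being run, since CONVERT TO rewrites the table data.

```sql
-- New default for objects created from now on.
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Generate an ALTER TABLE ... CONVERT TO statement for each existing table.
SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME,
              '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;') AS stmt
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_TYPE = 'BASE TABLE';
```

Views and stored routines are not touched by these statements and would still need to be recreated.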


2

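MySQL really dislikes mixing collations unless it can coerce them to the same one (which clearly isn't feasible in your case). Can't you just force the same collation to be used via a COLLATE clause? (Or use the simpler BINARY shortcut, if applicable...)


Does this only apply to MySQL? How do other systems handle a mix of incompatible collations of seemingly equal priority? - eggyal
Your link is broken. - Benubird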

Content provided by Stack Overflow.