在Latin1数据库中保存UTF-8数据:可行吗?

7
我有一个Rails应用程序,从Android设备接收数据。我注意到一些数据在日语情况下保存不正确。在MySQL客户端和Rails网站中,它显示为文字问号(而不是菱形问号)。
原来,我连接到Rails应用程序的数据库设置为Latin1。Rails设置为UTF-8。
我阅读了很多关于字符编码的文章,但它们都提到数据在某种程度上还是可以读取的。然而,我的数据只有文字问号。尝试使用网络上的几种方法将数据转换为UTF-8也没有改变什么。我怀疑数据在写入数据库时被转换为问号。
MySQL控制台的样本输出:
select * from foo where bar = "foobar";
+-------+------+------------------------+---------------------+---------------------+
| id    | name | bar                    | created_at          | updated_at          |
+-------+------+------------------------+---------------------+---------------------+
| 24300 | ???? | foobar                 | 2012-01-23 05:04:22 | 2012-01-23 05:04:22 |
+-------+------+------------------------+---------------------+---------------------+
1 row in set (0.00 sec)

我的Rails应用从Android客户端获得的输入数据是:
name = 爆笑笑話

在保存到数据库之前,这些输入数据已经被验证存在于rails应用程序中。因此,它在Android客户端或在传输到服务器过程中不会被破坏。是否有可能获取这些数据?还是说它已经完全丢失了?


1
如果你执行的是 SELECT HEX(name) FROM foo,那么返回的结果若是 "3F3F3F...",则你的数据已经损坏无法恢复;但如果输出的结果是其他值,可能还有一线生机去挽救它。 - user149341
确实是3f3f3f,我会检查一下最近插入的一些其他数据,并且不可能是由我恢复的某个旧备份引起的。 - Peterdk
如果是这种情况,数据库实际上只包含问号 -- 原始数据已经无法恢复。eggyal提供的下一步操作建议很好。 - user149341
是的,今天写入的记录中也有这个问题。非常奇怪,你本应该期望Rails 3能以某种正确的方式处理它,而不是以这种方式丢失数据。 - Peterdk
我曾经遇到过同样的问题,并在此处详细记录了我的解决方案:[http://dba.stackexchange.com/a/96322/14447] - Fabio
1个回答

10

实际上,很容易会 认为 数据是以某种方式进行编码的,但事实上它却是以其他方式进行编码的:这是因为任何直接检索数据的尝试都将首先转换为您的数据库连接的字符集,然后再转换为输出介质的字符集——因此,您应该通过以下任一方式先验证存储数据的实际编码:SELECT BINARY name FROM foo WHERE bar = 'foobar'SELECT HEX(name) FROM foo WHERE bar = 'foobar'

当期望出现字符 时,您很可能会发现以下任一个字节序列:

  • 0xe78886, indicating that your column actually contains UTF-8 encoded data: this usually happens when the character set of the database connection over which the text was originally inserted was set to latin1 but actually UTF-8 encoded data was sent.

    You must be seeing ? characters when fetching the data because something between the data storage and the display has been unable to transcode those bytes (however, given that MySQL thinks they represent 爆 and those characters are likely available in most character sets, it's unlikely that it's occurring within MySQL itself—unless you're explicitly adjusting the encoding information during retrieval).

    Anyway, if this is the case, you need to drop the encoding information from the column and then tell MySQL that the data is actually encoded as UTF-8. As documented under ALTER TABLE Syntax:

    Warning 

    The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:

    ALTER TABLE t1 CHANGE c1 c1 BLOB;
    ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
    

    The reason this works is that there is no conversion when you convert to or from BLOB columns.

  • 0x3f, indicating that the database does actually contain the literal character ? and your original data has been lost: this doesn't happen easily, since MySQL usually throws error 1366 if implicit transcoding results in loss of data. Perhaps there was some explicit transcoding in your insert statement?

    In this case, you need to convert the storage encoding to a suitable format, then update or re-insert the data:

    ALTER TABLE foo CONVERT TO utf8;
    UPDATE foo SET name = _utf8 '爆笑笑話' WHERE bar = 'foobar';
    

不幸的是,该表包含3f3f3f。所以某种程度上Rails与我的数据库不太兼容,数据已经丢失了。虽然不是非常关键,但仍然很遗憾。 - Peterdk
当我使用_utf8命令更新字段时,数据立即丢失。这正常吗?(尚未转换任何内容) - Peterdk
1
character_set_connection = utf8,character_set_database = latin1,字符集列为latin1。 - Peterdk
我认为,这是同一个软件包。(在服务器上运行命令行,所以是的) - Peterdk
很有意思。明天会进一步调查,但很快就要睡觉了! - eggyal
显示剩余5条评论

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接