First, some examples:
import java.nio.charset.Charset;

String name = "façade";
System.out.println(String.format("String '%s': %d", name, name.length()));
for (String cs : new String[] {"UTF-8", "UTF-16", "UTF-16BE"}) {
    byte[] nameBytes = name.getBytes(Charset.forName(cs));
    // Decoding via a Charset (rather than a charset name String) avoids
    // the checked UnsupportedEncodingException.
    System.out.println(String.format("%s: %d / %d", cs, nameBytes.length,
            new String(nameBytes, Charset.forName(cs)).length()));
}
With this output:
String 'façade': 6 ---> 6 characters with one outside ASCII range
UTF-8: 7 / 6 ---> 'ç' requires 2 bytes, the others only one
UTF-16: 14 / 6 ---> 2 x 6 bytes for code points + 2 bytes for BOM
UTF-16BE: 12 / 6 ---> no need to embed the BOM here => 2 x 6 bytes are enough
Comments:
- Always specify the charset, in both directions (when encoding and when decoding).
- Regarding the BOM, see Byte order mark.
- Quoting from Unicode Character Representations: the char data type (and therefore the value that a Character object encapsulates) is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities.
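That last point matters in practice: code points above U+FFFF do not fit in one 16-bit char, so String.length() counts char units, not characters. A minimal illustration (the emoji U+1F600 is just an example character outside the BMP):

```java
// A character outside the Basic Multilingual Plane, such as U+1F600,
// occupies two char units (a surrogate pair) in a Java String:
String s = "a\uD83D\uDE00";                          // "a" + U+1F600
System.out.println(s.length());                      // 3 (char units)
System.out.println(s.codePointCount(0, s.length())); // 2 (code points)
```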
The question here is really about the charset the database uses. If it is UTF-8, then when you approach the 200-byte limit you must check character by character: with UTF-8 you cannot truncate a string at an arbitrary byte offset, because that offset may fall in the middle of a multi-byte character, and the result is unpredictable (typically a replacement character or a decoding error).
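One way to truncate safely is to exploit the UTF-8 byte layout: every continuation byte matches the bit pattern 10xxxxxx, so you can back up from the byte limit until you reach a character boundary. A minimal sketch (the helper name truncateUtf8 and the class name are my own, not from any library):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Truncate {
    // Hypothetical helper: shorten s so that its UTF-8 encoding fits
    // within maxBytes, without cutting inside a multi-byte sequence.
    static String truncateUtf8(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return s; // already fits, nothing to do
        }
        int end = maxBytes;
        // Continuation bytes have the form 10xxxxxx (0x80..0xBF);
        // back up until 'end' sits on a lead byte or an ASCII byte.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'ç' is 2 bytes in UTF-8, so a 3-byte limit must stop after "fa"
        System.out.println(truncateUtf8("façade", 3)); // prints "fa"
    }
}
```

Truncating by code points is still not ideal for display purposes (it can split combining sequences), but it guarantees the result is valid UTF-8 within the byte budget.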