How to check whether a string is valid UTF-8 in Java

Question

java encoding utf-8

48

How can I check whether a string is in valid UTF-8 format?

- Michael Bavin

6

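Do you mean whether a byte[] is validly encoded? - bestsss

The simplest approach is probably to decode it and re-encode it, then check that you get the same result. That holds in almost all cases. - Peter Lawrey

@Peter, that does not always work, because some characters can be encoded with different byte sequences. Both byte sequences are correct and encode the same character, but the bytes differ. - Jesper

@Jesper, if the data was encoded by Java in the first place, it will be the same. It depends on what the OP really wants to test. By the way, in Java the \0 character is encoded as two bytes. ;) - Peter Lawrey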
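
The decode-and-re-encode idea from the comments above can be sketched as follows. This is only a minimal illustration, not a definitive check: the class and method names (`Utf8RoundTrip`, `roundTrips`) are hypothetical, and the sketch relies on `new String(bytes, StandardCharsets.UTF_8)` replacing malformed input with U+FFFD, so bytes that are not well-formed UTF-8 do not survive the round trip.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, named here for illustration only.
class Utf8RoundTrip {
    // Decode the bytes as UTF-8, re-encode the result, and compare with the input.
    // Malformed sequences are replaced by U+FFFD during decoding, so the
    // re-encoded bytes differ from the input whenever the input was malformed.
    static boolean roundTrips(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] valid = "A\u00ea\u00f1\u00fcC".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = {(byte) 0x41, (byte) 0xea, (byte) 0x43};  // "AêC" in ISO-8859-1
        byte[] twoByteNul = {(byte) 0xc0, (byte) 0x80};           // two-byte form of U+0000

        System.out.println(roundTrips(valid));       // true
        System.out.println(roundTrips(latin1));      // false
        System.out.println(roundTrips(twoByteNul));  // false
    }
}
```

The last case is the kind of alternative byte sequence raised in the comments: the two-byte form of `\0` is not accepted as standard UTF-8, so it fails the round trip rather than passing as a second valid encoding.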

2 Answers

4

The following is excerpted from the official Java tutorial, available at https://docs.oracle.com/javase/tutorial/i18n/text/string.html

The StringConverter program starts by creating a String containing Unicode characters:
String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
When printed, the String named original appears as:
AêñüC
To convert the String object to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the specified encoding is unsupported:
try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();

    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
The StringConverter program prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes. The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file, UnicodeFormatter.java. Here is the printBytes method:
public static void printBytes(byte[] array, String name) {
    for (int k = 0; k < array.length; k++) {
        System.out.println(name + "[" + k + "] = " + "0x" +
            UnicodeFormatter.byteToHex(array[k]));
    }
}
The output of the printBytes method follows. Note that only the first and last bytes, the A and C characters, are the same in both arrays:
utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
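
The excerpt depends on `UnicodeFormatter.byteToHex`, which is not shown here, and on the platform default charset. A self-contained sketch that produces the same kind of output might look like the following. The `ByteDump` class and its `byteToHex` and `dump` methods are hypothetical stand-ins, and ISO-8859-1 is used explicitly on the assumption that it matches the default charset behind the tutorial's `defaultBytes` output.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the tutorial's UnicodeFormatter, for illustration only.
class ByteDump {
    // Format one byte as two lowercase hex digits; the mask avoids sign extension.
    static String byteToHex(byte b) {
        return String.format("%02x", b & 0xff);
    }

    // Build one "name[k] = 0x.." line per byte, like the tutorial's printBytes.
    static String dump(byte[] array, String name) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < array.length; k++) {
            sb.append(name).append('[').append(k).append("] = 0x")
              .append(byteToHex(array[k])).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";
        // StandardCharsets constants are always available, so no try block is needed.
        System.out.print(dump(original.getBytes(StandardCharsets.UTF_8), "utf8Bytes"));
        System.out.print(dump(original.getBytes(StandardCharsets.ISO_8859_1), "latin1Bytes"));
    }
}
```

Naming the charset explicitly keeps the output the same on every machine, whereas `getBytes()` with no argument depends on the platform default.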

- Bhanu PS Kushwah

- DArkO · Accepted Answer

Only byte data can be checked. Once you have constructed a String, it is already held internally as UTF-16.

Likewise, only a byte array can be UTF-8 encoded.
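
Checking a byte array along these lines could look like the sketch below. It is an illustration under stated assumptions rather than the answer's own method: the `Utf8Validator` class and its two method names are hypothetical. It uses a `CharsetDecoder` configured with `CodingErrorAction.REPORT`, so malformed input raises a `CharacterCodingException` instead of being silently replaced. The second method covers the String side: a String containing an unpaired surrogate cannot be encoded as UTF-8, which `CharsetEncoder.canEncode` reports.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here for illustration only.
class Utf8Validator {
    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns true if the String can be encoded as UTF-8,
    // i.e. it contains no unpaired surrogate.
    static boolean isEncodableAsUtf8(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        byte[] valid = "A\u00ea\u00f1\u00fcC".getBytes(StandardCharsets.UTF_8);
        byte[] truncated = {(byte) 0x41, (byte) 0xc3};  // lead byte with no continuation byte

        System.out.println(isValidUtf8(valid));          // true
        System.out.println(isValidUtf8(truncated));      // false
        System.out.println(isEncodableAsUtf8("Hello"));  // true
    }
}
```

A new decoder is created on each call because `CharsetDecoder` instances are not safe for concurrent use.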

Here is a common example of a UTF-8 conversion.

import java.nio.charset.StandardCharsets;

// The Unicode escapes spell "Hello".
String myString = "\u0048\u0065\u006C\u006C\u006F World";
System.out.println(myString);

// StandardCharsets.UTF_8 is always available, so no
// UnsupportedEncodingException has to be caught.
byte[] myBytes = myString.getBytes(StandardCharsets.UTF_8);

// Print each byte as a signed decimal value.
for (byte b : myBytes) {
    System.out.println(b);
}

If you do not know the encoding of a byte array, juniversalchardet is a library that can help you detect it.