Java: reading UTF-8 characters with Scanner, can't print the text

3

I can convert a String to a UTF-8 encoded byte array, but I can't convert it back to a String that matches the original.

public static void main(String[] args) {

    Scanner h = new Scanner(System.in);
    System.out.println("INPUT : ");
    String stringToConvert = h.nextLine();
    byte[] theByteArray = stringToConvert.getBytes();

    System.out.println(theByteArray);
    theByteArray.toString();
    String s = new String(theByteArray);

    System.out.println(""+s);
}

How can I print theByteArray as a String?


Seems to work fine for me: http://ideone.com/rcvXl - mellamokb
Please provide test input / expected output / actual output. - Sam DeHaan
2 Answers

12
String s = new String(theByteArray);

should really be

String s = new String(theByteArray, Charset.forName("UTF-8"));

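The underlying problem is that the String constructor is not smart. It cannot tell which charset the bytes were encoded in, so new String(byte[]) decodes them with the platform default charset, which on many systems (particularly before Java 18) is a single-byte encoding such as ISO-8859-1 or windows-1252 rather than UTF-8. That is why plain A-Za-z looks right but everything else starts to fail.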

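A byte ranges from -128 to 127, so a single byte cannot hold every character; UTF-8 encodes each non-ASCII character as a sequence of two to four bytes that must be decoded together. A single-byte charset cannot know this and turns each byte into its own character. Basic alphanumerics always work because they fall in the ASCII range, where these encodings agree.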

Example:

String text = "こんにちは";
byte[] array = text.getBytes("UTF-8");
String s = new String(array, Charset.forName("UTF-8"));
System.out.println(s); // Prints as expected
String sISO = new String(array, Charset.forName("ISO-8859-1"));
System.out.println(sISO); // Prints mojibake such as 'ããã«ã¡ã¯'
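
To make the byte-level picture concrete, here is a minimal sketch using the Vietnamese word "Bác" from the comments further down. The class name CharsetDemo and its helper methods are made up for illustration; only the JDK calls are real. The text is written with a \u escape so that the encoding of the source file itself does not affect the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Encodes the text as UTF-8 and formats each byte as two hex digits.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so that negative byte values print as 80..FF.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text as UTF-8, then decodes those bytes with the given charset.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "B\u00E1c"; // "Bác"

        // The accented letter takes two bytes, the other letters one each.
        System.out.println(utf8Hex(text)); // 42 C3 A1 63

        // Decoding with the same charset restores the original text.
        System.out.println(decodeUtf8BytesAs(text, StandardCharsets.UTF_8).equals(text)); // true

        // A single-byte charset yields one char per byte: 4 chars instead of 3.
        System.out.println(decodeUtf8BytesAs(text, StandardCharsets.ISO_8859_1).length()); // 4
    }
}
```

The StandardCharsets constants (available since Java 7) avoid both the string lookup in Charset.forName and the checked UnsupportedEncodingException thrown by getBytes(String).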

2
There are several problems with the code provided:
  1. You are not ensuring that you are getting the UTF-8 byte array from that String.

    byte[] theByteArray = stringToConvert.getBytes();
    

    returns a byte array with the default encoding on the given platform, as described by the JavaDoc. What you actually want to do is the following:

    byte[] theByteArray = stringToConvert.getBytes("UTF-8");
    
  2. You should check the documentation for System.out.println():

    System.out.println(theByteArray);
    

    is calling System.out.println(Object x), which prints the result of x.toString(). For an array, the inherited Object.toString() returns the type name, an '@', and the object's hash code in hexadecimal (here [B stands for byte[]), not the array's contents.

    So when you see output of the form:

    INPUT :

    [B@5f1121f6

    inputText

    What you are seeing is the default toString() output for theByteArray ([B@ followed by its hash code) and then the given input line of text.

  3. The call to toString() has no effect as written. Remember, Strings in Java are immutable; none of String's methods will alter the String. theByteArray.toString(); returns a string representation of theByteArray, and the returned value is discarded unless you assign it to a variable:

    String arrayAsString = theByteArray.toString();
    

    However, as previously described, the returned String will be the type-and-hash-code form (such as [B@5f1121f6), not the contents of theByteArray. In order to print out the contents of theByteArray as text, you will need to decode it into a String:

    String convertedString = new String(theByteArray, Charset.forName("UTF-8"));
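
The difference between the three ways of "printing" a byte array discussed in the points above can be sketched as follows. The class ByteArrayPrinting and its two helpers are hypothetical names used only for this illustration, and the exact hash code after [B@ will differ from run to run.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteArrayPrinting {

    // Lists the numeric value of every byte, e.g. "[97, 98, 99]".
    static String contents(byte[] bytes) {
        return Arrays.toString(bytes);
    }

    // Interprets the bytes as UTF-8 text.
    static String decoded(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = "abc".getBytes(StandardCharsets.UTF_8);

        // Inherited Object.toString(): type tag plus hash code, something like [B@5f1121f6.
        System.out.println(bytes.toString());

        // The numeric contents of the array.
        System.out.println(contents(bytes)); // [97, 98, 99]

        // The text the bytes encode.
        System.out.println(decoded(bytes)); // abc
    }
}
```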
    

Assuming your requirement is to print the converted String and then the original String, your code should look something like this:

public static void main(String[] args) {

    Scanner h = new Scanner(System.in);
    System.out.println("INPUT : ");
    String stringToConvert = h.nextLine();

    try {
        // Array of the UTF-8 representation of the given String
        byte[] theByteArray;
        theByteArray = stringToConvert.getBytes("UTF-8");

        // The converted String
        System.out.println(new String(theByteArray, Charset.forName("UTF-8")));
    } catch (UnsupportedEncodingException e) {
        // We may provide an invalid character set
        e.printStackTrace();
    }

    // The original String
    System.out.println(stringToConvert);
}
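
If non-ASCII input still comes out as squares or question marks with code like this, one possible cause (an assumption, since it depends on the environment) is that the text is already damaged before getBytes is called: new Scanner(System.in) decodes the console bytes with the platform default charset, and System.out encodes with its own default. The sketch below names UTF-8 explicitly on both sides. It assumes the terminal really sends and displays UTF-8, and the class name Utf8Console and the helper readFirstLine are made up for illustration.

```java
import java.io.InputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class Utf8Console {

    // Reads the first line from the stream, decoding the bytes as UTF-8.
    // Returns an empty string if the stream has no input.
    static String readFirstLine(InputStream source) {
        Scanner in = new Scanner(source, "UTF-8");
        return in.hasNextLine() ? in.nextLine() : "";
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: the console expects UTF-8 output.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");

        out.println("INPUT : ");
        String stringToConvert = readFirstLine(System.in);

        // Round trip through a UTF-8 byte array.
        byte[] theByteArray = stringToConvert.getBytes(StandardCharsets.UTF_8);
        out.println(new String(theByteArray, StandardCharsets.UTF_8));

        // The original String
        out.println(stringToConvert);
    }
}
```

If the console uses a different encoding (for example a legacy Windows code page), the charset names passed to Scanner and PrintStream would need to match that encoding instead.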

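Thanks a lot. What you say is easy to follow, but this code doesn't work. For example, with my sample input Bác (Vietnamese), after converting back to a String I only see squares -.-'. - famfamfam
You are right. I corrected my mistake using Joe's answer. - Jake Greene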
