Character set detection in Android

Question

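My Android app fetches SHOUTcast metadata and displays it. I'm running into a problem where the text shows up garbled when a non-English character set is used. How can I detect the character encoding and display the text correctly? Sorry if this is a loaded question; I'm not well versed in this subject.

The stream in question is: http://skully.hopto.org:8000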

- William Seemann

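It depends on the source of the data. For your link, you can open the page's HTML source, where you will see the line <meta content="text/html; charset=windows-1252" http-equiv="Content-Type">. This means the encoding is Windows-1252, and if you only use this site, you can hardcode this encoding name and always use it. - vortexwolf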

1 Answer

- gregko · Accepted Answer

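As vorrtex pointed out in his comment above, if your data arrives as well-formed HTML, you can learn its encoding from the <meta content="..."> tag, which is the best case. You can then convert it to an Android (or any other Java implementation) string with the following code: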

// assume you have your input data as byte array buf, and encoding
// something like "windows-1252", "UTF-8" or whatever
String str = new String(buf, encoding);
// now your string will display correctly
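
To make the <meta> case concrete, here is a minimal sketch of reading the declared charset out of the raw bytes and decoding with it. The class and method names (MetaCharsetDecoder, findDeclaredCharset, decode) and the 4096-byte probe window are illustrative assumptions, and the regular expression is a rough match for the tag shown in the comment, not a full HTML parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and limits are assumptions for illustration only.
final class MetaCharsetDecoder {
    // Rough match for charset=NAME inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    public static String findDeclaredCharset(byte[] buf) {
        // ISO-8859-1 maps each byte to exactly one char, so the ASCII markup
        // is readable regardless of the real encoding of the page.
        int probeLength = Math.min(buf.length, 4096);
        String probe = new String(buf, 0, probeLength, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(probe);
        return m.find() ? m.group(1) : null;
    }

    /** Decodes buf with the declared charset, or with fallbackName if none is usable. */
    public static String decode(byte[] buf, String fallbackName) {
        String declared = findDeclaredCharset(buf);
        Charset charset;
        try {
            charset = Charset.forName(declared != null ? declared : fallbackName);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            charset = Charset.forName(fallbackName);
        }
        return new String(buf, charset);
    }
}
```

A call such as MetaCharsetDecoder.decode(buf, "UTF-8") would then use the declared encoding when one is present and UTF-8 otherwise; the choice of fallback is up to the caller.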

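If you don't know the encoding, because the data arrives as raw text in an unknown format, you can still try to guess it using statistical language models with ICU (International Components for Unicode), an IBM project released under an open-source license that permits commercial use: http://site.icu-project.org/. It provides Java and C++ libraries. I just added their Java JAR, ver. 51.2, to my Android project and it works very well. The code I use to recognize the character encoding of text files is: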

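import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.UnsupportedEncodingException;

// Reads the whole file and decodes it with the charset ICU considers most likely.
public static String readFileAsStringGuessEncoding(String filePath)
{
    String s = null;
    try {
        File file = new File(filePath);
        byte[] fileData = new byte[(int) file.length()];
        DataInputStream dis = new DataInputStream(new FileInputStream(file));
        try {
            dis.readFully(fileData);
        } finally {
            dis.close(); // close the stream even if reading fails
        }

        // detect() returns the best match, or null if no charset fits the data
        CharsetMatch match = new CharsetDetector().setText(fileData).detect();

        if (match != null) try {
            Lt.d("For file: " + filePath + " guessed enc: " + match.getName() + " conf: " + match.getConfidence());
            s = new String(fileData, match.getName());
        } catch (UnsupportedEncodingException ue) {
            s = null; // ICU named a charset this runtime cannot decode
        }
        if (s == null)
            s = new String(fileData); // fall back to the platform default charset
    } catch (Exception e) {
        Lt.e("Exception in readFileAsStringGuessEncoding(): " + e);
        e.printStackTrace();
    }
    return s;
}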

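Lt.d and Lt.e are just my shortcuts for Log.d(TAG, "blah..."). The code worked well on every test file I could come up with. My only worry was the APK size: icu4j-51_2.jar is over 9 MB, while my whole package was only 2.5 MB. But it is easy to isolate CharsetDetector and its dependencies, so in the end I added no more than 50 kB. The Java classes I needed to copy into my project from the ICU sources are all in the core/src/com/ibm/icu/text directory: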

CharsetDetector
CharsetMatch
CharsetRecog_2022
CharsetRecog_mbcs
CharsetRecog_sbcs
CharsetRecog_Unicode
CharsetRecog_UTF8
CharsetRecognizer

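In addition, CharsetRecog_sbcs.java has a protected member, "ArabicShaping as;", that wants to pull in a lot more classes, but it turns out not to be needed for charset recognition, so I commented it out. That's all. Hope it helps.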
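
If even the trimmed-down ICU classes are more than is wanted and the candidate encodings are known in advance, a much cruder heuristic can be sketched with the standard library alone: try a strict UTF-8 decode and fall back to a single-byte charset when the bytes are not valid UTF-8. This is not ICU's statistical detection, only a stand-in under the assumption that the data is either UTF-8 or one known legacy encoding; the class name Utf8OrFallback is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: assumes the input is either UTF-8 or the given fallback.
final class Utf8OrFallback {
    public static String decode(byte[] data, Charset fallback) {
        // REPORT makes the decoder throw on invalid bytes instead of
        // silently substituting U+FFFD replacement characters.
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: treat the bytes as the legacy charset.
            return new String(data, fallback);
        }
    }
}
```

Pure ASCII input decodes identically either way, and text in a single-byte encoding with accented characters rarely happens to form valid UTF-8, which is why the strict attempt goes first. It cannot tell two single-byte encodings apart; that is the case the ICU detector is meant for.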

Greg