Java text file encoding

Question

java encoding character-encoding text-files

12

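I have a text file that may be encoded as ANSI (with the ISO-8859-2 charset), UTF-8, or UCS-2 big- or little-endian.

Is there a way to detect the file's encoding so that I can read it correctly?

Or is it possible to read a file without specifying the encoding? (So that it reads the file as it is.)

(There are several programs that can detect and convert the encoding/format of text files.)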

- user

4 Answers

9

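UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If the mark is present, it is a pretty good bet that the file is in that encoding, but it is not a dead certainty. You may also find that a file is in one of those encodings but has no byte order mark.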
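As a rough illustration of the byte order mark check, here is a minimal sketch. The class name `BomSniffer` and the method `detect` are made up for this example. It only looks at the first two or three bytes and returns `null` when no mark is found, which, as noted, does not rule out any of these encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class BomSniffer {

    /**
     * Returns the charset implied by a leading byte order mark,
     * or null if the data does not start with a UTF-8 or UTF-16 BOM.
     */
    static Charset detect(byte[] head) {
        // UTF-8 BOM: EF BB BF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            // UTF-16 big-endian BOM: FE FF
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            // UTF-16 little-endian BOM: FF FE
            // (a UTF-32LE file also starts with FF FE; ignored here
            // because it is not one of the candidate encodings)
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        return null; // no BOM: the encoding is still unknown
    }
}
```

You would pass in the first few bytes of the file. If a charset comes back, remember to skip the mark itself when decoding the rest.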

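I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file were a valid text file in that encoding. The best you can do is check heuristically. Indeed, the Wikipedia page about it suggests that only byte 0x7f is invalid.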
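One possible heuristic, sketched below under my own assumptions, is to try a strict UTF-8 decode first and fall back to ISO-8859-2 only if the bytes are not well-formed UTF-8. This works because text in a single-byte encoding with non-ASCII characters rarely happens to be valid UTF-8. The class name `Utf8OrLatin2` is hypothetical, the sketch does not handle UTF-16 at all, and it assumes the JDK in use provides the ISO-8859-2 charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Utf8OrLatin2 {

    /** Decodes as UTF-8 if the bytes are well-formed UTF-8, otherwise as ISO-8859-2. */
    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw on invalid input instead of
            // silently substituting U+FFFD, which is the default for new String(...).
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the single-byte fallback.
            return new String(bytes, Charset.forName("ISO-8859-2"));
        }
    }
}
```

Pure ASCII input decodes to the same text either way, so the guess only matters once non-ASCII bytes appear.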

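There is no such thing as reading a file "as is" and still getting text out of it: a file is a sequence of bytes, so you have to apply a character encoding to decode those bytes into characters.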
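To make that concrete, here is a small sketch (the class `DecodeDemo` and its method are invented for illustration) that decodes the same two bytes with three different charsets. The bytes never change; only the interpretation does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class DecodeDemo {

    /** Interprets the given bytes as text in the given charset. */
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC5, (byte) 0x82};

        // One character, U+0142 (Polish l with stroke)
        System.out.println(decodeAs(raw, StandardCharsets.UTF_8));
        // One character, U+C582 (a Hangul syllable)
        System.out.println(decodeAs(raw, StandardCharsets.UTF_16BE));
        // Two characters, U+00C5 followed by the control character U+0082
        System.out.println(decodeAs(raw, StandardCharsets.ISO_8859_1).length());
    }
}
```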

- Jon Skeet

4

You can use ICU4J (http://icu-project.org/apiref/icu4j/).

Here is my code:

            // Requires ICU4J on the classpath:
            //   import com.ibm.icu.text.CharsetDetector;
            //   import com.ibm.icu.text.CharsetMatch;
            //   import java.nio.file.Files;

            String charset = "ISO-8859-1"; // Default charset, put whatever you want

            /*
             * Read the whole file into a byte array; `file` is a java.io.File.
             * Files.readAllBytes reads until end of file and closes the stream,
             * whereas a single FileInputStream.read(byte[]) call may return
             * before the array is full.
             */
            byte[] data = Files.readAllBytes(file.toPath());

            CharsetDetector detector = new CharsetDetector();
            detector.setText(data);

            // detect() returns the best match, or null if nothing matched
            CharsetMatch cm = detector.detect();

            if (cm != null) {
                int confidence = cm.getConfidence(); // 0 to 100
                System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
                // Here you have the encoding name and the confidence.
                // In my case, if the confidence is > 50 I return the detected encoding,
                // otherwise I return the default value.
                if (confidence > 50) {
                    charset = cm.getName();
                }
            }

Remember to add try/catch handling wherever it is needed.

I hope this helps.

- ssamuel68

0

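If your text file is a properly created Unicode text file, the byte order mark (BOM) should tell you everything you need to know. See here for more details about the BOM.

If it is not, you will have to use an encoding detection library.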

- Glen

- Jonathan Holloway · Accepted Answer

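Yes, there are a number of ways to do character encoding detection in Java. Take a look at jchardet, which is based on the Mozilla algorithm. There is also cpdetector, and a project by IBM called ICU4j. I would look at the latter, as it seems to be more reliable than the other two. They all work by statistical analysis of the binary file. ICU4j also provides a confidence level for the encoding it detects, so you can use that in the case above. It works pretty well.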