Java: How to determine the correct charset encoding of a stream

Question

java file encoding stream character-encoding

159

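With reference to the following thread:

Java App: Unable to read iso-8859-1 encoded file correctly

What is the best way to programmatically determine the correct charset encoding of an input stream or file?

I have tried using the following: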

File in =  new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());

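But on a file that I know to be encoded with ISO8859_1, the code above reports ASCII. That is wrong, and it prevents me from rendering the file's content back to the console correctly.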

- Joel

13

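Eduard is right: "You cannot determine the encoding of an arbitrary byte stream". All the other proposals give you ways (and libraries) to make a best guess. But in the end they are still guesses. - Mihai Nita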

11

Reader.getEncoding returns the encoding the reader was set up to use, which in your case is the default encoding. - Karol S

System.getProperty("file.encoding") returns a string. Example: FileInputStream fis = new FileInputStream(path); String encoding = System.getProperty("file.encoding"); - Sathvik

16 Answers

80

I have used this library, similar to jchardet, for detecting encodings in Java: https://github.com/albfernandez/juniversalchardet

- Luciano Fiandesio

7

I found this one to be more accurate: http://jchardet.sourceforge.net/ (I was testing on Western European language documents encoded in ISO 8859-1, Windows-1252 and UTF-8). - Joel

2

This juniversalchardet does not work. It reports UTF-8 most of the time, even when the file is entirely windows-1212 encoded. - Brain

It cannot detect the Eastern European Windows-1250 encoding. - Bernhard Döbler

I tried to detect the charset of the file "https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt" with the following snippet, but got null: UniversalDetector ud = new UniversalDetector(null); byte[] bytes = FileUtils.readFileToByteArray(new File(file)); ud.handleData(bytes, 0, bytes.length); ud.dataEnd(); detectedCharset = ud.getDetectedCharset(); - Rohit Verma

2

Juniversalchardet does not support ISO-8859-1, which is one of the most common charsets. - Thomas

40

Check this out: http://site.icu-project.org/ (icu4j). They have libraries for detecting the charset from an IOStream. It could be as simple as this:

BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();

if (cm != null) {
   reader = cm.getReader();
   charset = cm.getName();
} else {
   // UnsupportedCharsetException has no no-arg constructor; it requires a charset name
   throw new UnsupportedCharsetException("undetected");
}

- user345883

2

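I tried it, but it fails badly: I created two text files in Eclipse, both containing "öäüß". One was set to ISO encoding and the other to UTF8, and both were detected as UTF8! Then I tried a file saved somewhere on my hard disk (Windows), and this one was detected correctly as "windows-1252". Then I created two new files, one edited with the editor and the other with Notepad++. In both cases "Big5" (Chinese) was detected! - dermoritz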

2

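EDIT: OK, I should have checked cm.getConfidence() - for my short "äöüß" the confidence is 10. So I have to decide what confidence is good enough - but that is absolutely fine for this endeavour (charset detection). - dermoritz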

2

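Direct link to the sample code: http://userguide.icu-project.org/conversion/detection - james.garriss

The main problem with using ICU4J for charset detection is that its JAR file is 13MB. I have extracted the chardet feature from ICU4J and packaged it as a standalone 75KB library at https://github.com/sigpwned/chardet4j. Same code, smaller footprint. - sigpwned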

31

Here are my favorites:

TikaEncodingDetector

Dependency:

<dependency>
  <groupId>org.apache.any23</groupId>
  <artifactId>apache-any23-encoding</artifactId>
  <version>1.1</version>
</dependency>

Sample:

public static Charset guessCharset(InputStream is) throws IOException {
  return Charset.forName(new TikaEncodingDetector().guessEncoding(is));    
}

GuessEncoding

Dependency:

<dependency>
  <groupId>org.codehaus.guessencoding</groupId>
  <artifactId>guessencoding</artifactId>
  <version>1.4</version>
  <type>jar</type>
</dependency>

Sample:

  public static Charset guessCharset2(File file) throws IOException {
    return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
  }

- Benny Code

4

Note: TikaEncodingDetector 1.1 is actually a thin wrapper around the CharsetDetector class of ICU4J 3.4. - Stephan

Unfortunately, neither library works properly. In one case, a UTF-8 file with German Umlaute was identified as ISO-8859-1 and US-ASCII. - Brain

1

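@Brain: Is the file you tested actually in UTF-8 format, and does it include a BOM (https://en.wikipedia.org/wiki/Byte_order_mark)? - Benny Code

@BennyNeugebauer The file is UTF-8 without a BOM. I checked it with Notepad++, and also verified it by changing the encoding and confirming that the "Umlaute" were still visible. - Brain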

15

Which library to use?

As of this writing, three libraries emerge: GuessEncoding, ICU4j and juniversalchardet.

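I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.

How to tell which one has detected the right charset (or as close as possible)?

It is impossible to certify the charset detected by any of the libraries above. However, it is possible to ask each of them in turn and score the returned responses.

How to score the returned response?

Each response is worth one point. The more points a charset collects, the more confidence you can have in it. This is a simple scoring method. You can devise others.

Is there any sample code?

Here is a full snippet implementing the strategy described above.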

public static String guessEncoding(InputStream input) throws IOException {
    // Load input data
    long count = 0;
    int n = 0, EOF = -1;
    byte[] buffer = new byte[4096];
    ByteArrayOutputStream output = new ByteArrayOutputStream();

    while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
        output.write(buffer, 0, n);
        count += n;
    }
    
    if (count > Integer.MAX_VALUE) {
        throw new RuntimeException("Inputstream too large.");
    }

    byte[] data = output.toByteArray();

    // Detect encoding
    Map<String, int[]> encodingsScores = new HashMap<>();

    // * GuessEncoding
    updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());

    // * ICU4j
    CharsetDetector charsetDetector = new CharsetDetector();
    charsetDetector.setText(data);
    charsetDetector.enableInputFilter(true);
    CharsetMatch cm = charsetDetector.detect();
    if (cm != null) {
        updateEncodingsScores(encodingsScores, cm.getName());
    }

    // * juniversalchardet
    UniversalDetector universalDetector = new UniversalDetector(null);
    universalDetector.handleData(data, 0, data.length);
    universalDetector.dataEnd();
    String encodingName = universalDetector.getDetectedCharset();
    if (encodingName != null) {
        updateEncodingsScores(encodingsScores, encodingName);
    }

    // Find winning encoding
    Map.Entry<String, int[]> maxEntry = null;
    for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
        if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
            maxEntry = e;
        }
    }

    String winningEncoding = maxEntry.getKey();
    //dumpEncodingsScores(encodingsScores);
    return winningEncoding;
}

private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
    String encodingName = encoding.toLowerCase();
    int[] encodingScore = encodingsScores.get(encodingName);

    if (encodingScore == null) {
        encodingsScores.put(encodingName, new int[] { 1 });
    } else {
        encodingScore[0]++;
    }
}    

private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
    System.out.println(toString(encodingsScores));
}

private static String toString(Map<String, int[]> encodingsScores) {
    String GLUE = ", ";
    StringBuilder sb = new StringBuilder();

    for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
        sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
    }
    int len = sb.length();
    sb.delete(len - GLUE.length(), len);

    return "{ " + sb.toString() + " }";
}

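Improvements: The guessEncoding method reads the input stream entirely. For large input streams this can be a concern, because all of these libraries would then process the whole stream, and detecting the charset would take a long time.

It is possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.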
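
The idea of sampling only the head of the stream could be sketched as follows. This is a minimal illustration, not part of the original answer: the class name `DetectionSample`, the method `peek` and the 8192-byte default are hypothetical choices, and the stream is assumed to be wrapped in a `BufferedInputStream` so it can be rewound after sampling. Note that a sample may end in the middle of a multi-byte sequence, which some detectors may handle less gracefully than others.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.util.Arrays;

final class DetectionSample {
    /** Hypothetical cap on how many bytes the detectors get to see. */
    static final int DEFAULT_SAMPLE_SIZE = 8192;

    /**
     * Returns up to maxBytes from the start of the stream, then rewinds the
     * stream so the caller can still read the full content afterwards.
     */
    static byte[] peek(BufferedInputStream in, int maxBytes) throws IOException {
        in.mark(maxBytes); // remember the current position
        byte[] buf = new byte[maxBytes];
        int total = 0;
        while (total < maxBytes) {
            int n = in.read(buf, total, maxBytes - total);
            if (n == -1) {
                break; // end of stream reached before the sample was full
            }
            total += n;
        }
        in.reset(); // rewind to the marked position
        return Arrays.copyOf(buf, total);
    }
}
```

The returned array can be passed to the detectors in place of the fully loaded `data` array, and the rewound stream can then be read with whichever charset wins.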

- Stephan

14

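You can validate the file against a particular charset by decoding it with a CharsetDecoder and watching for "malformed-input" or "unmappable-character" errors. Of course, this only tells you when a charset is wrong; it does not tell you that it is correct. For that, you need a basis of comparison to evaluate the decoded results: for example, do you know beforehand whether the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.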
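
A minimal sketch of that validation, using only the standard `java.nio.charset` API. The class and method names (`CharsetValidator`, `canDecode`) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class CharsetValidator {
    /**
     * Returns false if the bytes are provably invalid in the given charset.
     * A result of true only means "not ruled out", not "correct".
     */
    static boolean canDecode(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting
                   .onUnmappableCharacter(CodingErrorAction.REPORT) // throw instead of substituting
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, the byte 0xE4 followed by an ASCII letter is rejected as UTF-8 but accepted as ISO-8859-1. The reverse check is not informative: ISO-8859-1 assigns a character to every byte value, so it accepts any input, including valid UTF-8.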

- Zach Scrivena

13

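As far as I know, there is no general library in this context that suits every type of problem. So, for each problem, you should test the existing libraries and select the one that best satisfies your problem's constraints, but often none of them is appropriate. In these cases you can write your own encoding detector! As I have written...

I have written a meta Java tool for detecting the charset encoding of HTML web pages, using IBM ICU4j and Mozilla JCharDet as its built-in components. Here you can find my tool; please read the README section before anything else. You can also find some basic concepts of this problem in my paper and in its references.

Below are some helpful notes that I gathered during my work: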

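- Charset detection is not a foolproof process, because it is essentially based on statistical data, and what actually happens is guessing, not detecting.
- icu4j is the main tool in this context, by IBM, imho.
- Both TikaEncodingDetector and Lucene-ICU4j use icu4j, and their accuracy showed no meaningful difference from icu4j in my tests (at most 1%, as I remember).
- icu4j is much more general than jchardet. icu4j is just a bit biased toward the IBM family of encodings, while jchardet is strongly biased toward utf-8.
- Due to the widespread use of UTF-8 in the HTML world, jchardet is a better choice than icu4j overall, but it is not the best choice!
- icu4j is great for East Asian specific encodings such as EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family of encodings.

Both icu4j and jchardet have trouble with HTML pages encoded in Windows-1251 and Windows-1256. Windows-1251, also known as cp1251, is widely used for Cyrillic-based languages such as Russian; Windows-1256, also known as cp1256, is widely used for Arabic.

Almost all encoding detection tools use statistical methods, so the accuracy of the output strongly depends on the size and the contents of the input.

Some encodings are essentially the same apart from partial differences, so in some cases the guessed or detected encoding may be wrong and yet, at the same time, right! This is the case for Windows-1252 and ISO-8859-1; see the last paragraph of section 5.2 of my paper.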

- faghani

1

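This question is swamped by lots of poor and duplicate answers. Thank you very much for the best answer so far. - Douglas Held

@DouglasHeld Glad I could help. This thread is a good example of the Matthew effect on Stack Overflow! - faghani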

6

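The libraries above are simple BOM detectors, which of course only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/, which does scan the text.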
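
To illustrate what BOM-only detection amounts to, here is a small sketch. It is not taken from any of the libraries discussed; `BomSniffer` and `fromBom` are hypothetical names, and UTF-32 BOMs are deliberately left out to keep it short (the UTF-32LE BOM starts with the same two bytes as the UTF-16LE one).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {
    /** Returns the charset indicated by a leading BOM, or null if there is none. */
    static Charset fromBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null; // no BOM: nothing can be concluded
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false; // compare as unsigned bytes
        }
        return true;
    }
}
```

A null result is the common case for ISO-8859-1 and Windows-125x files, which never carry a BOM, and for UTF-8 files written without one.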

- Lorrat

21

Just a tip, but there is no "above" on this site - consider stating which libraries you are referring to. - McDowell

5

If you use ICU4J (http://icu-project.org/apiref/icu4j/), here is my code:

String charset = "ISO-8859-1"; // Default charset, put whatever you want

// Read the whole file into a byte array. Files.readAllBytes (java.nio.file.Files)
// keeps reading until every byte is loaded and closes the file afterwards, whereas
// a single FileInputStream.read(byte[]) call may return before the array is full.
byte[] data = Files.readAllBytes(file.toPath());

CharsetDetector detector = new CharsetDetector();
detector.setText(data);

CharsetMatch cm = detector.detect();

if (cm != null) {
    int confidence = cm.getConfidence();
    System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
    //Here you have the encode name and the confidence
    //In my case if the confidence is > 50 I return the encode, else I return the default value
    if (confidence > 50) {
        charset = cm.getName();
    }
}

Remember to add try-catch blocks wherever they are needed.

I hope this works for you.

- ssamuel68

In my opinion, this answer has room for improvement. If you want to use ICU4j, try this one instead: https://dev59.com/InRB5IYBdhLWcg3w1Khe#4013565. - Stephan

4

I found a nice third-party library that can detect the actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding I have not tested it extensively, but it seems to work.

- falcon

1

The link to the "GuessEncoding" project website is: https://xircles.codehaus.org/p/guessencoding. - Benny Code

- Eduard Wirch · Accepted Answer

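You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings. An encoding is a mapping between a byte value and its representation. So every encoding "could" be the right one.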
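
A short sketch makes this concrete: the same bytes decode without any error under several charsets, each time to different text. The class and method names below are illustrative only.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class SameBytesDifferentText {
    /** Decodes the same bytes with each candidate charset, keyed by charset name. */
    static Map<String, String> decodeWithEach(byte[] data, Charset... candidates) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Charset cs : candidates) {
            // new String(byte[], Charset) never throws on bad input; it substitutes U+FFFD
            result.put(cs.name(), new String(data, cs));
        }
        return result;
    }
}
```

With the two bytes 0xC3 0xA4, UTF-8 yields the single character U+00E4 (a with diaeresis), while ISO-8859-1 yields the two characters U+00C3 U+00A4. Both results are legal strings, and nothing in the bytes themselves says which one the author meant.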

The getEncoding() method returns the encoding that was set up for the stream (read the JavaDoc). It will not guess the encoding for you.
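
The following sketch shows that behaviour: the value reported by getEncoding() follows the charset passed to the constructor, regardless of what the bytes actually contain. The wrapper class and method names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

final class GetEncodingDemo {
    /** Returns what getEncoding() reports for a reader opened with the given charset. */
    static String configuredEncoding(byte[] data, Charset cs) throws IOException {
        try (InputStreamReader reader = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            // Reports the charset the reader was constructed with (by its
            // historical name, where one exists); the data is not inspected.
            return reader.getEncoding();
        }
    }
}
```

When no charset is passed to the InputStreamReader constructor, as in the question, the reader uses the JVM default charset, and that default is what getEncoding() reports.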

Some streams tell you which encoding was used to create them: XML, HTML. But an arbitrary byte stream does not.
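
For XML, the declared encoding can be read with the standard StAX API, as in this sketch. The class and method names are illustrative, and the document is assumed to carry an XML declaration with an encoding attribute.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class XmlDeclaredEncoding {
    /** Returns the encoding named in the XML declaration, or null if none is declared. */
    static String declaredEncoding(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            return reader.getCharacterEncodingScheme();
        } finally {
            reader.close(); // releases parser resources; does not close the underlying stream
        }
    }
}
```

The declaration is only a claim made by whoever wrote the file, so a mislabelled document will still be reported with the wrong encoding.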

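Anyway, you could try to guess the encoding on your own if you have to. Every language has a typical frequency for each character. In English the character e appears very often, but ê appears very seldom. In an ISO-8859-1 stream there are usually no 0x00 bytes, but a UTF-16 stream has a lot of them.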
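
The 0x00 observation can be turned into a crude heuristic, sketched below. The 20% threshold is an arbitrary assumption chosen for illustration, not a tuned value, and the names are hypothetical.

```java
final class NulByteHeuristic {
    /** Hypothetical cut-off: above this share of 0x00 bytes, suspect UTF-16. */
    static final double UTF16_THRESHOLD = 0.2;

    /** Fraction of bytes equal to 0x00 in the sample (0.0 for an empty sample). */
    static double nulRatio(byte[] sample) {
        if (sample.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte b : sample) {
            if (b == 0) {
                zeros++;
            }
        }
        return (double) zeros / sample.length;
    }

    /** A guess, not a verdict: binary files also contain many 0x00 bytes. */
    static boolean looksLikeUtf16(byte[] sample) {
        return nulRatio(sample) > UTF16_THRESHOLD;
    }
}
```

This works for mostly Latin text, where every other UTF-16 byte is zero. UTF-16 text in scripts such as Chinese contains few zero bytes, so the heuristic would miss it.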

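Or: you could ask the user. I have seen applications that present a snippet of the file in different encodings and ask you to select the "correct" one.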