How do I detect the character encoding of a text file?

80

I am trying to detect which character encoding a file uses.

I use the following code to get the standard encodings.

public static Encoding GetFileEncoding(string srcFile)
    {
      // *** Use Default of Encoding.Default (Ansi CodePage)
      Encoding enc = Encoding.Default;

      // *** Detect byte order mark if any - otherwise assume default
      byte[] buffer = new byte[5];
      FileStream file = new FileStream(srcFile, FileMode.Open);
      file.Read(buffer, 0, 5);
      file.Close();

      if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;
      else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.Unicode;
      else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        enc = Encoding.UTF32;
      else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;
      else if (buffer[0] == 0xFE && buffer[1] == 0xFF)      
        // 1201 unicodeFFFE Unicode (Big-Endian)
        enc = Encoding.GetEncoding(1201);      
      else if (buffer[0] == 0xFF && buffer[1] == 0xFE)      
        // 1200 utf-16 Unicode
        enc = Encoding.GetEncoding(1200);


      return enc;
    }

My first five bytes are 60, 118, 56, 46 and 49.

Is there a chart that shows which encoding matches those five leading bytes?


4
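A byte order mark should not be used to detect the encoding. There are cases where it is ambiguous which encoding is in use: UTF-16 LE and UTF-32 LE both start with the same two bytes. The BOM should only be used to detect byte order (hence the name). Also, strictly speaking, UTF-8 should not even have a byte order mark, and adding one can interfere with software that does not expect it. - Mark Byers
5
@Mark Byers: UTF-32 LE starts with the same two bytes as UTF-16 LE, but it is also followed by the bytes 00 00, which (I think) is very unlikely in UTF-16 LE. Also, in theory the BOM should only indicate byte order, as you say, but in practice it serves as a signature that identifies the encoding. See: http://www.unicode.org/faq/utf_bom.html#bom4 - Dan W
3
Mark Beyers: Your comment is completely wrong. The BOM is a bulletproof way to detect the encoding. UTF16 BE and UTF32 BE are not ambiguous. You should study the subject before writing comments. If a piece of software cannot handle a UTF8 BOM, it either dates from the 1980s or is badly programmed. Today all software should be able to handle and recognize BOMs. - Elmue
Possible duplicate of How to detect the encoding/codepage of a text file. - TarmoPikaro
1
Elmue has evidently never used batch filtering, concatenation, or stream redirection on plain text files. Handling or supporting BOMs in those scenarios is not realistic. - jstine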
9 Answers

91
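You cannot rely on the file having a BOM. UTF-8 does not require one, and non-Unicode encodings do not have a BOM at all. There are, however, other ways to detect the encoding.
The BOM for UTF-32 is 00 00 FE FF (BE) or FF FE 00 00 (LE).
But UTF-32 is easy to detect even without a BOM. Because Unicode code points are limited to U+10FFFF, every UTF-32 unit has the pattern 00 {00-10} xx xx (BE) or xx xx {00-10} 00 (LE). If the data length is a multiple of 4 and every unit follows one of these patterns, you can safely assume it is UTF-32. False positives are nearly impossible because 00 bytes are so rare in byte-oriented encodings.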
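The UTF-32 pattern check described above can be sketched roughly as follows. This is only an illustration: the class and method names (`Utf32Sniffer`, `GuessUtf32`) are made up for this example, and the choice to return null for data that fits both patterns (for instance, all-zero bytes) is an assumption rather than part of the answer.

```csharp
using System;

public static class Utf32Sniffer
{
    // Hypothetical helper: returns "UTF-32BE", "UTF-32LE", or null when the
    // data fits neither pattern (or, ambiguously, both).
    public static string GuessUtf32(byte[] data)
    {
        // The length must be a non-zero multiple of 4.
        if (data == null || data.Length == 0 || data.Length % 4 != 0)
            return null;

        bool couldBeBE = true;
        bool couldBeLE = true;

        for (int i = 0; i < data.Length; i += 4)
        {
            // Big-endian unit: 00 {00-10} xx xx
            if (data[i] != 0x00 || data[i + 1] > 0x10)
                couldBeBE = false;

            // Little-endian unit: xx xx {00-10} 00
            if (data[i + 3] != 0x00 || data[i + 2] > 0x10)
                couldBeLE = false;

            if (!couldBeBE && !couldBeLE)
                return null;
        }

        // Assumption: if both patterns fit, refuse to guess.
        if (couldBeBE && couldBeLE)
            return null;

        return couldBeBE ? "UTF-32BE" : "UTF-32LE";
    }
}
```

For example, `Utf32Sniffer.GuessUtf32(Encoding.UTF32.GetBytes("Hi"))` should report little-endian, because `Encoding.UTF32` is the little-endian variant in .NET.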
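US-ASCII has no BOM, but you don't need one. ASCII is easily identified by the absence of bytes in the 80-FF range.
The BOM for UTF-8 is EF BB BF, but you cannot rely on it. Many UTF-8 files have no BOM, especially if they originated on non-Windows systems.
However, if a file validates as UTF-8, you can safely assume it is UTF-8. The false positive rate is low.
Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence it is less than 1%; for a 12-byte sequence, less than 0.1%; for a 24-byte sequence, less than 1 in a million.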
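One way to run the ASCII and UTF-8 checks described above is to let .NET's own decoder do the validation. The sketch below assumes the whole sample is in memory and not cut off in the middle of a multi-byte sequence. The names `Utf8Validator`, `IsValidUtf8`, `HasNonAscii` and `LooksLikeUtf8` are invented for illustration.

```csharp
using System;
using System.Text;

public static class Utf8Validator
{
    // True when the bytes decode as UTF-8 without any invalid sequence.
    public static bool IsValidUtf8(byte[] data)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD for bad sequences.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetCharCount(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // True when at least one byte is in the 80-FF range, i.e. not pure ASCII.
    public static bool HasNonAscii(byte[] data)
    {
        foreach (byte b in data)
        {
            if (b >= 0x80)
                return true;
        }
        return false;
    }

    // Pure ASCII validates as UTF-8 trivially, so only non-ASCII data that
    // still validates counts as evidence for UTF-8.
    public static bool LooksLikeUtf8(byte[] data)
    {
        return HasNonAscii(data) && IsValidUtf8(data);
    }
}
```

If you only sample the start of a large file, a sequence truncated at the end of the sample will fail validation, so trimming the sample back to a character boundary first is probably worthwhile.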
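The BOM for UTF-16 is FE FF (BE) or FF FE (LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check for UTF-32 first. If you have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 is also a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely used to make this approach practical.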
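To make the ordering concrete, here is a minimal sketch of a BOM check that tests the 4-byte UTF-32 marks before the 2-byte UTF-16 marks. `BomSniffer` and `FromBom` are illustrative names, and returning null when no BOM is found is an assumption made for this example.

```csharp
using System.Text;

public static class BomSniffer
{
    // Returns the encoding implied by a leading BOM, or null if there is none.
    public static Encoding FromBom(byte[] b)
    {
        if (b == null)
            return null;

        // UTF-32 first: FF FE 00 00 would otherwise be mistaken for UTF-16LE.
        if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return new UTF32Encoding(false, true);   // little-endian
        if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return new UTF32Encoding(true, true);    // big-endian

        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return new UTF8Encoding(true);

        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return new UnicodeEncoding(false, true); // UTF-16LE
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return new UnicodeEncoding(true, true);  // UTF-16BE

        return null;
    }
}
```

A UTF-16LE file whose first character is U+0000 would also start with FF FE 00 00. That is rare in text files, but this sketch does not try to resolve it.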

XML

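If your file starts with the bytes 3C 3F 78 6D 6C (i.e. the ASCII characters "<?xml"), look for an encoding= declaration. If it is present, use that encoding. If it is absent, assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, look for that declaration rather than trying to guess the encoding.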
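A rough sketch of the XML check, assuming an ASCII-compatible encoding and a declaration that fits in the first 256 bytes (both are assumptions of this example, as are the names `XmlDeclarationSniffer` and `DeclaredEncodingName`). It does not handle the EBCDIC or UTF-16 forms of the declaration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlDeclarationSniffer
{
    // Matches encoding="..." or encoding='...' inside the leading declaration.
    // [^>]* cannot run past the "?>" that closes the declaration.
    private static readonly Regex EncodingAttribute = new Regex(
        @"^<\?xml[^>]*?\bencoding\s*=\s*[""']([A-Za-z][A-Za-z0-9._\-]*)[""']",
        RegexOptions.CultureInvariant);

    // Returns the declared encoding name, "UTF-8" when the declaration omits
    // it, or null when the data does not start with an ASCII "<?xml".
    public static string DeclaredEncodingName(byte[] head)
    {
        byte[] marker = { 0x3C, 0x3F, 0x78, 0x6D, 0x6C }; // "<?xml"
        if (head == null || head.Length < marker.Length)
            return null;

        for (int i = 0; i < marker.Length; i++)
        {
            if (head[i] != marker[i])
                return null;
        }

        // Only the start of the file is inspected (arbitrary 256-byte cap).
        int length = Math.Min(head.Length, 256);
        string text = Encoding.ASCII.GetString(head, 0, length);

        Match match = EncodingAttribute.Match(text);
        return match.Success ? match.Groups[1].Value : "UTF-8";
    }
}
```

The returned name could then be passed to `Encoding.GetEncoding`, which throws for names it does not recognize, so a caller would likely want to guard that call.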

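None of the above

There are hundreds of other encodings, which take more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.

A reasonable default

If you have ruled out the UTF encodings, and have neither an encoding declaration nor statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires an "ISO-8859-1" declaration to be interpreted as Windows-1252.) As the default Windows code page for English (and other popular languages such as Spanish, Portuguese, German, and French), it is the most commonly encountered encoding other than UTF-8.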
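As a sketch of that fallback, the helper below tries strict UTF-8 first and otherwise decodes as ISO-8859-1. The names `FallbackDecoder` and `DecodeWithFallback` are hypothetical. ISO-8859-1 is used here instead of Windows-1252 only because it should be available without registering extra code page providers; on newer .NET versions Windows-1252 may need `CodePagesEncodingProvider` registered first.

```csharp
using System.Text;

public static class FallbackDecoder
{
    // Decodes as UTF-8 when the bytes are valid UTF-8, otherwise as ISO-8859-1.
    public static string DecodeWithFallback(byte[] data)
    {
        // Strict decoder: throws on invalid byte sequences.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(data);
        }
        catch (DecoderFallbackException)
        {
            // Every byte maps to exactly one character, so this cannot fail.
            return Encoding.GetEncoding("iso-8859-1").GetString(data);
        }
    }
}
```

Both `{ 0x63, 0x61, 0x66, 0xE9 }` (Latin-1) and `{ 0x63, 0x61, 0x66, 0xC3, 0xA9 }` (UTF-8) should come back as "café".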


1
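OK, as I expected. Can you explain the UTF-8/UTF-16 difference? P.S. Thanks for the very helpful answer. +1 - Ira Baxter
2
For a UTF-16BE text file, if a certain percentage of the even-numbered bytes are zero (or, for UTF-16LE, the odd-numbered bytes), then there is a good chance the encoding is UTF-16. What do you think? - Dan W
1
UTF-8 validity can be detected quite well by checking bit patterns: the bit pattern of the first byte tells you exactly how many bytes follow, and the following bytes also have control bits that can be checked. The patterns are all shown here: https://ianthehenry.com/2015/1/17/decoding-utf-8/ - Nyerguds
2
@marsze This isn't my answer... and it isn't mentioned because the question is about detection, and as I said, you can't really detect simple one-byte-per-symbol encodings. I did personally post an answer here about (vaguely) identifying it. - Nyerguds
2
@marsze: OK, I've added a section on Latin-1. - dan04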

13

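If you are after a "simple" solution, you might find this class I put together useful:

http://www.architectshack.com/TextFileEncodingDetector.ashx

It does the BOM detection automatically first, and then tries to differentiate between Unicode encodings without a BOM and some other default encoding (generally Windows-1252, incorrectly labelled as Encoding.ASCII in .Net).
As noted above, a "heavier" solution involving NCharDet or MLang may be more appropriate, and as I note on the overview page of this class, it is best to provide some form of interactivity with the user if at all possible, because there simply is no 100% detection rate!
Snippet in case the site is offline: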
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace KlerksSoft
{
    public static class TextFileEncodingDetector
    {
        /*
         * Simple class to handle text file encoding woes (in a primarily English-speaking tech 
         *      world).
         * 
         *  - This code is fully managed, no shady calls to MLang (the unmanaged codepage
         *      detection library originally developed for Internet Explorer).
         * 
         *  - This class does NOT try to detect arbitrary codepages/charsets, it really only
         *      aims to differentiate between some of the most common variants of Unicode 
         *      encoding, and a "default" (western / ascii-based) encoding alternative provided
         *      by the caller.
         *      
         *  - As there is no "Reliable" way to distinguish between UTF-8 (without BOM) and 
         *      Windows-1252 (in .Net, also incorrectly called "ASCII") encodings, we use a 
         *      heuristic - so the more of the file we can sample the better the guess. If you 
         *      are going to read the whole file into memory at some point, then best to pass 
         *      in the whole byte array directly. Otherwise, decide how to trade off 
         *      reliability against performance / memory usage.
         *      
         *  - The UTF-8 detection heuristic only works for western text, as it relies on 
         *      the presence of UTF-8 encoded accented and other characters found in the upper 
         *      ranges of the Latin-1 and (particularly) Windows-1252 codepages.
         *  
         *  - For more general detection routines, see existing projects / resources:
         *    - MLang - Microsoft library originally for IE6, available in Windows XP and later APIs now (I think?)
         *      - MLang .Net bindings: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
         *    - CharDet - Mozilla browser's detection routines
         *      - Ported to Java then .Net: http://www.conceptdevelopment.net/Localization/NCharDet/
         *      - Ported straight to .Net: http://code.google.com/p/chardetsharp/source/browse
         *  
         * Copyright Tao Klerks, 2010-2012, tao@klerks.biz
         * Licensed under the modified BSD license:
         * 
Redistribution and use in source and binary forms, with or without modification, are 
permitted provided that the following conditions are met:
 - Redistributions of source code must retain the above copyright notice, this list of 
conditions and the following disclaimer.
 - Redistributions in binary form must reproduce the above copyright notice, this list 
of conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
 - The name of the author may not be used to endorse or promote products derived from 
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, 
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY 
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY 
OF SUCH DAMAGE.
         * 
         * CHANGELOG:
         *  - 2012-02-03: 
         *    - Simpler methods, removing the silly "DefaultEncoding" parameter (with "??" operator, saves no typing)
         *    - More complete methods
         *      - Optionally return indication of whether BOM was found in "Detect" methods
         *      - Provide straight-to-string method for byte arrays (GetStringFromByteArray)
         */

        const long _defaultHeuristicSampleSize = 0x10000; //completely arbitrary - inappropriate for high numbers of files / high speed requirements

        public static Encoding DetectTextFileEncoding(string InputFilename)
        {
            using (FileStream textfileStream = File.OpenRead(InputFilename))
            {
                return DetectTextFileEncoding(textfileStream, _defaultHeuristicSampleSize);
            }
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize)
        {
            bool uselessBool = false;
            return DetectTextFileEncoding(InputFileStream, HeuristicSampleSize, out uselessBool);
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize, out bool HasBOM)
        {
            if (InputFileStream == null)
                throw new ArgumentNullException("InputFileStream", "Must provide a valid Filestream!");

            if (!InputFileStream.CanRead)
                throw new ArgumentException("Provided file stream is not readable!", "InputFileStream");

            if (!InputFileStream.CanSeek)
                throw new ArgumentException("Provided file stream cannot seek!", "InputFileStream");

            Encoding encodingFound = null;

            long originalPos = InputFileStream.Position;

            InputFileStream.Position = 0;


            //First read only what we need for BOM detection
            byte[] bomBytes = new byte[InputFileStream.Length > 4 ? 4 : InputFileStream.Length];
            InputFileStream.Read(bomBytes, 0, bomBytes.Length);

            encodingFound = DetectBOMBytes(bomBytes);

            if (encodingFound != null)
            {
                InputFileStream.Position = originalPos;
                HasBOM = true;
                return encodingFound;
            }


            //BOM Detection failed, going for heuristics now.
            //  create sample byte array and populate it
            byte[] sampleBytes = new byte[HeuristicSampleSize > InputFileStream.Length ? InputFileStream.Length : HeuristicSampleSize];
            Array.Copy(bomBytes, sampleBytes, bomBytes.Length);
            if (InputFileStream.Length > bomBytes.Length)
                InputFileStream.Read(sampleBytes, bomBytes.Length, sampleBytes.Length - bomBytes.Length);
            InputFileStream.Position = originalPos;

            //test byte array content
            encodingFound = DetectUnicodeInByteSampleByHeuristics(sampleBytes);

            HasBOM = false;
            return encodingFound;
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData)
        {
            bool uselessBool = false;
            return DetectTextByteArrayEncoding(TextData, out uselessBool);
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData, out bool HasBOM)
        {
            if (TextData == null)
                throw new ArgumentNullException("TextData", "Must provide a valid text data byte array!");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                HasBOM = true;
                return encodingFound;
            }
            else
            {
                //test byte array content
                encodingFound = DetectUnicodeInByteSampleByHeuristics(TextData);

                HasBOM = false;
                return encodingFound;
            }
        }

        public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding)
        {
            return GetStringFromByteArray(TextData, DefaultEncoding, _defaultHeuristicSampleSize);
        }

        public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding, long MaxHeuristicSampleSize)
        {
            if (TextData == null)
                throw new ArgumentNullException("TextData", "Must provide a valid text data byte array!");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                //For some reason, the default encodings don't detect/swallow their own preambles!!
                return encodingFound.GetString(TextData, encodingFound.GetPreamble().Length, TextData.Length - encodingFound.GetPreamble().Length);
            }
            else
            {
                byte[] heuristicSample = null;
                if (TextData.Length > MaxHeuristicSampleSize)
                {
                    heuristicSample = new byte[MaxHeuristicSampleSize];
                    Array.Copy(TextData, heuristicSample, MaxHeuristicSampleSize);
                }
                else
                {
                    heuristicSample = TextData;
                }

                encodingFound = DetectUnicodeInByteSampleByHeuristics(heuristicSample) ?? DefaultEncoding;
                return encodingFound.GetString(TextData);
            }
        }


        public static Encoding DetectBOMBytes(byte[] BOMBytes)
        {
            if (BOMBytes == null)
                throw new ArgumentNullException("BOMBytes", "Must provide a valid BOM byte array!");

            if (BOMBytes.Length < 2)
                return null;

            if (BOMBytes[0] == 0xff 
                && BOMBytes[1] == 0xfe 
                && (BOMBytes.Length < 4 
                    || BOMBytes[2] != 0 
                    || BOMBytes[3] != 0
                    )
                )
                return Encoding.Unicode;

            if (BOMBytes[0] == 0xfe 
                && BOMBytes[1] == 0xff
                )
                return Encoding.BigEndianUnicode;

            if (BOMBytes.Length < 3)
                return null;

            if (BOMBytes[0] == 0xef && BOMBytes[1] == 0xbb && BOMBytes[2] == 0xbf)
                return Encoding.UTF8;

            if (BOMBytes[0] == 0x2b && BOMBytes[1] == 0x2f && BOMBytes[2] == 0x76)
                return Encoding.UTF7;

            if (BOMBytes.Length < 4)
                return null;

            if (BOMBytes[0] == 0xff && BOMBytes[1] == 0xfe && BOMBytes[2] == 0 && BOMBytes[3] == 0)
                return Encoding.UTF32;

            if (BOMBytes[0] == 0 && BOMBytes[1] == 0 && BOMBytes[2] == 0xfe && BOMBytes[3] == 0xff)
                return Encoding.GetEncoding(12001);

            return null;
        }

        public static Encoding DetectUnicodeInByteSampleByHeuristics(byte[] SampleBytes)
        {
            long oddBinaryNullsInSample = 0;
            long evenBinaryNullsInSample = 0;
            long suspiciousUTF8SequenceCount = 0;
            long suspiciousUTF8BytesTotal = 0;
            long likelyUSASCIIBytesInSample = 0;

            //Cycle through, keeping count of binary null positions, possible UTF-8 
            //  sequences from upper ranges of Windows-1252, and probable US-ASCII 
            //  character counts.

            long currentPos = 0;
            int skipUTF8Bytes = 0;

            while (currentPos < SampleBytes.Length)
            {
                //binary null distribution
                if (SampleBytes[currentPos] == 0)
                {
                    if (currentPos % 2 == 0)
                        evenBinaryNullsInSample++;
                    else
                        oddBinaryNullsInSample++;
                }

                //likely US-ASCII characters
                if (IsCommonUSASCIIByte(SampleBytes[currentPos]))
                    likelyUSASCIIBytesInSample++;

                //suspicious sequences (look like UTF-8)
                if (skipUTF8Bytes == 0)
                {
                    int lengthFound = DetectSuspiciousUTF8SequenceLength(SampleBytes, currentPos);

                    if (lengthFound > 0)
                    {
                        suspiciousUTF8SequenceCount++;
                        suspiciousUTF8BytesTotal += lengthFound;
                        skipUTF8Bytes = lengthFound - 1;
                    }
                }
                else
                {
                    skipUTF8Bytes--;
                }

                currentPos++;
            }

            //1: UTF-16 LE - in english / european environments, this is usually characterized by a 
            //  high proportion of odd binary nulls (starting at 0), with (as this is text) a low 
            //  proportion of even binary nulls.
            //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            //  60% nulls where you do expect nulls) are completely arbitrary.

            if (((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2 
                && ((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.Unicode;


            //2: UTF-16 BE - in english / european environments, this is usually characterized by a 
            //  high proportion of even binary nulls (starting at 0), with (as this is text) a low 
            //  proportion of odd binary nulls.
            //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            //  60% nulls where you do expect nulls) are completely arbitrary.

            if (((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2 
                && ((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.BigEndianUnicode;


            //3: UTF-8 - Martin Dürst outlines a method for detecting whether something CAN be UTF-8 content 
            //  using regexp, in his w3c.org unicode FAQ entry: 
            //  http://www.w3.org/International/questions/qa-forms-utf-8
            //  adapted here for C#.
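            //  ISO-8859-1 maps every byte to the char with the same value; Encoding.ASCII would
            //  turn every byte above 0x7F into '?', so the regex could never see the high bytes.
            string potentiallyMangledString = Encoding.GetEncoding("iso-8859-1").GetString(SampleBytes);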
            Regex UTF8Validator = new Regex(@"\A(" 
                + @"[\x09\x0A\x0D\x20-\x7E]"
                + @"|[\xC2-\xDF][\x80-\xBF]"
                + @"|\xE0[\xA0-\xBF][\x80-\xBF]"
                + @"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
                + @"|\xED[\x80-\x9F][\x80-\xBF]"
                + @"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
                + @"|[\xF1-\xF3][\x80-\xBF]{3}"
                + @"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
                + @")*\z");
            if (UTF8Validator.IsMatch(potentiallyMangledString))
            {
                //Unfortunately, just the fact that it CAN be UTF-8 doesn't tell you much about probabilities.
                //If all the characters are in the 0-127 range, no harm done, most western charsets are same as UTF-8 in these ranges.
                //If some of the characters were in the upper range (western accented characters), however, they would likely be mangled to 2-byte by the UTF-8 encoding process.
                // So, we need to play stats.

                // The "Random" likelihood of any pair of randomly generated characters being one 
                //   of these "suspicious" character sequences is:
                //     128 / (256 * 256) = 0.2%.
                //
                // In western text data, that is SIGNIFICANTLY reduced - most text data stays in the <127 
                //   character range, so we assume that more than 1 in 500,000 of these character 
                //   sequences indicates UTF-8. The number 500,000 is completely arbitrary - so sue me.
                //
                // We can only assume these character sequences will be rare if we ALSO assume that this
                //   IS in fact western text - in which case the bulk of the UTF-8 encoded data (that is 
                //   not already suspicious sequences) should be plain US-ASCII bytes. This, I 
                //   arbitrarily decided, should be 80% (a random distribution, eg binary data, would yield 
                //   approx 40%, so the chances of hitting this threshold by accident in random data are 
                //   VERY low). 

                if ((suspiciousUTF8SequenceCount * 500000.0 / SampleBytes.Length >= 1) //suspicious sequences
                    && (
                           //all suspicious, so cannot evaluate proportion of US-Ascii
                           SampleBytes.Length - suspiciousUTF8BytesTotal == 0 
                           ||
                           likelyUSASCIIBytesInSample * 1.0 / (SampleBytes.Length - suspiciousUTF8BytesTotal) >= 0.8
                       )
                    )
                    return Encoding.UTF8;
            }

            return null;
        }

        private static bool IsCommonUSASCIIByte(byte testByte)
        {
            if (testByte == 0x0A //lf
                || testByte == 0x0D //cr
                || testByte == 0x09 //tab
                || (testByte >= 0x20 && testByte <= 0x2F) //common punctuation
                || (testByte >= 0x30 && testByte <= 0x39) //digits
                || (testByte >= 0x3A && testByte <= 0x40) //common punctuation
                || (testByte >= 0x41 && testByte <= 0x5A) //capital letters
                || (testByte >= 0x5B && testByte <= 0x60) //common punctuation
                || (testByte >= 0x61 && testByte <= 0x7A) //lowercase letters
                || (testByte >= 0x7B && testByte <= 0x7E) //common punctuation
                )
                return true;
            else
                return false;
        }

        private static int DetectSuspiciousUTF8SequenceLength(byte[] SampleBytes, long currentPos)
        {
            int lengthFound = 0;

            if (SampleBytes.Length > currentPos + 1 
                && SampleBytes[currentPos] == 0xC2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x81 
                    || SampleBytes[currentPos + 1] == 0x8D 
                    || SampleBytes[currentPos + 1] == 0x8F
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0x90 
                    || SampleBytes[currentPos + 1] == 0x9D
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] >= 0xA0 
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1 
                && SampleBytes[currentPos] == 0xC3
                )
            {
                if (SampleBytes[currentPos + 1] >= 0x80 
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1 
                && SampleBytes[currentPos] == 0xC5
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92 
                    || SampleBytes[currentPos + 1] == 0x93
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xA0 
                    || SampleBytes[currentPos + 1] == 0xA1
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xB8 
                    || SampleBytes[currentPos + 1] == 0xBD 
                    || SampleBytes[currentPos + 1] == 0xBE
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1 
                && SampleBytes[currentPos] == 0xC6
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92)
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 1 
                && SampleBytes[currentPos] == 0xCB
                )
            {
                if (SampleBytes[currentPos + 1] == 0x86 
                    || SampleBytes[currentPos + 1] == 0x9C
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length > currentPos + 2 
                && SampleBytes[currentPos] == 0xE2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x80)
                {
                    if (SampleBytes[currentPos + 2] == 0x93 
                        || SampleBytes[currentPos + 2] == 0x94
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x98 
                        || SampleBytes[currentPos + 2] == 0x99 
                        || SampleBytes[currentPos + 2] == 0x9A
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x9C 
                        || SampleBytes[currentPos + 2] == 0x9D 
                        || SampleBytes[currentPos + 2] == 0x9E
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA0 
                        || SampleBytes[currentPos + 2] == 0xA1 
                        || SampleBytes[currentPos + 2] == 0xA2
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA6)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB0)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB9 
                        || SampleBytes[currentPos + 2] == 0xBA
                        )
                        lengthFound = 3;
                }
                else if (SampleBytes[currentPos + 1] == 0x82 
                    && SampleBytes[currentPos + 2] == 0xAC
                    )
                    lengthFound = 3;
                else if (SampleBytes[currentPos + 1] == 0x84 
                    && SampleBytes[currentPos + 2] == 0xA2
                    )
                    lengthFound = 3;
            }

            return lengthFound;
        }

    }
}

1
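Actually, Encoding.GetEncoding("Windows-1252") and Encoding.ASCII return different classes of object. When debugging, Windows-1252 shows up as a System.Text.SBCSCodePageEncoding object, whereas ASCII is a System.Text.ASCIIEncoding object. I never use ASCII when I need Windows-1252. - Nyerguds
The correct way to match binary data (bytes) with a regular expression is: string data = Encoding.GetEncoding("iso-8859-1").GetString(bytes); because it is the only single-byte encoding with a one-to-one mapping from bytes to string characters. - Amr Ali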

7

4
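Doesn't work; StreamReader assumes your file is UTF-8. - Cédric Boivin
@Cedric: Check MSDN for this constructor. Do you have evidence that the constructor does not behave as documented? Of course, that is possible with Microsoft's documentation :-) - Phil Hunt
5
This version only checks for the BOM. - Daniel Bişar
2
Hmm, don't you have to call Read() before reading CurrentEncoding? The MSDN page for CurrentEncoding says: "The character encoding is automatically detected only after the first call to any of StreamReader's Read methods, so the value may be different after that first call." - Carl Walsh
1
My testing shows this cannot be used reliably, so it should not be used at all. - Geoffrey McGrath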

7

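There are several answers here, but nobody has posted useful code.

Here is my code, which detects all the encodings that Microsoft detects in the StreamReader class in Framework 4.

Obviously you must call this function immediately after opening the stream, before reading anything else from it, because the BOM is the first bytes in the stream.

This function requires a stream that can seek (for example a FileStream). If you have a stream that cannot seek, you must write more complicated code that returns a byte buffer with the bytes that have already been read but are not part of the BOM.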
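For the non-seekable case, one possible shape for that more complicated code is sketched below. It is not the author's implementation: `NonSeekableBom` and the `leftover` out parameter are invented for this example, and it returns null rather than `Encoding.Default` when no BOM is found so that the caller can choose the fallback.

```csharp
using System;
using System.IO;
using System.Text;

public static class NonSeekableBom
{
    // Reads at most 4 bytes. Returns the BOM's encoding (or null) and hands
    // back any bytes that were consumed but are not part of the BOM.
    public static Encoding DetectEncoding(Stream stream, out byte[] leftover)
    {
        byte[] buf = new byte[4];
        int count = 0;

        // Stream.Read may return fewer bytes than requested, so loop.
        while (count < buf.Length)
        {
            int n = stream.Read(buf, count, buf.Length - count);
            if (n == 0)
                break; // end of stream
            count += n;
        }

        Encoding enc = null;
        int bomLength = 0;

        if (count >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0 && buf[3] == 0)
        { enc = new UTF32Encoding(false, true); bomLength = 4; }
        else if (count >= 4 && buf[0] == 0 && buf[1] == 0 && buf[2] == 0xFE && buf[3] == 0xFF)
        { enc = new UTF32Encoding(true, true); bomLength = 4; }
        else if (count >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        { enc = new UTF8Encoding(true); bomLength = 3; }
        else if (count >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        { enc = new UnicodeEncoding(true, true); bomLength = 2; }
        else if (count >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        { enc = new UnicodeEncoding(false, true); bomLength = 2; }

        // Everything read past the BOM must be processed before the rest of the stream.
        leftover = new byte[count - bomLength];
        Array.Copy(buf, bomLength, leftover, 0, leftover.Length);
        return enc;
    }
}
```

The caller is expected to decode `leftover` first and then continue reading from the stream.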

public static Encoding DetectEncoding(String s_Path)
{
    using (FileStream i_Stream = new FileStream(s_Path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        return DetectEncoding(i_Stream);
    }
}

/// <summary>
/// UTF8    : EF BB BF
/// UTF16 BE: FE FF
/// UTF16 LE: FF FE
/// UTF32 BE: 00 00 FE FF
/// UTF32 LE: FF FE 00 00
/// </summary>
public static Encoding DetectEncoding(Stream i_Stream)
{
    if (!i_Stream.CanSeek || !i_Stream.CanRead)
        throw new Exception("DetectEncoding() requires a seekable and readable Stream");

    // Try to read 4 bytes. If the stream is shorter, less bytes will be read.
    Byte[] u8_Buf = new Byte[4];
    int s32_Count = i_Stream.Read(u8_Buf, 0, 4);
    if (s32_Count >= 2)
    {
        if (u8_Buf[0] == 0xFE && u8_Buf[1] == 0xFF)
        {
            i_Stream.Position = 2;
            return new UnicodeEncoding(true, true);
        }

        if (u8_Buf[0] == 0xFF && u8_Buf[1] == 0xFE)
        {
            if (s32_Count >= 4 && u8_Buf[2] == 0 && u8_Buf[3] == 0)
            {
                i_Stream.Position = 4;
                return new UTF32Encoding(false, true);
            }
            else
            {
                i_Stream.Position = 2;
                return new UnicodeEncoding(false, true);
            }
        }

        if (s32_Count >= 3 && u8_Buf[0] == 0xEF && u8_Buf[1] == 0xBB && u8_Buf[2] == 0xBF)
        {
            i_Stream.Position = 3;
            return Encoding.UTF8;
        }

        if (s32_Count >= 4 && u8_Buf[0] == 0 && u8_Buf[1] == 0 && u8_Buf[2] == 0xFE && u8_Buf[3] == 0xFF)
        {
            i_Stream.Position = 4;
            return new UTF32Encoding(true, true);
        }
    }

    i_Stream.Position = 0;
    return Encoding.Default;
}

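If I only have a file name, how do I use this function? I need a function like this: public static Encoding DetectEncoding(string sFilename) - Michael Hutter
@MichaelHutter Use File.Open(sFilename) to get a file stream, then carry on from there. - Alex from Jitbit
File.Open(sFilename) opens a file and determines the encoding from the BOM inside the file. If the BOM is missing, it may get it wrong by assuming the wrong encoding. This answer makes the same mistake: it only works if a BOM is present. If the file has no BOM, you need to analyze the whole file content, as is done here: https://dev59.com/U2855IYBdhLWcg3wIAka#69312696 - Michael Hutter
This answer answers the question that Cedrik asked. There is no error in the answer. Your error is that you did not read the question carefully. Any detection of a text file's encoding without a BOM is unreliable. - Elmue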

3

I use Ude, a C# port of the Mozilla Universal Charset Detector. It is easy to use and gives some really good results.


2

1

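A solution for all Germans => ÄÖÜäöüß

This function opens the file and determines the encoding from the BOM.
If the BOM is missing, the file is interpreted as ANSI, but if it contains UTF8-encoded German umlauts, it is detected as UTF8.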

private static Encoding GetEncoding(string sFileName)
{
    using (var reader = new StreamReader(sFileName, Encoding.Default, true))
    {
        string sContent = "";
        if (reader.Peek() >= 0) // you need this!
            sContent = reader.ReadToEnd();
        Encoding MyEncoding = reader.CurrentEncoding;
        if (MyEncoding == Encoding.Default) // Ansi detected (this happens if BOM is missing)
        { // Look, if there are typical UTF8 chars in this file...
            string sUmlaute = "ÄÖÜäöüß";
            bool bUTF8CharDetected = false;
            for (int z=0; z<sUmlaute.Length; z++)
            {
                string sUTF8Letter = sUmlaute.Substring(z, 1);
                string sUTF8LetterInAnsi = Encoding.Default.GetString(Encoding.UTF8.GetBytes(sUTF8Letter));
                if (sContent.Contains(sUTF8LetterInAnsi))
                {
                    bUTF8CharDetected = true;
                    break;
                }
            }
            if (bUTF8CharDetected) MyEncoding = Encoding.UTF8;
        }
        return MyEncoding;
    }
}


1

0
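If your file starts with the bytes 60, 118, 56, 46 and 49, then you have an ambiguous case. It could be UTF-8 (without BOM) or any of the single-byte encodings such as ASCII, ANSI, ISO-8859-1, etc.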
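To see why those five bytes are ambiguous, it may help to look at them directly: all are below 0x80, so they decode to the same characters in ASCII, UTF-8, and the common single-byte code pages. The helper names below (`LeadingBytes`, `AllAscii`, `AsAsciiText`) are made up for illustration.

```csharp
using System.Text;

public static class LeadingBytes
{
    // True when no byte has its high bit set, i.e. the data is plain 7-bit ASCII.
    public static bool AllAscii(byte[] data)
    {
        foreach (byte b in data)
        {
            if (b > 0x7F)
                return false;
        }
        return true;
    }

    // Decodes the bytes as ASCII text.
    public static string AsAsciiText(byte[] data)
    {
        return Encoding.ASCII.GetString(data);
    }
}
```

For the bytes in the question, `AsAsciiText(new byte[] { 60, 118, 56, 46, 49 })` yields `<v8.1`. That says nothing about which encoding the rest of the file uses.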

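Hmm... so I need to test all of them? - Cédric Boivin
That is just plain ASCII. UTF-8 without special characters is identical to ASCII, and if there are special characters, they use specific, detectable bit patterns. - Nyerguds
@Nyerguds Maybe not. I have a UTF-8 text file (with no "specific detectable bit patterns"; most of it is English characters). If I read it as ASCII, it fails to read one particular "-" symbol. - Amit
Not possible. If the character is not ASCII, it will be encoded using specific, detectable bit patterns; that is how UTF-8 works. More likely your text is neither ASCII nor UTF-8, but just an 8-bit encoding like Windows-1252. - Nyerguds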
