Yes, this is a frequently asked question, and the matter is vague to me since I don't know much about it. But I would like a very precise way to find a file's encoding, as precise as Notepad++ is.
The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I have had greater success determining a file's endianness by analyzing its byte order mark (BOM). If the file does not have a BOM, this cannot determine the file's encoding.
*Updated 2020-04-08 to include UTF-32LE detection and return the correct encoding for UTF-32BE
/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}
You should take a look at the reference source for StreamReader; that implementation is more in line with what most people want. It creates new encodings rather than using the existing Encoding.Unicode objects, so the equality checks will fail (which may rarely matter anyway since, e.g., Encoding.UTF8 can return different objects), but it (1) doesn't use the very odd UTF-7 format, (2) defaults to UTF-8 if no BOM is found, and (3) can be overridden to use a different default encoding. - hangar
When you detect the UTF-32BE BOM (00 00 FE FF), you return the system-provided Encoding.UTF32, which is a little-endian encoding (as noted here). Also, as @Nyerguds pointed out, you were still not looking for UTF-32LE, whose signature is FF FE 00 00 (see https://en.wikipedia.org/wiki/Byte_order_mark). As that user noted, because that signature subsumes the UTF-16LE one, the UTF-32LE check must come before the 2-byte check. - Glenn Slayden

The following code works fine for me, using the StreamReader class:
using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
{
    reader.Peek(); // you need this!
    var encoding = reader.CurrentEncoding;
}
The trick is to use the Peek call; otherwise, .NET has not done anything (and it hasn't read the preamble, the BOM). Of course, if you use any other ReadXXX call before checking the encoding, it works too.
If the file has no BOM, then the defaultEncodingIfNoBom encoding will be used. There is also a StreamReader constructor overload without this argument (in which case the encoding will by default be set to UTF8 before any read), but I recommend defining the default encoding in your context.
I have successfully tested this with files with BOMs for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE), but not with UTF7.
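As an illustration of the Peek trick above, the following sketch (my own demo, not from the answer; it uses a temp file created on the fly) writes a file with a UTF-16LE BOM and lets StreamReader identify it:

```csharp
using System;
using System.IO;
using System.Text;

class BomDetectionDemo
{
    // Returns the encoding StreamReader detects from the file's BOM,
    // or the fallback when no BOM is present.
    public static Encoding Detect(string path, Encoding defaultEncodingIfNoBom)
    {
        using (var reader = new StreamReader(path, defaultEncodingIfNoBom, detectEncodingFromByteOrderMarks: true))
        {
            reader.Peek(); // forces the reader to read the preamble/BOM
            return reader.CurrentEncoding;
        }
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        // Encoding.Unicode (UTF-16LE) writes a FF FE BOM by default
        File.WriteAllText(path, "hello", Encoding.Unicode);
        Console.WriteLine(Detect(path, Encoding.UTF8).WebName); // utf-16
        File.Delete(path);
    }
}
```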
foreach($filename in $args) { $reader = [System.IO.StreamReader]::new($filename, [System.Text.Encoding]::default,$true); $peek = $reader.Peek(); $reader.currentencoding | select bodyname,encodingname; $reader.close() } - js2010
This doesn't work if the file is UTF-8 without BOM. - Ozkan
new StreamReader(@"C:\Temp\File without BOM.txt", true).CurrentEncoding.EncodingName returns Unicode (UTF-8). - Maxence

Providing the implementation details for the steps proposed by @CodesInChaos:
1) Check if there is a Byte Order Mark (BOM)
2) Check if the file is valid UTF8
3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)
Step 2 works because most non-ASCII sequences in codepages other than UTF8 are not valid UTF8. https://dev59.com/U2855IYBdhLWcg3wIAka#4522251 explains the tactic in more detail.
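The core of step 2, checking UTF-8 validity by strict decoding, can be isolated in a small sketch (the class and method names here are just for illustration):

```csharp
using System;
using System.Text;

class StrictUtf8Demo
{
    // True if the bytes decode as UTF-8 without any invalid sequences.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strictUtf8 = Encoding.GetEncoding("utf-8",
            new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsValidUtf8(Encoding.UTF8.GetBytes("héllo")));  // True
        // 0xE9 is "é" in ISO-8859-1 but an incomplete multi-byte sequence in UTF-8
        Console.WriteLine(IsValidUtf8(new byte[] { 0x68, 0xE9, 0x6C }));  // False
    }
}
```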
using System;
using System.IO;
using System.Text;

// Using encoding from BOM or UTF8 if no BOM found,
// check if the file is valid, by reading all lines
// If decoding fails, use the local "ANSI" codepage
public string DetectFileEncoding(Stream fileStream)
{
    var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
    using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
        detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
    {
        string detectedEncoding;
        try
        {
            while (!reader.EndOfStream)
            {
                var line = reader.ReadLine();
            }
            detectedEncoding = reader.CurrentEncoding.BodyName;
        }
        catch (Exception)
        {
            // Failed to decode the file using the BOM/UTF-8.
            // Assume it's the local ANSI codepage
            detectedEncoding = "ISO-8859-1";
        }
        // Rewind the stream
        fileStream.Seek(0, SeekOrigin.Begin);
        return detectedEncoding;
    }
}
[Test]
public void Test1()
{
    Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
    var detectedEncoding = DetectFileEncoding(fs);
    using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
    {
        // Consume your file
        var line = reader.ReadLine();
        ...
    }
}
Maybe use reader.Peek() instead of `while (!reader.EndOfStream) { var line = reader.ReadLine(); }` - Harison Silva
reader.Peek() doesn't read the whole stream. I found that with larger streams Peek() was inadequate, so I used reader.ReadToEndAsync() instead. - Gary Pendlebury
var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback()); is used in the try block when reading a line. If the encoder fails to parse the provided text (i.e. the text is not encoded with UTF-8), Utf8EncodingVerifier will throw an exception. The exception is caught, and we then know the text is not UTF-8, and we default to ISO-8859-1. - Berthier Lemieux

Check this out. It's a port of the Mozilla Universal Charset Detector, and you can use it like this...
public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}",
                cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}
I would try the following steps:
1) Check if there is a Byte Order Mark (BOM)
2) Check if the file is valid UTF8
3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)
Step 2 works because most non-ASCII sequences in codepages other than UTF8 are not valid UTF8.
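A minimal sketch of those three steps (my own illustration, not code from the answer; the ANSI fallback uses Encoding.Default, which on .NET Core and later may require registering CodePagesEncodingProvider to be a true ANSI codepage):

```csharp
using System;
using System.Text;

static class EncodingGuesser
{
    // Steps: 1) BOM, 2) strict UTF-8 validation, 3) local "ANSI" codepage.
    public static Encoding Guess(byte[] bytes)
    {
        // 1) Byte order mark (only the common signatures shown here)
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            return Encoding.UTF8;
        if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16LE
        if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16BE

        // 2) Valid UTF-8?
        try
        {
            Encoding.GetEncoding("utf-8",
                new EncoderExceptionFallback(), new DecoderExceptionFallback())
                .GetString(bytes);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            // 3) Fall back to the local "ANSI" codepage
            return Encoding.Default;
        }
    }
}
```

Note that step 1 here only covers the common BOMs; the UTF-32 signatures would need the same ordering care discussed earlier in the thread.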
When you create an instance of Utf8Encoding, you can pass in an extra parameter that determines whether an exception should be thrown or whether you prefer silent data corruption. - CodesInChaos

.NET is not very helpful here, but you can try the following algorithm:
Here is the call:
var encoding = FileHelper.GetEncoding(filePath);
if (encoding == null)
    throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");
Here is the code:
public class FileHelper
{
    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM) and, if none is found, by trying to parse it with different encodings.
    /// Defaults to UTF8 when detection of the text file's endianness fails.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding or null.</returns>
    public static Encoding GetEncoding(string filename)
    {
        var encodingByBOM = GetEncodingByBOM(filename);
        if (encodingByBOM != null)
            return encodingByBOM;

        // BOM not found :(, so try to parse characters into several encodings
        var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
        if (encodingByParsingUTF8 != null)
            return encodingByParsingUTF8;

        var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
        if (encodingByParsingLatin1 != null)
            return encodingByParsingLatin1;

        var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
        if (encodingByParsingUTF7 != null)
            return encodingByParsingUTF7;

        return null; // no encoding found
    }

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM)
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    private static Encoding GetEncodingByBOM(string filename)
    {
        // Read the BOM
        var byteOrderMark = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(byteOrderMark, 0, 4);
        }

        // Analyze the BOM
        if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
        if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return Encoding.UTF32;

        return null; // no BOM found
    }

    private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
    {
        var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
            {
                while (!textReader.EndOfStream)
                {
                    textReader.ReadLine(); // in order to increment the stream position
                }

                // all text parsed ok
                return textReader.CurrentEncoding;
            }
        }
        catch (Exception) { }

        return null; // the file could not be parsed with this encoding
    }
}
You can use the Microsoft.ProgramSynthesis.Detection package (currently version 8.17.0), and use EncodingTypeUtils.GetDotNetName, rather than a switch, to get the Encoding instance:
using System.Text;
using Microsoft.ProgramSynthesis.Detection.Encoding;
...
public Encoding? DetectEncoding(Stream stream)
{
    try
    {
        if (stream.CanSeek)
        {
            // Read from the beginning if possible
            stream.Seek(0, SeekOrigin.Begin);
        }

        // Detect encoding type (enum)
        var encodingType = EncodingIdentifier.IdentifyEncoding(stream);

        // Get the corresponding encoding name to be passed to System.Text.Encoding.GetEncoding
        var encodingDotNetName = EncodingTypeUtils.GetDotNetName(encodingType);

        if (!string.IsNullOrEmpty(encodingDotNetName))
        {
            return Encoding.GetEncoding(encodingDotNetName);
        }
    }
    catch (Exception e)
    {
        // Handle exception (log, throw, etc...)
    }

    // In case of error return null or a default value
    return null;
}
For C#, see:
https://docs.microsoft.com/zh-cn/dotnet/api/system.io.streamreader.currentencoding?view=net-5.0
string path = @"path\to\your\file.ext";

using (StreamReader sr = new StreamReader(path, true))
{
    while (sr.Peek() >= 0)
    {
        Console.Write((char)sr.Read());
    }

    //Test for the encoding after reading, or at least
    //after the first read.
    Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
    Console.ReadLine();
    Console.WriteLine();
}
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h, *.ml) {
    $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
    $contentUTF = $openUTF.ReadToEnd()
    [regex]$regex = '�'
    $c=$regex.Matches($contentUTF).count
    $openUTF.Close()
    if ($c -ne 0) {
        $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
        $contentLatin1 = $openLatin1.ReadToEnd()
        $openLatin1.Close()
        [regex]$regex = '[\x7F-\xAF]'
        $c=$regex.Matches($contentLatin1).count
        if ($c -eq 0) {
            [System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
            $i.FullName
        }
        else {
            $openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
            $contentGB = $openGB.ReadToEnd()
            $openGB.Close()
            [System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
            $i.FullName
        }
    }
}
Write-Host -NoNewLine 'Press any key to continue...';
$null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');
This one looks quite good.

First, create a helper method:
private static Encoding TestCodePage(Encoding testCode, byte[] byteArray)
{
    try
    {
        var encoding = Encoding.GetEncoding(testCode.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        var a = encoding.GetCharCount(byteArray);
        return testCode;
    }
    catch (Exception e)
    {
        return null;
    }
}
Then create the testing source code. In this case, I have a byte array whose encoding I need to find:
public static Encoding DetectCodePage(byte[] contents)
{
    if (contents == null || contents.Length == 0)
    {
        return Encoding.Default;
    }

    return TestCodePage(Encoding.UTF8, contents)
        ?? TestCodePage(Encoding.Unicode, contents)
        ?? TestCodePage(Encoding.BigEndianUnicode, contents)
        ?? TestCodePage(Encoding.GetEncoding(1252), contents) // Western European
        ?? TestCodePage(Encoding.GetEncoding(28591), contents) // ISO Western European
        ?? TestCodePage(Encoding.ASCII, contents)
        ?? TestCodePage(Encoding.Default, contents); // likely Unicode
}