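This is not really a programming question. Is there a command-line or Windows tool (Windows 7) to get the current encoding of a text file? Sure, I could write a little C# app, but I wanted to know if there is something already built in.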
Looking for a Node.js/npm solution? Try encoding-checker:
npm install -g encoding-checker
Usage: encoding-checker [-p pattern] [-i encoding] [-v]
Options:
--help Show help [boolean]
--version Show version number [boolean]
--pattern, -p, -d [default: "*"]
--ignore-encoding, -i [default: ""]
--verbose, -v [default: false]
Get the encoding of all files in the current directory:
encoding-checker
Return the encoding of all md files in the current directory:
encoding-checker -p "*.md"
Get the encoding of all files in the current directory and its subfolders (this can take a while for large folders and may look unresponsive):
encoding-checker -p "**"
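File Encoding Checker is a GUI tool that validates the text encoding of one or more files. It can display the encoding of all selected files, or only the files that do not have the encodings you specify.
File Encoding Checker requires .NET 4 or above to run.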
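For years we have tried to get the file encoding through native CMD/PowerShell means, but always had to fall back on using (and installing) third-party software such as Cygwin, git-bash and other external binaries. Now there is finally a native way.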
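Before people start complaining about all the ways this can fail, please understand that this tool is mainly meant for identifying text, log, CSV and TAB type files, not binary files. Also, file encoding detection is mostly a guessing game, so the script provided here only makes some rudimentary guesses and may fail on large files. Feel free to test it and post improvements on the gist linked in the script header.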
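To illustrate why this is a guessing game, here is a minimal Python sketch (not part of the PowerShell script). It decodes the same two bytes under three encodings, and each one accepts them without complaint. The helper name `decode_candidates` and the choice of candidate encodings are my own assumptions for the example.

```python
def decode_candidates(data: bytes, encodings=("utf-8", "cp1252", "cp437")) -> dict:
    """Decode the same bytes under several encodings; failures map to None."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    sample = "é".encode("utf-8")  # two bytes: 0xC3 0xA9
    for name, text in decode_candidates(sample).items():
        # ascii() keeps the output printable on any console code page
        print(f"{name:8} -> {ascii(text)}")
```

All three decodes succeed and give different text, so the bytes alone cannot tell you which encoding the author meant. A detector has to rely on BOMs or statistics.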
To test the tool, I dumped a bunch of weird garbage text into a string and then exported it using each of the available Windows encodings:
ASCII, BigEndianUnicode, BigEndianUTF32, OEM, Unicode, UTF7, UTF8, UTF8BOM, UTF8NoBOM, UTF32
# The Garbage
$d=''; ((33..126) + (161..252)) | ForEach-Object { $c = $([char]$_); $d += ${c} }; $d = "1234 5678 ABCD EFGH`nCRLF: `r`nESC[ :`e[`nESC[m :`e[m`n`r`nASCII [33-126,161-252]:`n$d";
$elist=@('ASCII','BigEndianUnicode','BigEndianUTF32','OEM','Unicode','UTF7','UTF8','UTF8BOM','UTF8NoBOM','UTF32')
$elist | ForEach-Object { $ec=[string]($_); $fp = "zx_$ec.txt"; Write-Host -Fo DarkGray ("Encoding to file: {0}" -f $fp); $d | Out-File -Encoding $ec -FilePath $fp; }
# ls | encguess
ascii zx_ASCII.txt
utf-16 BE zx_BigEndianUnicode.txt
utf-32 BE zx_BigEndianUTF32.txt
OEM (finds) : (3)
OEM 437 ? zx_OEM.txt
utf-16 LE zx_Unicode.txt
utf-32 LE zx_UTF32.txt
utf-7 zx_UTF7.txt
utf-8 zx_UTF8.txt
utf-8 BOM zx_UTF8BOM.txt
utf-8 zx_UTF8NoBOM.txt
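The BOM-based labels in the listing above (utf-8 BOM, utf-16 LE/BE, utf-32 LE/BE) only need the first few bytes of each file. As a rough cross-check, the same signature test can be sketched in Python. The names `BOMS` and `sniff_bom` are made up for this example, and the labels are simply copied from the listing.

```python
import codecs

# Longer signatures go first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the order of the checks matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32 LE"),
    (codecs.BOM_UTF32_BE, "utf-32 BE"),
    (codecs.BOM_UTF8, "utf-8 BOM"),
    (b"+/v", "utf-7 BOM"),  # 2B 2F 76
    (codecs.BOM_UTF16_LE, "utf-16 LE"),
    (codecs.BOM_UTF16_BE, "utf-16 BE"),
]


def sniff_bom(head: bytes) -> str:
    """Return a label for the BOM at the start of `head`, or 'unsure'."""
    for bom, label in BOMS:
        if head.startswith(bom):
            return label
    return "unsure"


if __name__ == "__main__":
    text = "1234 ABCD"
    samples = {
        "utf-8 (no BOM)": text.encode("utf-8"),
        "utf-8-sig": text.encode("utf-8-sig"),
        "utf-16-le + BOM": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
        "utf-32-be + BOM": codecs.BOM_UTF32_BE + text.encode("utf-32-be"),
    }
    for name, data in samples.items():
        print(f"{name:16} -> {sniff_bom(data[:4])}")
```

Files without a BOM (plain ASCII, UTF-8 without BOM, OEM code pages) come back as 'unsure' here and need a content-based guess.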
#!/usr/bin/env pwsh
# GuessFileEncoding.ps1 - Guess File Encoding for Windows-11 using Powershell
# -*- coding: utf-8 -*-
#------------------------------------------------------------------------------
# Author  : not2qubit
# Date    : 2023-11-27
# Version : 1.0.0
# License : CC-BY-SA-4.0
# URL     : https://gist.github.com/eabase/d4f16c8c6535f3868d5dfb1efbde0e5a
#--------------------------------------------------------
# Usage : ls | encguess
# : encguess .\somefile.txt
#--------------------------------------------------------
# References:
#
# [1] https://www.fileformat.info/info/charset/UTF-7/list.htm
# [2] https://learn.microsoft.com/en-gb/windows/win32/intl/code-page-identifiers
# [3] https://learn.microsoft.com/en-us/windows/console/console-virtual-terminal-sequences
# [4] https://gist.github.com/fnky/458719343aabd01cfb17a3a4f7296797
# [5] https://github.com/dankogai/p5-encode/blob/main/lib/Encode/Guess.pm
#
#--------------------------------------------------------
# https://dev59.com/GLvoa4cB1Zd3GeqP7aPe#62511302
Function Find-Bytes([byte[]]$Bytes, [byte[]]$Search, [int]$Start, [Switch]$All) {
    For ($Index = $Start; $Index -le $Bytes.Length - $Search.Length ; $Index++) {
        For ($i = 0; $i -lt $Search.Length -and $Bytes[$Index + $i] -eq $Search[$i]; $i++) {}
        If ($i -ge $Search.Length) {
            $Index
            If (!$All) { Return }
        }
    }
}
function get_file_encoding {
    param([Parameter(ValueFromPipeline=$True)] $filename)
    begin {
        # Use .NET to set current directory
        [Environment]::CurrentDirectory = (pwd).path
    }
    process {
        function guess_encoding ($bytes) {
            # ---------------------------------------------------------------------------------------------------
            # Plan: Do the easy checks first!
            # 1. scan whole file & check if there are no codes above [1-127] and excess of "?" (0x3f) --> ASCII
            # 2. scan whole file & check if there are codes above [1-127] --> ? ANSI/OEM/UTF-8
            # 3. scan whole file & check if there are many codes "2b41" & char<127 --> UTF-7 --> "2b2f76" UTF-7 BOM
            # 4. scan whole file & check if there are many codes "c2 | c3" --> UTF-8
            # ---------------------------------------------------------------------------------------------------
            switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
                # 1. Check UTF-8 / UTF-7 BOM
                '^efbbbf'   { return 'utf-8 BOM' }  # UTF-8 BOM (?)
                '^2b2f76'   { return 'utf-7 BOM' }  # UTF-7 BOM (65000)
                # 2. Check UTF-32 (BE|LE), before UTF-16 because 'fffe0000' also starts with 'fffe'
                '^fffe0000' { return 'utf-32 LE' }  # UTF-32 LE (12000)
                '^0000feff' { return 'utf-32 BE' }  # UTF-32 BE (12001) 'bigendianutf32'
                # 3. Check UTF-16 (BE|LE)
                '^fffe'     { return 'utf-16 LE' }  # UTF-16 LE (1200) 'unicode'
                '^feff'     { return 'utf-16 BE' }  # UTF-16 BE (1201) 'bigendianunicode'
                default     { return 'unsure' }     # No BOM found
            }
        }
        function guess_again ($blob) {
            #-------------------------------
            # 1. Check if ASCII [0-127] (7-bit)
            #-------------------------------
            # (a) Check for any byte above 127 (i.e. not 7-bit ASCII)
            $guess_ascii = 1
            foreach ($i in $blob) { if ($i -gt 127) { $guess_ascii=0; break; } }
            # (b) Check if there are many consecutive "?"s.
            #     That would indicate having erroneously saved a
            #     file containing ISO-8859-1 characters as ASCII.
            #$b = [byte[]]("????".ToCharArray())
            #$n = (Find-Bytes -all $blob $b).Count
            #if ($n -gt 4) {}
            #-------------------------------
            # 2. Check for UTF-7 strings "2b41" (43 65)
            #-------------------------------
            $b = [byte[]]("+A".ToCharArray())
            $finds=(Find-Bytes -all $blob $b).Count
            $quart = [math]::Round(($blob.length)*0.05)
            #Write-Host -Fo DarkGray " UTF-7 (quart,finds) : (${quart},${finds})"
            if ( ($finds -gt 10) -And ($guess_ascii -eq 1) ) {
                return 'utf-7'
            } elseif ($guess_ascii -eq 1) {
                return 'ascii'
            }
            #-------------------------------
            # 3. Check for UTF-8 strings "c2|c3" (194,195)
            #-------------------------------
            # If > 25% are c2|c3, probably utf-8
            $b = [byte[]](0xc2)
            $c = [byte[]](0xc3)
            $f1=(Find-Bytes -all $blob $b).Count
            $f2=(Find-Bytes -all $blob $c).Count
            $quart = [math]::Round(($blob.length)*0.25)
            $finds = ($f1 + $f2)
            if ($finds -gt $quart) { return "utf-8" }
            #-------------------------------
            # 4. Check for OEM Strings:
            #-------------------------------
            # Check for "4x" sequences of 'AAAA'(41), 'IIII'(49), 'OOOO'(4f)
            $n = 0
            #$oemlist = @(65,73,79)
            $oemlist = @('A','I','O')
            #$b = [byte[]](("$i"*4).ToCharArray())
            foreach ($i in $oemlist) {$b = [byte[]](("$i"*4).ToCharArray()); $n += (Find-Bytes -all $blob $b).Count }
            #$blob | Group-Object | Select Name, Count | Sort -Top 15 -Descending Count
            Write-Host -Fo DarkGray " OEM (finds) : ($n)"
            if ($n -ge 3) { return "OEM 437 ?" }
            return "unknown"
        }
        # Read only the first 4 bytes for the BOM check
        $bytes = [byte[]](Get-Content $filename -AsByteStream -ReadCount 4 -TotalCount 4)
        if (!$bytes) {
            $guess = 'failed'
        } else {
            $guess = guess_encoding $bytes
        }
        if ($guess -eq 'unsure') {
            # 28591 iso-8859-1 Western European (ISO) // Windows-1252
            # No BOM: read the whole file and fall back to content heuristics
            $blob = [byte[]](Get-Content $filename -AsByteStream -ReadCount 0)
            $guess = guess_again $blob
        }
        # Works for both FileInfo objects (ls | encguess) and plain path strings
        $name = [System.IO.Path]::GetFileName("$filename")
        Write-Host -Fo White (" {0,-16}" -f $guess) -Non; Write-Host -Fo DarkYellow "$name"
    }
}
Set-Alias encguess get_file_encoding
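
The no-BOM fallback above counts 0xC2/0xC3 bytes and calls the file UTF-8 if they exceed 25% of the content. Text with only a few accented characters can slip past that threshold. A different technique, shown here as a hedged Python sketch rather than as what the script does, is to attempt a strict UTF-8 decode, because random 8-bit data rarely forms valid UTF-8 sequences. The function name `guess_without_bom` and its return labels are assumptions for illustration. The "+A" count threshold of 10 is borrowed from the script.

```python
def guess_without_bom(blob: bytes) -> str:
    """Rough content-based guess for a file that has no BOM."""
    if not blob:
        return "empty"
    if blob.isascii():
        # Pure 7-bit data: treat many "+A" shift sequences as a UTF-7 hint
        # (same threshold as the PowerShell script), otherwise plain ASCII.
        return "utf-7" if blob.count(b"+A") > 10 else "ascii"
    try:
        # Strict decoding raises on any malformed multi-byte sequence.
        blob.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some single-byte code page (ANSI/OEM); which one is still a guess.
        return "unknown 8-bit"


if __name__ == "__main__":
    print(guess_without_bom(b"plain text"))
    print(guess_without_bom("h\u00e9llo".encode("utf-8")))
    print(guess_without_bom("h\u00e9llo".encode("cp1252")))
```

This still cannot tell Windows-1252 from OEM 437 or ISO-8859-1, so the last branch stays an open guess, the same as in the script.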