Get-Content is very inefficient at reading large files, and Sort-Object is not fast either. Let's establish a baseline first:
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
Using a 40 MB text file made up of 100,000 unique lines repeated 16 times (1.6 million lines in total), running this script on my machine produces the following output:
Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663
Not impressive at all: more than 6 minutes to sort a tiny file. Every step leaves plenty of room for improvement. Let's use a StreamReader to read the file line by line into a HashSet, which removes duplicates as it goes, then copy the data into a List and sort it, and finally write the result out with a StreamWriter.
$hs = New-Object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
while (($line = $reader.ReadLine()) -ne $null)
{
$t = $hs.Add($line)
}
}
finally {
$reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = New-Object System.Collections.Generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);
$sw = [System.Diagnostics.Stopwatch]::StartNew();
try
{
$f = New-Object System.IO.StreamWriter "d:\result2.txt";
foreach ($s in $ls)
{
$f.WriteLine($s);
}
}
finally
{
$f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);
This script produces:
read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802
On the same input file, this runs more than 10 times faster. I am still somewhat surprised, though, that reading the file from disk takes 30 seconds.
The output file (sorted.txt) is twice the size of the source file. - Predrag Vasić
Replacing > sorted.txt with | Set-Content sorted.txt might work; otherwise you can try | Out-File sorted.txt -Encoding <your choice>. - notjustme
I compared the performance of gc file.txt | sort | get-unique and gc file.txt | sort -Unique, and the second version turned out to be faster (I suspect because it removes the overhead of an extra pipeline stage). - E.Z. Hart