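I wrote a small parsing program to compare the older System.IO.Stream with the newer System.IO.Pipelines in .NET Core. I expected the pipelines code to be as fast or faster. Instead, it is about 40% slower.
The program is simple: it searches a 100 MB text file for a keyword and returns the number of the line that contains it. Here is the Stream version: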

    public static async Task<int> GetLineNumberUsingStreamAsync(
        string file,
        string searchWord)
    {
        using var fileStream = File.OpenRead(file);
        using var lines = new StreamReader(fileStream, bufferSize: 4096);
        int lineNumber = 1;
        // ReadLineAsync returns null on stream end, exiting the loop
        while (await lines.ReadLineAsync() is string line)
        {
            if (line.Contains(searchWord))
                return lineNumber;
            lineNumber++;
        }
        return -1;
    }

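I would expect the stream code above to be slower than the pipelines code below, because the StreamReader decodes the bytes into strings. The pipelines code avoids this by operating on the bytes directly: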

    public static async Task<int> GetLineNumberUsingPipeAsync(string file, string searchWord)
    {
        var searchBytes = Encoding.UTF8.GetBytes(searchWord);
        using var fileStream = File.OpenRead(file);
        var pipe = PipeReader.Create(fileStream, new StreamPipeReaderOptions(bufferSize: 4096));
        var lineNumber = 1;
        while (true)
        {
            var readResult = await pipe.ReadAsync().ConfigureAwait(false);
            var buffer = readResult.Buffer;
            if (TryFindBytesInBuffer(ref buffer, searchBytes, ref lineNumber))
            {
                return lineNumber;
            }
            // Marks the whole buffer as consumed, including any bytes after the last '\n'
            pipe.AdvanceTo(buffer.End);
            if (readResult.IsCompleted) break;
        }
        await pipe.CompleteAsync();
        return -1;
    }

Here are the associated helper methods:

    /// <summary>
    /// Look for `searchBytes` in `buffer`, incrementing the `lineNumber` every
    /// time we find a new line.
    /// </summary>
    /// <returns>true if we found the searchBytes, false otherwise</returns>
    static bool TryFindBytesInBuffer(
        ref ReadOnlySequence<byte> buffer,
        in ReadOnlySpan<byte> searchBytes,
        ref int lineNumber)
    {
        var bufferReader = new SequenceReader<byte>(buffer);
        while (TryReadLine(ref bufferReader, out var line))
        {
            if (ContainsBytes(ref line, searchBytes))
                return true;
            lineNumber++;
        }
        return false;
    }

    static bool TryReadLine(
        ref SequenceReader<byte> bufferReader,
        out ReadOnlySequence<byte> line)
    {
        var foundNewLine = bufferReader.TryReadTo(out line, (byte)'\n', advancePastDelimiter: true);
        if (!foundNewLine)
        {
            line = default;
            return false;
        }
        return true;
    }

    static bool ContainsBytes(
        ref ReadOnlySequence<byte> line,
        in ReadOnlySpan<byte> searchBytes)
    {
        return new SequenceReader<byte>(line).TryReadTo(out var _, searchBytes);
    }

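A side note on correctness rather than speed: because the read loop calls `pipe.AdvanceTo(buffer.End)`, any bytes after the last `'\n'` in a read are marked as consumed without being searched. Below is a minimal sketch of a variant that carries the partial line over to the next read and also checks a final line with no trailing newline. `PipeLineSearch`, `FindLineAsync`, `TryFindInCompleteLines` and `Contains` are hypothetical names, the method takes a `Stream` only to keep the sketch self-contained, and this is not the code that was benchmarked.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Text;
using System.Threading.Tasks;

public static class PipeLineSearch
{
    // Hypothetical variant of the read loop: returns the 1-based number of the
    // first line containing searchWord, or -1 if no line contains it.
    public static async Task<int> FindLineAsync(Stream stream, string searchWord)
    {
        byte[] searchBytes = Encoding.UTF8.GetBytes(searchWord);
        PipeReader reader = PipeReader.Create(stream);
        int lineNumber = 1;
        try
        {
            while (true)
            {
                ReadResult result = await reader.ReadAsync().ConfigureAwait(false);
                ReadOnlySequence<byte> buffer = result.Buffer;

                // Scans complete lines; afterwards buffer holds only the unread tail.
                if (TryFindInCompleteLines(ref buffer, searchBytes, ref lineNumber))
                    return lineNumber;

                if (result.IsCompleted)
                {
                    // Whatever is left is the final line without a trailing '\n'.
                    return Contains(buffer, searchBytes) ? lineNumber : -1;
                }

                // Consumed: up to the start of the partial line. Examined: everything.
                reader.AdvanceTo(buffer.Start, buffer.End);
            }
        }
        finally
        {
            await reader.CompleteAsync().ConfigureAwait(false);
        }
    }

    static bool TryFindInCompleteLines(
        ref ReadOnlySequence<byte> buffer,
        ReadOnlySpan<byte> searchBytes,
        ref int lineNumber)
    {
        var lineReader = new SequenceReader<byte>(buffer);
        while (lineReader.TryReadTo(out ReadOnlySequence<byte> line, (byte)'\n', advancePastDelimiter: true))
        {
            if (Contains(line, searchBytes))
                return true;
            lineNumber++;
        }
        // Keep the bytes after the last '\n' so the next read can complete the line.
        buffer = buffer.Slice(lineReader.Position);
        return false;
    }

    static bool Contains(in ReadOnlySequence<byte> bytes, ReadOnlySpan<byte> searchBytes)
    {
        return new SequenceReader<byte>(bytes).TryReadTo(out ReadOnlySequence<byte> _, searchBytes);
    }
}
```

Passing `buffer.Start` as the consumed position and `buffer.End` as the examined position tells the reader to keep the partial line and to wait for more data before returning from the next `ReadAsync`.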
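I use SequenceReader<byte> above because my understanding is that it is smarter/faster than working with ReadOnlySequence<byte> directly; it has a fast path for when it can operate on a single Span<byte>.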
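To make the single-span idea concrete, here is a sketch of a search that works on spans explicitly: when the line sits in one contiguous segment it calls `IndexOf` on that span, and otherwise it copies the segments into a pooled array first. `SpanSearch` and this `ContainsBytes` are hypothetical names for illustration. This is not the benchmarked helper, and I have not measured it.

```csharp
using System;
using System.Buffers;

public static class SpanSearch
{
    // Returns true if searchBytes occurs anywhere in line.
    public static bool ContainsBytes(in ReadOnlySequence<byte> line, ReadOnlySpan<byte> searchBytes)
    {
        if (line.IsSingleSegment)
        {
            // Contiguous memory: search the span directly, no copy needed.
            return line.FirstSpan.IndexOf(searchBytes) >= 0;
        }

        // The line straddles segments: copy it into one rented array, then search.
        int length = checked((int)line.Length);
        byte[] rented = ArrayPool<byte>.Shared.Rent(length);
        try
        {
            Span<byte> scratch = rented.AsSpan(0, length);
            line.CopyTo(scratch);
            return scratch.IndexOf(searchBytes) >= 0;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

With a 4096-byte read buffer I would expect most lines to fall inside one segment, so the copy path should be the exception; that is an assumption about the input, not something the sketch checks.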
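Here are the benchmark results (.NET Core 3.1). The full code and BenchmarkDotNet results are available in this repository.
- GetLineNumberUsingStreamAsync - 435.6 ms, 366.19 MB allocated
- GetLineNumberUsingPipeAsync - 619.8 ms, 9.28 MB allocated
Am I doing something wrong in the pipelines code?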
Update: Evk has answered the question. After applying his fix, here are the new benchmark numbers:
- GetLineNumberUsingStreamAsync - 452.2 ms, 366.19 MB allocated
- GetLineNumberUsingPipeAsync - 203.8 ms, 9.28 MB allocated