Reading a text file line by line, with accurate offset/position reporting

Question

12

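My simple requirement: read a huge (more than a million lines) test file (for this example, assume it is a CSV of some sort) and keep a reference to the beginning of each line for faster lookup later (read a line, starting at X).

I tried the naive, easy way first: using a StreamReader and accessing the underlying BaseStream.Position. Unfortunately, that doesn't work as I intended:

Given a file containing the following: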

Foo
Bar
Baz
Bla
Fasel

and this very simple code

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = sr.BaseStream.Position;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
    pos = sr.BaseStream.Position;
  }
}

the output is:

000 Foo
025 Bar
025 Baz
025 Bla
025 Fasel

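I can imagine that the stream is trying to be efficient and reads in (big) chunks whenever new data is needed. For me this is bad. So the question is: is there any way to get the (byte, char) offset while reading a file line by line, without using a basic Stream and handling \r, \n, \r\n and string encoding manually? Not a big deal, really; I just don't like building things that might already exist.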

- Benjamin Podszun
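
For reference, the manual route the question hopes to avoid can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: `LineIndexer`, `IndexLineStarts` and `ReadLineAt` are hypothetical names, the stream is assumed to be positioned at its start, and the encoding is assumed to be one where `\r` and `\n` are single bytes that never appear inside another character (ASCII, UTF-8, most single-byte code pages). A UTF-8 byte order mark, if present, is counted as part of the first line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: records the byte offset at which every line starts
// by scanning the raw bytes for \n, \r and \r\n terminators.
public static class LineIndexer
{
    // Assumes the stream is positioned at its start.
    public static List<long> IndexLineStarts(Stream stream)
    {
        var starts = new List<long>();
        var buffer = new byte[64 * 1024];
        long position = 0;           // absolute offset of the byte being examined
        bool atLineStart = true;     // the next byte seen begins a new line
        bool previousWasCr = false;  // lets "\r\n" count as one terminator
        int read;

        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++, position++)
            {
                byte b = buffer[i];

                if (previousWasCr && b == (byte)'\n')
                {
                    // Second half of "\r\n": still part of the previous terminator.
                    previousWasCr = false;
                    continue;
                }
                previousWasCr = false;

                if (atLineStart)
                {
                    starts.Add(position);
                    atLineStart = false;
                }

                if (b == (byte)'\n' || b == (byte)'\r')
                {
                    atLineStart = true;
                    previousWasCr = b == (byte)'\r';
                }
            }
        }
        return starts;
    }

    // Seeks to a previously recorded offset and decodes exactly one line from there.
    public static string ReadLineAt(Stream stream, long offset, Encoding encoding)
    {
        stream.Seek(offset, SeekOrigin.Begin);
        // leaveOpen: true keeps the caller's stream usable after the reader is disposed.
        using (var reader = new StreamReader(stream, encoding, false, 1024, true))
            return reader.ReadLine();
    }
}
```

The index costs one pass over the file; afterwards any line can be fetched with a single seek. Encodings such as UTF-16 would need a scan over code units rather than bytes.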

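If you reflect the System.IO.Stream class, the minimum allowed buffer is 128 bytes... Not sure whether that helps, but when I tried a longer file, that was the shortest position I was able to get. - Nathan Wheeler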

5 Answers

5

After searching, testing, and trying some crazy things, here is the code that solved it for me (I am currently using this code in my product).

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public sealed class TextFileReader : IDisposable
{

    FileStream _fileStream = null;
    BinaryReader _binReader = null;
    StreamReader _streamReader = null;
    List<string> _lines = null;
    long _length = -1;

    /// <summary>
    /// Initializes a new instance of the <see cref="TextFileReader"/> class with default encoding (UTF8).
    /// </summary>
    /// <param name="filePath">The path to text file.</param>
    public TextFileReader(string filePath) : this(filePath, Encoding.UTF8) { }

    /// <summary>
    /// Initializes a new instance of the <see cref="TextFileReader"/> class.
    /// </summary>
    /// <param name="filePath">The path to text file.</param>
    /// <param name="encoding">The encoding of text file.</param>
    public TextFileReader(string filePath, Encoding encoding)
    {
        if (!File.Exists(filePath))
            throw new FileNotFoundException("File (" + filePath + ") is not found.");

        _fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read);
        _length = _fileStream.Length;
        _binReader = new BinaryReader(_fileStream, encoding);
    }

    /// <summary>
    /// Reads a line of characters from the current stream at the current position and returns the data as a string.
    /// </summary>
    /// <returns>The next line from the input stream, or null if the end of the input stream is reached</returns>
    public string ReadLine()
    {
        if (_binReader.PeekChar() == -1)
            return null;

        // Accumulate characters in a StringBuilder; repeated string
        // concatenation gets very slow on long lines.
        var line = new StringBuilder();
        int nextChar = _binReader.Read();
        while (nextChar != -1)
        {
            char current = (char)nextChar;
            if (current == '\n')
                break;
            if (current == '\r')
            {
                // Treat "\r\n" as a single line terminator.
                if (_binReader.PeekChar() == '\n')
                    _binReader.Read();
                break;
            }
            line.Append(current);
            nextChar = _binReader.Read();
        }
        return line.ToString();
    }

    /// <summary>
    /// Reads some lines of characters from the current stream at the current position and returns the data as a collection of string.
    /// </summary>
    /// <param name="totalLines">The total number of lines to read (set as 0 to read from current position to end of file).</param>
    /// <returns>The next lines from the input stream, or an empty collection if the end of the input stream is reached</returns>
    public List<string> ReadLines(int totalLines)
    {
        if (totalLines < 1 && this.Position == 0)
            return this.ReadAllLines();

        _lines = new List<string>();
        int counter = 0;
        string line = this.ReadLine();
        while (line != null)
        {
            _lines.Add(line);
            counter++;
            if (totalLines > 0 && counter >= totalLines)
                break;
            line = this.ReadLine();
        }
        return _lines;
    }

    /// <summary>
    /// Reads all lines of characters from the current stream (from the beginning to the end) and returns the data as a collection of string.
    /// </summary>
    /// <returns>All lines from the input stream, or an empty collection if the stream is empty</returns>
    public List<string> ReadAllLines()
    {
        if (_streamReader == null)
            _streamReader = new StreamReader(_fileStream);
        _streamReader.BaseStream.Seek(0, SeekOrigin.Begin);
        _lines = new List<string>();
        string line = _streamReader.ReadLine();
        while (line != null)
        {
            _lines.Add(line);
            line = _streamReader.ReadLine();
        }
        return _lines;
    }

    /// <summary>
    /// Gets the length of text file (in bytes).
    /// </summary>
    public long Length
    {
        get { return _length; }
    }

    /// <summary>
    /// Gets or sets the current reading position.
    /// </summary>
    public long Position
    {
        get
        {
            if (_binReader == null)
                return -1;
            else
                return _binReader.BaseStream.Position;
        }
        set
        {
            if (_binReader == null)
                return;
            else if (value >= this.Length)
                this.SetPosition(this.Length);
            else
                this.SetPosition(value);
        }
    }

    void SetPosition(long position)
    {
        _binReader.BaseStream.Seek(position, SeekOrigin.Begin);
    }

    /// <summary>
    /// Gets the lines after reading.
    /// </summary>
    public List<string> Lines
    {
        get
        {
            return _lines;
        }
    }

    /// <summary>
    /// Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.
    /// </summary>
    public void Dispose()
    {
        // Closing a reader also closes the shared underlying FileStream;
        // closing an already-closed stream is harmless.
        // No finalizer is needed: the class holds only managed objects.
        if (_binReader != null)
            _binReader.Close();
        if (_streamReader != null)
            _streamReader.Close();
        if (_fileStream != null)
            _fileStream.Dispose();
    }
}

- Quynh Nguyen

3

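This is a really tough issue. After a very long and tedious enumeration of different solutions on the internet (including the solutions from this thread, thank you!), I had to come up with my own approach. My requirements were as follows: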

Performance - reading must be very fast, so reading one char at a time or using reflection is not acceptable; buffering is required
Streaming - the file can be huge, so reading it entirely into memory is not acceptable
Tailing - file tailing should be available
Long lines - lines can be very long, so the buffer size can't be limited
Stable - a single-byte error was immediately visible during use. Unfortunately, several implementations I found had stability problems

using System;
using System.IO;
using System.Text;

public class OffsetStreamReader
{
    private const int InitialBufferSize = 4096;
    private readonly char _bom;
    private readonly byte _end;
    private readonly Encoding _encoding;
    private readonly Stream _stream;
    private readonly bool _tail;

    private byte[] _buffer;
    private int _processedInBuffer;
    private int _informationInBuffer;

    public OffsetStreamReader(Stream stream, bool tail)
    {
        _buffer = new byte[InitialBufferSize];
        _processedInBuffer = InitialBufferSize;

        if (stream == null || !stream.CanRead)
            throw new ArgumentException("stream");

        _stream = stream;
        _tail = tail;
        _encoding = Encoding.UTF8;

        _bom = '\uFEFF';
        _end = _encoding.GetBytes(new [] {'\n'})[0];
    }

    public long Offset { get; private set; }

    public string ReadLine()
    {
        // Underlying stream closed
        if (!_stream.CanRead)
            return null;

        // EOF
        if (_processedInBuffer == _informationInBuffer)
        {
            if (_tail)
            {
                _processedInBuffer = _buffer.Length;
                _informationInBuffer = 0;
                ReadBuffer();
            }

            return null;
        }

        var lineEnd = Search(_buffer, _end, _processedInBuffer);
        var haveEnd = true;

        // File ended but no finalizing newline character
        if (lineEnd.HasValue == false && _informationInBuffer + _processedInBuffer < _buffer.Length)
        {
            if (_tail)
                return null;
            else
            {
                lineEnd = _informationInBuffer;
                haveEnd = false;
            }
        }

        // No end in current buffer
        if (!lineEnd.HasValue)
        {
            ReadBuffer();
            if (_informationInBuffer != 0)
                return ReadLine();

            return null;
        }

        var arr = new byte[lineEnd.Value - _processedInBuffer];
        Array.Copy(_buffer, _processedInBuffer, arr, 0, arr.Length);

        Offset = Offset + lineEnd.Value - _processedInBuffer + (haveEnd ? 1 : 0);
        _processedInBuffer = lineEnd.Value + (haveEnd ? 1 : 0);

        return _encoding.GetString(arr).TrimStart(_bom).TrimEnd('\r', '\n');
    }

    private void ReadBuffer()
    {
        var notProcessedPartLength = _buffer.Length - _processedInBuffer;

        // Extend buffer to be able to fit whole line to the buffer
        // Was     [NOT_PROCESSED]
        // Become  [NOT_PROCESSED        ]
        if (notProcessedPartLength == _buffer.Length)
        {
            var extendedBuffer = new byte[_buffer.Length + _buffer.Length/2];
            Array.Copy(_buffer, extendedBuffer, _buffer.Length);
            _buffer = extendedBuffer;
        }

        // Copy not processed information to the beginning
        // Was    [PROCESSED NOT_PROCESSED]
        // Become [NOT_PROCESSED          ]
        Array.Copy(_buffer, (long) _processedInBuffer, _buffer, 0, notProcessedPartLength);

        // Read more information to the empty part of buffer
        // Was    [ NOT_PROCESSED                   ]
        // Become [ NOT_PROCESSED NEW_NOT_PROCESSED ]
        _informationInBuffer = notProcessedPartLength + _stream.Read(_buffer, notProcessedPartLength, _buffer.Length - notProcessedPartLength);

        _processedInBuffer = 0;
    }

    private int? Search(byte[] buffer, byte byteToSearch, int bufferOffset)
    {
        for (int i = bufferOffset; i < buffer.Length - 1; i++)
        {
            if (buffer[i] == byteToSearch)
                return i;
        }
        return null;
    }
}

- Anton

1

I have a log file that sends the offset reader into an infinite loop when I read it... - rekna

Could you share that file somehow? - Anton

2

Although Thomas Levesque's solution works well, here is mine. It uses reflection, so it will be slower, but it is encoding-independent. I also added a Seek extension.

using System;
using System.IO;
using System.Reflection;

/// <summary>Useful <see cref="StreamReader"/> extensions.</summary>
public static class StreamReaderExtentions
{
    /// <summary>Gets the position within the <see cref="StreamReader.BaseStream"/> of the <see cref="StreamReader"/>.</summary>
    /// <remarks><para>This method is quite slow. It uses reflection to access private <see cref="StreamReader"/> fields. Don't use it too often.</para></remarks>
    /// <param name="streamReader">Source <see cref="StreamReader"/>.</param>
    /// <exception cref="ArgumentNullException">Occurs when passed <see cref="StreamReader"/> is null.</exception>
    /// <returns>The current position of this stream.</returns>
    public static long GetPosition(this StreamReader streamReader)
    {
        if (streamReader == null)
            throw new ArgumentNullException("streamReader");

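        // Note: these private field names are implementation details of StreamReader and differ between runtimes
        // (.NET Framework uses charBuffer/charPos/charLen; .NET Core and later prefix them with an underscore).
        var charBuffer = (char[])streamReader.GetType().InvokeMember("charBuffer", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);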
        var charPos = (int)streamReader.GetType().InvokeMember("charPos", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
        var charLen = (int)streamReader.GetType().InvokeMember("charLen", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);

        var offsetLength = streamReader.CurrentEncoding.GetByteCount(charBuffer, charPos, charLen - charPos);

        return streamReader.BaseStream.Position - offsetLength;
    }

    /// <summary>Sets the position within the <see cref="StreamReader.BaseStream"/> of the <see cref="StreamReader"/>.</summary>
    /// <remarks>
    /// <para><see cref="StreamReader.BaseStream"/> should be seekable.</para>
    /// <para>This method is quite slow. It uses reflection and flushes the charBuffer of the <see cref="StreamReader.BaseStream"/>. Don't use it too often.</para>
    /// </remarks>
    /// <param name="streamReader">Source <see cref="StreamReader"/>.</param>
    /// <param name="position">The point relative to origin from which to begin seeking.</param>
    /// <param name="origin">Specifies the beginning, the end, or the current position as a reference point for origin, using a value of type <see cref="SeekOrigin"/>. </param>
    /// <exception cref="ArgumentNullException">Occurs when passed <see cref="StreamReader"/> is null.</exception>
    /// <exception cref="ArgumentException">Occurs when <see cref="StreamReader.BaseStream"/> is not seekable.</exception>
    /// <returns>The new position in the stream. This position can be different from <paramref name="position"/> because of the preamble.</returns>
    public static long Seek(this StreamReader streamReader, long position, SeekOrigin origin)
    {
        if (streamReader == null)
            throw new ArgumentNullException("streamReader");

        if (!streamReader.BaseStream.CanSeek)
            throw new ArgumentException("Underlying stream should be seekable.", "streamReader");

        var preamble = (byte[])streamReader.GetType().InvokeMember("_preamble", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
        if (preamble.Length > 0 && position < preamble.Length) // preamble or BOM must be skipped
            position += preamble.Length;

        var newPosition = streamReader.BaseStream.Seek(position, origin); // seek
        streamReader.DiscardBufferedData(); // this updates the buffer

        return newPosition;
    }
}

- Sergey Alekseev

-1

Would this work:

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = 0;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
    pos += line.Length;
  }
}

- Sani Huttunen

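Unfortunately I can't do that, since I have to accept different kinds of line breaks (think \n, \r\n, \r), which would skew the numbers. This might work if I insisted on a _consistent_ line delimiter (in practice they could well be mixed) and probed it before starting, to know the real offset. So - I'm trying to avoid going down that road. - Benjamin Podszun

@Benjamin: Darn - I just posted a similar answer that explicitly relies on a consistent line delimiter... - Jon Skeet

Then I think you're best off doing it manually with StreamReader.Read(). - Sani Huttunen

@Jon: Hehe. As I said: it's a possible approach instead of using a plain Stream. If those are my only two options, I have to roll the dice and live with the consequences: either a consistent delimiter (bad for files that are processed on multiple platforms, copy/pasted in bad editors, and so on) or the Stream stuff (boring low-level line parsing and string-encoding mess, which looks like a lot of boilerplate for little reward). - Benjamin Podszun

That doesn't help. I'd have to drop the whole StreamReader. Even a Read() on it causes a blocking read on the underlying stream and moves BaseStream.Position to 25 in my sample. After just one character. - Benjamin Podszun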
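
The consistent-delimiter variant discussed in these comments could look roughly like the sketch below. It is an illustration only: `ConsistentNewlineOffsets` and `Read` are hypothetical names, and it assumes that every line, including the last one read before the next, ends with exactly the `newline` string passed in and that the stream carries no byte order mark. With mixed line endings the offsets drift, which is exactly the objection raised above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: derives byte offsets from the decoded lines,
// assuming one known, consistent line terminator throughout the file.
public static class ConsistentNewlineOffsets
{
    public static List<KeyValuePair<long, string>> Read(Stream stream, Encoding encoding, string newline)
    {
        var result = new List<KeyValuePair<long, string>>();
        int newlineBytes = encoding.GetByteCount(newline);

        // leaveOpen: true so the caller keeps ownership of the stream.
        using (var reader = new StreamReader(stream, encoding, false, 4096, true))
        {
            long offset = 0;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                result.Add(new KeyValuePair<long, string>(offset, line));
                // Re-encode the line to learn how many bytes it occupied,
                // then add the (assumed) terminator length.
                offset += encoding.GetByteCount(line) + newlineBytes;
            }
        }
        return result;
    }
}
```

Re-encoding each line only gives the right byte count when the file's bytes round-trip through the chosen encoding unchanged, so invalid byte sequences would also skew the result.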

- Thomas Levesque · Accepted Answer

You can create a TextReader wrapper that tracks the current position in the base TextReader:

using System;
using System.IO;

public class TrackingTextReader : TextReader
{
    private TextReader _baseReader;
    private int _position;

    public TrackingTextReader(TextReader baseReader)
    {
        _baseReader = baseReader;
    }

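    public override int Read()
    {
        int c = _baseReader.Read();
        // Advance only when a character was actually consumed (not at end of input).
        if (c != -1)
            _position++;
        return c;
    }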

    public override int Peek()
    {
        return _baseReader.Peek();
    }

    public int Position
    {
        get { return _position; }
    }
}

You can then use it as follows:

string text = @"Foo
Bar
Baz
Bla
Fasel";

using (var reader = new StringReader(text))
using (var trackingReader = new TrackingTextReader(reader))
{
    string line;
    while ((line = trackingReader.ReadLine()) != null)
    {
        Console.WriteLine("{0:d3} {1}", trackingReader.Position, line);
    }
}
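
To store where each line begins rather than where it ends, the position can be captured before every ReadLine call. The following is a self-contained sketch of that idea, not the answer's own code: `CountingTextReader` and `LineStartDemo.ReadWithStarts` are hypothetical names. It relies on the base `TextReader.ReadLine` implementation, which consumes input through the overridden `Read()` and `Peek()`. The positions are character offsets; they equal byte offsets only for single-byte encodings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical wrapper: counts the characters actually consumed from the inner reader.
public sealed class CountingTextReader : TextReader
{
    private readonly TextReader _inner;

    public CountingTextReader(TextReader inner)
    {
        _inner = inner;
    }

    // Number of characters consumed so far.
    public long Position { get; private set; }

    public override int Read()
    {
        int c = _inner.Read();
        if (c != -1)
            Position++; // do not count the end-of-input marker
        return c;
    }

    public override int Peek()
    {
        return _inner.Peek();
    }
}

public static class LineStartDemo
{
    // Returns (start position, line) pairs for every line in the source.
    public static List<KeyValuePair<long, string>> ReadWithStarts(TextReader source)
    {
        var result = new List<KeyValuePair<long, string>>();
        var reader = new CountingTextReader(source);

        long start = reader.Position; // position before the first line
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            result.Add(new KeyValuePair<long, string>(start, line));
            start = reader.Position;  // the next line begins where this one ended
        }
        return result;
    }
}
```

Because the line terminator is consumed through `Read()` as well, \n, \r and \r\n are all accounted for without special handling. Reading one character at a time through virtual calls is slower than a buffered StreamReader, so this trades speed for simplicity.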