如何在C#中搜索大型文本文件中的字符串而不需要逐行搜索？

Question

如何在C#中搜索大型文本文件中的字符串而不需要逐行搜索？

c#searchtext

14

我有一个大型文本文件，需要搜索特定的字符串。是否有一种快速的方法可以不逐行读取而完成？

由于文件大小超过100MB，这种方法非常缓慢。

- Justin984

6

你有对你的程序进行过剖析吗？ - Lasse V. Karlsen

5

这个文件经常变动吗，还是静态的？如果它是静态的，你可以运用离线算法并进行索引，这样在运行时就可以快速到达所需文档的子部分。 - Polaris878

我看到很多建议将文件分段读入内存，但是如果搜索术语的起始点在一个文件段中，而结束点在另一个文件段中，该如何处理呢？也许可以加载重叠的文件段，如果出现这种情况，则下一个读取的块应包含整个术语。 - ProfK

14个回答

5

这是我的解决方案，使用流一次读取一个字符。我创建了一个自定义类来逐个字符搜索值，直到找到整个值。

我在网络驱动器上测试了一个100MB的文件，并且速度完全取决于它能够读取文件的速度。如果文件在Windows中被缓冲，则对整个文件的搜索只需要不到3秒钟。否则，根据网络速度，搜索可能需要7秒钟到60秒钟不等。

如果针对内存中的字符串运行搜索并且没有匹配的字符，则搜索本身只需要不到一秒钟。如果发现大量前导字符匹配，搜索可能需要更长时间。

public static int FindInFile(string fileName, string value)
{   // returns complement of number of characters in file if not found
    // else returns index where value found
    int index = 0;
    using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName))
    {
        if (String.IsNullOrEmpty(value))
            return 0;
        StringSearch valueSearch = new StringSearch(value);
        int readChar;
        while ((readChar = reader.Read()) >= 0)
        {
            ++index;
            if (valueSearch.Found(readChar))
                return index - value.Length;
        }
    }
    return ~index;
}
public class StringSearch
{   // Call Found one character at a time until string found
    private readonly string value;
    private readonly List<int> indexList = new List<int>();
    public StringSearch(string value)
    {
        this.value = value;
    }
    public bool Found(int nextChar)
    {
        for (int index = 0; index < indexList.Count; )
        {
            int valueIndex = indexList[index];
            if (value[valueIndex] == nextChar)
            {
                ++valueIndex;
                if (valueIndex == value.Length)
                {
                    indexList[index] = indexList[indexList.Count - 1];
                    indexList.RemoveAt(indexList.Count - 1);
                    return true;
                }
                else
                {
                    indexList[index] = valueIndex;
                    ++index;
                }
            }
            else
            {   // next char does not match
                indexList[index] = indexList[indexList.Count - 1];
                indexList.RemoveAt(indexList.Count - 1);
            }
        }
        if (value[0] == nextChar)
        {
            if (value.Length == 1)
                return true;
            indexList.Add(1);
        }
        return false;
    }
    public void Reset()
    {
        indexList.Clear();
    }
}

- DRBlaise

2

在所有情况下，您都必须检查整个文件。

查找Rabin-Karp字符串搜索或类似的内容。

- Pavel Radzivilovsky

2

不一定每次都需要搜索它。如果要搜索同一个文件很多次，建立文件索引可能是有意义的，这样只需要对整个文件进行一次遍历，就可以实现任意数量的快速查找。 - Daniel Earwicker

2

你的项目需要每次搜索不同的文件以查找相同或不同的字符串，还是每次搜索相同的文件以查找不同的字符串？

如果是后者，你可以建立文件的索引。但是，如果文件经常更改，建立索引就没有意义，因为建立索引会很昂贵。

要为全文搜索索引文件，你可以使用Lucene.NET库。

http://incubator.apache.org/lucene.net/

- Daniel Earwicker

2

搜索最快的方法是Boyer-Moore算法。这种方法不需要读取文件中的所有字节，但需要随机访问字节。此外，该方法实现简单。

- Vitaly Dyatlov

1

你应该能够逐个字符读取文件，并将每个字符与搜索字符串中的字符进行匹配，直到达到搜索字符串的末尾为止，这样就表示找到了匹配项。如果在任何时候，所读取的字符与所寻找的字符不匹配，就将匹配计数重置为0，然后重新开始。例如（****伪代码/未经测试****）：

byte[] lookingFor = System.Text.Encoding.UTF8.GetBytes("hello world");
int index = 0;
int position = 0;
bool matchFound = false;

using (FileStream fileStream = new FileStream(fileName, FileMode.Open))
{
  while (fileStream.ReadByte() == lookingFor[index])
  {
    index++;

    if (index == lookingFor.length) 
    {
       matchFound = true;
       position = File.position - lookingFor.length;
       break;
    }
  }
}

这是其中一个算法，你可以使用它（虽然长度检查可能会差一个）。它只会找到第一个匹配项，所以你可能需要将while循环包装在另一个循环中以查找多个匹配项。

另外，关于逐行读取文件的一件事情是，如果要匹配的字符串跨越多行，那么你将找不到它。如果这没问题，那么你可以逐行搜索，但如果你需要搜索字符串跨越多行，你将想使用类似上面详细说明的算法。

最后，如果你正在寻找最佳速度，听起来像是这样，你将想将以上代码迁移到使用StreamReader或其他缓冲读取器。

- Brian Hasden

1

你可以一次性将文件中大量数据缓冲到内存中，直到达到所需的限制，然后搜索字符串。

这样做可以减少文件读取次数，并且可能是一种更快的方法，但如果设置的缓冲区大小过高，则会占用更多的内存。

- Chris Kannon

1

如Wayne Cornish所说：逐行阅读可能是最好的方法。

例如，如果您将整个文件读入字符串，然后使用正则表达式搜索，这可能更加优美，但您将创建一个大的字符串对象。

这些类型的对象可能会导致问题，因为它们将存储在大对象堆（LOH）上（对于超过85,000字节的对象）。如果您解析许多这些大文件并且您的内存有限（x86），那么您很可能会遇到LOH碎片问题。

=> 如果您解析许多大文件，则最好逐行阅读！

- Knasterbax

1

这是一个逐个读取字符的简单单功能解决方案。对我来说效果很好。

/// <summary>
/// Find <paramref name="toFind"/> in <paramref name="reader"/>.
/// </summary>
/// <param name="reader">The <see cref="TextReader"/> to find <paramref name="toFind"/> in.</param>
/// <param name="toFind">The string to find.</param>
/// <returns>Position within <paramref name="reader"/> where <paramref name="toFind"/> starts or -1 if not found.</returns>
/// <exception cref="ArgumentNullException">When <paramref name="reader"/> is null.</exception>
/// <exception cref="ArgumentException">When <paramref name="toFind"/> is null or empty.</exception>
public int FindString(TextReader reader, string toFind)
{
    if(reader == null)
        throw new ArgumentNullException("reader");

    if(string.IsNullOrEmpty(toFind))
        throw new ArgumentException("String to find may not be null or empty.");

    int charsRead = -1;
    int pos = 0;
    int chr;

    do
    {
        charsRead++;
        chr = reader.Read();
        pos = chr == toFind[pos] ? pos + 1 : 0;
    }
    while(chr >= 0 && pos < toFind.Length);

    int result = chr < 0 ? -1 : charsRead - toFind.Length;
    return result < 0 ? -1 : result;
}

希望这有所帮助。

- Jochen

0

如果你想加快逐行读取的速度，可以创建一个基于队列的应用程序：
一个线程读取行并将它们排入线程安全的队列中。第二个线程可以处理这些字符串。

- Matthias

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Wayne Cornish · Accepted Answer

考虑到文件的大小，您是否真的想事先将它们全部读入内存中？逐行阅读可能是最好的方法。