Fast search in a large text file

Question

3

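I am trying to write a search program in C# that can search for a string in a large text file (5 GB). I have written the simple code below, but the search is very slow and can take about 30 minutes to complete. Here is my code: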

public List<string> Search(string searchKey)
{
    List<string> results = new List<string>();
    // Verbatim string, so the backslash is not read as an escape sequence.
    using (StreamReader fileReader = new StreamReader(@"D:\Logs.txt"))
    {
        string line;
        while ((line = fileReader.ReadLine()) != null)
        {
            if (line.Contains(searchKey))
            {
                results.Add(line);
            }
        }
    }
    return results;
}

The code works, but it is very slow and takes about 30 minutes to complete. Is there anything we can do to bring the search time under one minute?

- Dhawal Dhingra

Does this answer your question? Fastest way to search for a string in a large text file - styx

2

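Are you only searching one file for all matches of a single string? If so, it seems unlikely that you can speed up the search itself. However, 30 minutes to find all matches of one string in a 5 GB file does seem far too long. Is it running over a network connection? How many lines match? - Matthew Watson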

By the way, syntactically you can make this code more concise: var results = File.ReadLines(@"D:\Logs.txt").Where(line => line.Contains(searchKey)).ToList(); - maccettura

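@MatthewWatson - Yes, I am searching a single file for all matches of one string. There can be anywhere from 0 to 50 matches, about 5 on average. And yes, I am running it over a network connection. - Dhawal Dhingra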

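I suggest timing just the file-reading part of the loop (i.e. comment out the if (line.Contains(searchKey)) { results.Add(line); } part). That gives you a lower bound on the time. My guess is that you will find almost all of the time is spent reading from the network. - Matthew Watson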
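The measurement suggested in the comment above could be sketched roughly as follows. This is only an illustration: the names ReadTiming and TimeReadOnly are hypothetical and do not come from the thread. It reads every line without searching, so the elapsed time approximates the lower bound imposed by I/O alone.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ReadTiming
{
    // Hypothetical helper: reads every line without searching it and
    // returns the number of lines read together with the elapsed time.
    public static (long Lines, TimeSpan Elapsed) TimeReadOnly(string path)
    {
        var stopwatch = Stopwatch.StartNew();
        long lines = 0;
        foreach (string line in File.ReadLines(path))
        {
            lines++; // no Contains() call: only the read cost is measured
        }
        stopwatch.Stop();
        return (lines, stopwatch.Elapsed);
    }
}
```

If this read-only time is already close to the 30 minutes reported in the question, a faster matching algorithm will not help much, and the file would need to be read from a faster location.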

3 Answers

- Gopal Purohit · Answer 1

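To search for a string in a very large file, you can use the Boyer-Moore string-search algorithm, which is the standard benchmark in the practical string-search literature. For an implementation, see the following link: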
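As a rough illustration of the idea, here is a minimal sketch of the Boyer-Moore-Horspool variant, a common simplification of Boyer-Moore that uses only the bad-character shift table. It is not the implementation the answer links to, and the names HorspoolSearch and FindAll are hypothetical. It searches an in-memory byte buffer.

```csharp
using System;
using System.Collections.Generic;

public static class HorspoolSearch
{
    // Builds the bad-character table: for each byte value, how far the pattern
    // may slide when that byte is aligned with the pattern's last position.
    private static int[] BuildShiftTable(byte[] pattern)
    {
        var shift = new int[256];
        for (int i = 0; i < shift.Length; i++)
        {
            shift[i] = pattern.Length;
        }
        for (int i = 0; i < pattern.Length - 1; i++)
        {
            shift[pattern[i]] = pattern.Length - 1 - i;
        }
        return shift;
    }

    // Returns the start offset of every occurrence (overlapping ones included).
    public static List<int> FindAll(byte[] text, byte[] pattern)
    {
        var matches = new List<int>();
        if (pattern.Length == 0 || text.Length < pattern.Length)
        {
            return matches;
        }

        int[] shift = BuildShiftTable(pattern);
        int last = pattern.Length - 1;
        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            // Compare from the end of the pattern towards its start.
            int j = last;
            while (j >= 0 && text[pos + j] == pattern[j])
            {
                j--;
            }
            if (j < 0)
            {
                matches.Add(pos);
            }
            // Slide by the shift of the text byte under the pattern's last position.
            pos += shift[text[pos + last]];
        }
        return matches;
    }
}
```

To apply this to a 5 GB file, the file would have to be read in chunks, with consecutive chunks overlapping by pattern.Length - 1 bytes so that a match spanning a chunk boundary is not missed.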

- dynamicbutter · Answer 2

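Gigantor provides a class called RegexSearcher that can do this. I tested it on a 32 GB file, and it took less than 20 seconds on my MacBook Pro. The code is below.

Gigantor substantially improves regular-expression performance and works with extremely large files. You can implement the Search function with Gigantor as follows.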

public List<string> Search(string path, string searchKey)
{
    // Create regex to search for the searchKey
    System.Text.RegularExpressions.Regex regex = new(searchKey);
    List<string> results = new List<string>();

    // Create Gigantor stuff
    System.Threading.AutoResetEvent progress = new(false);
    Imagibee.Gigantor.RegexSearcher searcher = new(
        path, regex, progress, maxMatchCount: 10000);

    // Start the search and wait for completion
    Imagibee.Gigantor.Background.StartAndWait(
        searcher,
        progress,
        (_) => { },
        1000);

    // Check for errors
    if (searcher.Error.Length != 0) {
        throw new Exception(searcher.Error);
    }

    // Open the searched file for reading
    using System.IO.FileStream fileStream = new(path, FileMode.Open);
    Imagibee.Gigantor.StreamReader reader = new(fileStream);

    // Capture the line of each match
    foreach (var match in searcher.GetMatchData()) {
        fileStream.Seek(match.StartFpos, SeekOrigin.Begin);
        results.Add(reader.ReadLine());
    }
    return results;
}

Here is the test code.

[Test]
public void SearchTest()
{
    var path = Path.Combine(Path.GetTempPath(), "enwik9x32");
    Stopwatch stopwatch = new();
    stopwatch.Start();
    var results = Search(path, "unicorn");
    stopwatch.Stop();
    Console.WriteLine($"found {results.Count} results in {stopwatch.Elapsed.TotalSeconds} seconds");
}

Here is the console output.

found 8160 results in 19.1458573 seconds

The source is in the Gigantor repository. I know this is a bit late, but I hope this answer helps someone.

- Stanislav · Answer 3

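File indexing is implemented in the library Bsa.Search.Core.

You can implement your own file reader. FileByLinesRowReader reads the file line by line and adds each line as a document whose external ID is the line number. FileDocumentIndex has been tested on a Wikidata JSON dictionary.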

.Net Core

.Net 472

var selector = new IndexWordSelector();
var morphology = new DefaultMorphology(new WordDictionary(), selector);
var fileName = @"D:\Logs.txt";

// You can implement your own file reader: CSV, JSON, or something else
var index = new FileDocumentIndex(fileName, new FileByLinesRowReader(null), morphology);

// If the index already exists, skip indexing the file
if (!index.IsIndexed)
    index.Start();
while (!index.IsReady)
{
    Thread.Sleep(300);
}

var query = "(\"one\" | two) ~50 (\"error*\")".Parse("*");
var found = index.Search(new SearchQueryRequest()
{
    Field = "*",
    Query = query,
    ShowHighlight = true,
});
// ExternalId is the line number in the file
//found.ShardResult.First().FoundDocs.FirstOrDefault().Value.ExternalId