一个大文本文件中的词频统计

Question

一个大文本文件中的词频统计

c#multithreadingperformancealgorithmdata-structures

14

我正在尝试读取一个大文本文件，并输出其中不同的单词以及它们的计数。到目前为止，我已经尝试了几种方法，而这绝对是我想出的最快速的解决方案。

private static readonly char[] separators = { ' ' };

public IDictionary<string, int> Parse(string path)
{
    var wordCount = new Dictionary<string, int>();

    using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
    using (var streamReader = new StreamReader(fileStream))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var word in words)
            {
                if (wordCount.ContainsKey(word))
                {
                    wordCount[word] = wordCount[word] + 1;
                }
                else
                {
                    wordCount.Add(word, 1);
                }
            }
        }
    }

    return wordCount;
}

如何测量我的解决方案

我有一个200MB的文本，我知道其中单词的总数（通过文本编辑器）。我正在使用Stopwatch类来计算单词数量以确保准确性并测量所需时间。到目前为止，大约需要9秒钟。

其他尝试

我尝试使用TPL库利用多线程拆分工作。这涉及批处理多个行、将批处理行的处理发送到单独的任务并在字典中锁定读/写操作。然而，这似乎没有提供任何性能改进。
这花费了大约30秒钟。我怀疑读/写字典的锁定成本太高，无法获得任何性能优势。
我还查看了ConcurrentDictionary类型，但根据我的理解，AddOrUpdate方法需要调用代码处理同步，并没有带来性能优势。

我相信有更快的方式来实现这一点！是否有更好的数据结构可用于此问题？

欢迎提出任何建议/批评——我想学习和进步！

谢谢。

更新：这是我使用的测试文件的链接。

- Prabu

1

你的源文件是什么？200MB 的文本可能相当于一整部百科全书！ - Jason Watkins

2

批处理多行。也许将整个内容分成核心数（n）的分区，并使用n个字典而不锁定它们会更好。然后将它们合并成一个可能会更快，特别是对于许多重复单词的情况。 - TaW

3

这似乎可以有效地使用MapReduce范例来解决。下面是一个答案，解释了如何将MapReduce应用于与您所提出的问题非常相似的情况：https://dev59.com/6Wct5IYBdhLWcg3wEJYm - John

1

@pwee167 只是出于好奇，你有没有测试过程序读取200MB文件所需的时间？因为200MB/9秒 = 21MB/秒... 这已经很不错了。 - xanatos

1

也许尝试逐行阅读并添加到列表/数组中，然后使用某种形式的并行性（例如TPL的foreach）对列表/数组中的每一行执行处理。您可以尝试使用并发字典来进行线程安全访问 https://msdn.microsoft.com/en-us/library/dd287191(v=vs.110).aspx - User92

显示剩余12条评论

6个回答

6

你的方法与大多数人处理它的方式一致。你注意到使用多线程并没有提供明显的好处是正确的，因为瓶颈很可能是IO限制，并且无论你有什么样的硬件，你都不能比硬件支持的速度更快地读取。

如果你真的想要提高速度（我怀疑你会得到任何提高），你可以尝试实现生产者-消费者模式，其中一个线程读取文件，其他线程处理行（然后在一行中并行检查单词）。在这里进行权衡，你将添加更复杂的代码以换取微小的改进（只有基准测试才能确定）。

编辑：还可以查看ConcurrentDictionary http://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem 和ConcurrentDictionary。

- BlackBox

我看了一下ConcurrentDictionary，本以为它可能会有用，但是从今天的研究来看，我相信AddOrUpdate方法不能保证不会出现脏读/写。 - Prabu

6

我只是简单地更改了以下内容，就获得了很大的改善（从200MB的文件中将时间减少了25秒到20秒）：

int cnt;

if (wordCount.TryGetValue(word, out cnt))
{
    wordCount[word] = cnt + 1;
}
else
....

一种基于ConcurrentDictionary<>和Parallel.ForEach（使用IEnumerable<>重载）的变体。请注意，我使用的是InterlockedInt而不是int，它使用Interlocked.Increment来增加自身。作为引用类型，它可以正确地与ConcurrentDictionary<>.GetOrAdd一起使用...

public class InterlockedInt
{
    private int cnt;

    public int Cnt
    {
        get
        {
            return cnt;
        }
    }

    public void Increment()
    {
        Interlocked.Increment(ref cnt);
    }
}

public static IDictionary<string, InterlockedInt> Parse(string path)
{
    var wordCount = new ConcurrentDictionary<string, InterlockedInt>();

    Action<string> action = line2 =>
    {
        var words = line2.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        foreach (var word in words)
        {
            wordCount.GetOrAdd(word, x => new InterlockedInt()).Increment();
        }
    };

    IEnumerable<string> lines = File.ReadLines(path);
    Parallel.ForEach(lines, action);

    return wordCount;
}

请注意，使用Parallel.ForEach不如直接为每个物理核心使用一个线程高效（您可以在历史记录中查看）。虽然两种解决方案在我的PC上都只需不到10秒的“墙”时钟，但Parallel.ForEach使用了55秒的CPU时间，而Thread解决方案则使用了33秒。

还有一个被估值为5-10%的小技巧：

public static IEnumerable<T[]> ToBlock<T>(IEnumerable<T> source, int num)
{
    var array = new T[num];
    int cnt = 0;

    foreach (T row in source)
    {
        array[cnt] = row;
        cnt++;

        if (cnt == num)
        {
            yield return array;
            array = new T[num];
            cnt = 0;
        }
    }

    if (cnt != 0)
    {
        Array.Resize(ref array, cnt);
        yield return array;
    }
}

您可以将行分组（选择10到100之间的数字）成数据包，这样就会减少线程内部的通信。工作线程随后需要对接收到的行进行 foreach 操作。

- xanatos

好的。我会将这个改进加入到解决方案中。我本应该一开始就使用它！ - Prabu

@pwee167 你知道 ++wordCount[word] 也可以工作，对吧？ - Yuval Itzchakov

@YuvalItzchakov 你回到额外的一个读取...那将是无用的！ - xanatos

@LasseV.Karlsen 我的替换仅适用于if语句的初始部分...else部分保持不变。 - xanatos

在我的电脑上，这可以减少大约15％的时间。不算微不足道，但也不是很多：http://i.stack.imgur.com/EUrTC.png - Lasse V. Karlsen

显示剩余3条评论

2

使用一个200MB的文本文件，在我的电脑上以下操作仅需5秒不到。

    class Program
{
    private static readonly char[] separators = { ' ' };
    private static List<string> lines;
    private static ConcurrentDictionary<string, int> freqeuncyDictionary;

    static void Main(string[] args)
    {
        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        string path = @"C:\Users\James\Desktop\New Text Document.txt";
        lines = ReadLines(path);
        ConcurrentDictionary<string, int> test = GetFrequencyFromLines(lines);

        stopwatch.Stop();
        Console.WriteLine(@"Complete after: " + stopwatch.Elapsed.TotalSeconds);
    }

    private static List<string> ReadLines(string path)
    {
        lines = new List<string>();
        using (var fileStream = File.Open(path, FileMode.Open, FileAccess.Read))
        {
            using (var streamReader = new StreamReader(fileStream))
            {
                string line;
                while ((line = streamReader.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
        }
        return lines;            
    }

    public static ConcurrentDictionary<string, int> GetFrequencyFromLines(List<string> lines)
    {
        freqeuncyDictionary = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(lines, line =>
        {
            var words = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            foreach (var word in words)
            {
                if (freqeuncyDictionary.ContainsKey(word))
                {
                    freqeuncyDictionary[word] = freqeuncyDictionary[word] + 1;
                }
                else
                {
                    freqeuncyDictionary.AddOrUpdate(word, 1, (key, oldValue) => oldValue + 1);
                }
            }
        });

        return freqeuncyDictionary;
    }
}

- User92

我正在尝试避免将整个文件读入内存，因为在其他情况下文件可能会很大。 - Prabu

好的。在你的问题中，你提到尝试分批读取行并使用线程，但由于（你认为）锁定字典而导致速度变慢。在这种情况下，你尝试过使用并发字典吗？此外，使用你在问题中提供的文件，大约需要7秒钟，但使用TryGetValue后降至6秒钟。 - User92

我曾经研究过ConcurrentDictionary类型，一开始认为它可能很有用，但是今天的研究表明AddOrUpdate方法不能保证读写操作的干净性。这意味着我需要在字典中增加计数时进行自己的同步处理，这并没有给我带来任何性能上的提升。因此，我又回到了标准的字典类型。 - Prabu

1

如果您想计算特定单词的数量，可以使用函数strtok 链接在这里并将每个单词与要评估的单词进行比较，我认为这不会很耗费时间，但我从未尝试过大文件...

- Custodius

1

不，那不是他想要的。他想要创建一个包含所有单词及其计数的列表。strtok除了分割之外还有其他功能吗？ - TaW

好的，我误解了问题。strtok函数可以分割字符串，然后你再进行比较。但是对于多个单词来说，情况会更加复杂。非常抱歉我误解了问题，并感谢您的纠正。 - Custodius

1

我建议将流缓冲区大小设置得更大，并匹配：

    using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 8192))
    using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, false, 8192))

首先，您的代码导致缓冲区过小，无法处理这种工作。其次，由于读取器的缓冲区比流的缓冲区小，因此数据首先被复制到流的缓冲区，然后再复制到读取器的缓冲区。对于您正在进行的类型的工作，这可能会破坏性能。

当缓冲区大小匹配时，流的缓冲区将永远不会被使用 - 实际上它也永远不会被分配。

- glenebob

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Lasse V. Karlsen · Accepted Answer

我能给出的最佳简短答案就是：测量、测量、测量。使用 Stopwatch 可以感受代码执行时间，但最终你会不得不在大段代码中加入它，或者找到更好的工具。我建议为此获取专用的性能分析器工具，对于 C# 和 .NET 有许多可用的。

我已经通过三个步骤成功地缩减了总运行时间的约43%。

首先，我测量了您的代码并获得了以下结果：

Original code measurements

这似乎表明我们可以尝试解决两个热点问题：

字符串拆分（SplitInternal）
字典维护（FindEntry、Insert、get_Item）

最后一部分时间花费在读取文件上，我真的怀疑我们改变那部分代码不会带来太多收益。这里的另一个答案提到使用特定的缓冲区大小，我尝试过这个方法，但无法获得可衡量的差异。

第一个问题——字符串拆分——比较容易解决，但需要将一个非常简单的 string.Split 调用重新编写成更多的代码。我将处理一行的循环改写为：

while ((line = streamReader.ReadLine()) != null)
{
    int lastPos = 0;
    for (int index = 0; index <= line.Length; index++)
    {
        if (index == line.Length || line[index] == ' ')
        {
            if (lastPos < index)
            {
                string word = line.Substring(lastPos, index - lastPos);
                // process word here
            }
            lastPos = index + 1;
        }
    }
}

我随后将一个单词的处理重写为以下方式：

int currentCount;
wordCount.TryGetValue(word, out currentCount);
wordCount[word] = currentCount + 1;

这取决于以下事实:

TryGetValue方法比检查单词是否存在并检索其当前计数更为便宜。
如果TryGetValue无法获取值（键不存在），则会将currentCount变量初始化为其默认值0。这意味着我们不需要真正检查单词是否存在。
我们可以通过索引器将新单词添加到字典中（它将覆盖现有值或向字典添加新的键+值）。

因此，最终循环看起来像这样:

while ((line = streamReader.ReadLine()) != null)
{
    int lastPos = 0;
    for (int index = 0; index <= line.Length; index++)
    {
        if (index == line.Length || line[index] == ' ')
        {
            if (lastPos < index)
            {
                string word = line.Substring(lastPos, index - lastPos);
                int currentCount;
                wordCount.TryGetValue(word, out currentCount);
                wordCount[word] = currentCount + 1;
            }
            lastPos = index + 1;
        }
    }
}

新的测量结果显示如下：

new measurement

详细信息:

我们从6876毫秒降到了5013毫秒
我们失去了在SplitInternal、FindEntry和get_Item中花费的时间
我们获得了在TryGetValue和Substring中花费的时间

这是差异详细信息：

difference

正如您所看到的，我们失去的时间比我们获得的新时间多，这导致了净改善。

然而，我们可以做得更好。我们在这里进行了2个字典查找，这涉及计算单词的哈希码，并将其与字典中的键进行比较。第一个查找是TryGetValue的一部分，第二个查找是wordCount[word] = ...的一部分。

我们可以通过在字典内创建智能数据结构来消除第二个字典查找，代价是使用更多的堆内存。

我们可以使用Xanatos的技巧，在对象内部存储计数，以便我们可以删除第二个字典查找：

public class WordCount
{
    public int Count;
}

...

var wordCount = new Dictionary<string, WordCount>();

...

string word = line.Substring(lastPos, index - lastPos);
WordCount currentCount;
if (!wordCount.TryGetValue(word, out currentCount))
    wordCount[word] = currentCount = new WordCount();
currentCount.Count++;

这将只从词典中提取计数，增加1个额外出现不涉及词典。该方法的结果也会更改为返回此WordCount类型作为词典的一部分，而不仅仅是一个int。

最终结果：节省约43%。

最终结果

最终代码：

public class WordCount
{
    public int Count;
}

public static IDictionary<string, WordCount> Parse(string path)
{
    var wordCount = new Dictionary<string, WordCount>();

    using (var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
    using (var streamReader = new StreamReader(fileStream, Encoding.Default, false, 65536))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            int lastPos = 0;
            for (int index = 0; index <= line.Length; index++)
            {
                if (index == line.Length || line[index] == ' ')
                {
                    if (lastPos < index)
                    {
                        string word = line.Substring(lastPos, index - lastPos);
                        WordCount currentCount;
                        if (!wordCount.TryGetValue(word, out currentCount))
                            wordCount[word] = currentCount = new WordCount();
                        currentCount.Count++;
                    }
                    lastPos = index + 1;
                }
            }
        }
    }

    return wordCount;
}