Your chosen example isn't a great fit, as Tudor pointed out. Disk hardware is physically constrained by the spinning platters and the moving head; the most efficient way to read is sequentially, block by block, which minimizes head movement and the wait for the disk to rotate into position.
That said, some operating systems don't always store data contiguously on disk, and, for those who remember, defragmenting could improve disk performance if your OS/filesystem didn't do that work for you.
Since you asked for a program that could benefit, let me suggest a simple one: matrix addition.
Assuming you create one thread per core, you can easily split any two matrices being added into N groups of rows (one group per thread). Matrix addition (in case you've forgotten) works like this:
A + B = C
or
[ a11, a12, a13 ]   [ b11, b12, b13 ]   [ (a11+b11), (a12+b12), (a13+b13) ]
[ a21, a22, a23 ] + [ b21, b22, b23 ] = [ (a21+b21), (a22+b22), (a23+b23) ]
[ a31, a32, a33 ]   [ b31, b32, b33 ]   [ (a31+b31), (a32+b32), (a33+b33) ]
To distribute this across N threads, we simply take each row number modulo the thread count; the remainder is the ID of the thread that will add that row.
matrix with 20 rows across 3 threads
row % 3 == 0 (for rows 0, 3, 6, 9, 12, 15, and 18)
row % 3 == 1 (for rows 1, 4, 7, 10, 13, 16, and 19)
row % 3 == 2 (for rows 2, 5, 8, 11, 14, and 17)
// row 20 doesn't exist, because we number rows from 0
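The partitioning above can be sketched in a few lines (a minimal illustration; the function name `rows_for_thread` is mine, not from any library):

```python
def rows_for_thread(thread_id, num_rows, num_threads):
    """Rows whose index modulo the thread count equals this thread's ID."""
    return [row for row in range(num_rows) if row % num_threads == thread_id]

# 20 rows split across 3 threads, matching the breakdown above
for tid in range(3):
    print(tid, rows_for_thread(tid, 20, 3))
```

Each row lands in exactly one thread's list, so no cell is ever written twice.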
Now each thread "knows" which rows it should handle, and because no result crosses into another thread's territory, each row's result can be computed independently.
All that remains is a "result" data structure to track when each value has been computed; once the last value is set, the computation is done. In this contrived example, computing the matrix-addition result with two threads takes roughly half the time.
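Here is a runnable sketch of the whole scheme using Python's `threading` module (hedged: in CPython the GIL keeps these additions from running truly in parallel, but the row-partitioning logic is the same in any language, and `join()` plays the role of the "result" tracking, since once every thread has finished, every cell of C has been set):

```python
import threading

def add_rows(A, B, C, thread_id, num_threads):
    # This thread handles exactly the rows where row % num_threads == thread_id
    for row in range(thread_id, len(A), num_threads):
        for col in range(len(A[row])):
            C[row][col] = A[row][col] + B[row][col]

def parallel_matrix_add(A, B, num_threads=3):
    # Pre-size the result so threads only write their own rows
    C = [[0] * len(A[0]) for _ in range(len(A))]
    threads = [
        threading.Thread(target=add_rows, args=(A, B, C, tid, num_threads))
        for tid in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # when the last thread finishes, the computation is complete
    return C
```

Because each thread writes disjoint rows, no locking is needed on C itself.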
// the following assumes, for illustrative purposes only, that threads don't get
// rescheduled to different cores. Real threads are scheduled across cores based on
// availability, with attempts to prevent unnecessary core migration of a running thread.
[ done, done, done ] // filled in at about the same time as row 2 (runs on core 3)
[ done, done, done ] // filled in at about the same time as row 1 (runs on core 1)
[ done, done, .... ] // filled in at about the same time as row 4 (runs on core 3)
[ done, ...., .... ] // filled in at about the same time as row 3 (runs on core 1)
Multithreading can tackle far more complex problems, and different problems call for different techniques. I deliberately picked one of the simplest examples.