Reason behind the speed of fread in the data.table package in R

Question


Tags: r, performance, data.table, fread

28

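I am amazed by the speed of the fread function in data.table on large data files, but how does it manage to read so fast? What are the basic implementation differences between fread and read.csv?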

- Vijay

1 Answer

Content provided by Stack Overflow.

- Matt Dowle · Accepted Answer

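I assume we are comparing against read.csv with all the known advice applied, such as setting colClasses, nrows and so on. read.csv(filename) without any other arguments is slow mainly because it first reads everything into memory as if it were character, and then attempts to coerce that to integer or numeric as a second step.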
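As a minimal sketch of the advice mentioned above, the following base R snippet reads the same file twice: once with no hints, and once with colClasses and nrows supplied. The sample file, its column names and the row count are made up for illustration.

```r
# Build a small sample file so the example is self-contained (hypothetical data).
path <- tempfile(fileext = ".csv")
n <- 1000L
sample_data <- data.frame(
  id    = seq_len(n),
  group = rep(c("a", "b"), length.out = n),
  value = seq_len(n) / 10,
  stringsAsFactors = FALSE
)
write.csv(sample_data, path, row.names = FALSE)

# Default call: read.csv has to work out the column types itself.
slow <- read.csv(path, stringsAsFactors = FALSE)

# Tuned call: column types and the number of rows are supplied up front.
fast <- read.csv(
  path,
  colClasses = c("integer", "character", "numeric"),
  nrows = n,
  stringsAsFactors = FALSE
)

# Inspect the result of the tuned call.
str(fast)
unlink(path)
```

On a file this small the timing difference is negligible; the point is only to show which arguments the comparison assumes.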

So, comparing fread to read.csv(filename, colClasses=, nrows=, etc) ...

They are both written in C, so it's not that.

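There isn't one reason in particular, but essentially fread memory-maps the file and then iterates through it using pointers, whereas read.csv reads the file into a buffer via a connection.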
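Base R does not expose memory mapping, so the two access patterns can only be imitated by analogy. The sketch below counts lines in a small made-up file in two ways: by pulling lines through a connection in chunks, and by loading the file as one block of bytes and scanning it by position. Neither function is how read.csv or fread is actually implemented; the helper names are hypothetical.

```r
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b", "3,c"), path)

# Pattern 1: pull the file through a connection in fixed-size chunks.
count_via_connection <- function(path, chunk_size = 2L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    total <- total + length(lines)
  }
  total
}

# Pattern 2: treat the whole file as one block of bytes and scan it directly.
count_via_bytes <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  sum(bytes == as.raw(0x0a))  # count newline bytes
}

n_con   <- count_via_connection(path)
n_bytes <- count_via_bytes(path)
unlink(path)
```

Both counts agree here because every line in the sample file ends with a newline.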

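If you run fread with verbose=TRUE, it will tell you how it works and report the time spent in each step. For example, notice that it skips straight to the middle and the end of the file to make a much better guess of the column types (although in this case the first 5 rows were enough).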

> fread("test.csv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.486 GB
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 6 columns
First row with 6 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 10000001
Subtracted 1 for last eol and any trailing empty lines, leaving 10000000 data rows
Type codes (   first 5 rows): 113431
Type codes (+ middle 5 rows): 113431
Type codes (+   last 5 rows): 113431
Type codes: 113431 (after applying colClasses and integer64)
Type codes: 113431 (after applying drop or select (if supplied)
Allocating 6 column slots (6 - 0 dropped)
Read 10000000 rows and 6 (of 6) columns from 0.486 GB file in 00:00:44
  13.420s ( 31%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   3.210s (  7%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   1.310s (  3%) Allocation of 10000000x6 result (xMB) in RAM
  25.580s ( 59%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.040s (  0%) Changing na.strings to NA
  43.560s        Total
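The "Type codes" lines in the output above show types being guessed from the first, middle and last 5 rows. A rough base R sketch of why sampling beyond the top of the file helps is given below. It reads all lines with readLines instead of jumping around a mapped file, and it relies on read.csv's type guessing instead of fread's type codes, so it only illustrates the idea. The toy data and the guess_types helper are hypothetical.

```r
# Toy file: 'score' looks like an integer until the very last row.
path <- tempfile(fileext = ".csv")
toy <- data.frame(
  id    = 1:100,
  score = c(1:99, 2.5),
  label = rep(c("u", "v"), 50),
  stringsAsFactors = FALSE
)
write.csv(toy, path, row.names = FALSE)

data_lines <- readLines(path)[-1]  # drop the header line
unlink(path)
n <- length(data_lines)
k <- 5L

# Guess one class per column from a handful of sampled lines.
guess_types <- function(sample_lines) {
  sample_df <- read.csv(text = sample_lines, header = FALSE,
                        stringsAsFactors = FALSE)
  vapply(sample_df, function(col) class(col)[1], character(1))
}

# Indices of the first, middle and last k data rows.
mid_start <- n %/% 2L - k %/% 2L
idx <- unique(c(seq_len(k),
                seq(mid_start, mid_start + k - 1L),
                seq(n - k + 1L, n)))

first_only <- guess_types(data_lines[seq_len(k)])
sampled    <- guess_types(data_lines[idx])
print(first_only)
print(sampled)
```

Looking only at the top rows guesses integer for the second column, while the wider sample sees the 2.5 in the last row and guesses numeric, which avoids a later type bump.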

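Note: the timings in the verbose output above were taken on my very slow laptop with no SSD. Both the absolute and the relative time of each step will vary widely from machine to machine. For example, if you rerun fread a second time you may notice that the memory-map step takes much less time, because your operating system has cached the file from the previous run.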

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            20
Model:                 2
Stepping:              0
CPU MHz:               800.000         # i.e. my slow netbook
BogoMIPS:              1995.01
Virtualisation:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
NUMA node0 CPU(s):     0,1
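To check the effect of operating system caching on your own machine, one option is to time the same read more than once. The sketch below uses a hypothetical time_reads helper and a small generated file; it uses fread when data.table is installed and falls back to read.csv otherwise. A file that has just been written is probably already in the cache, so a meaningful comparison needs a large file that has not been touched recently.

```r
# Time the same reader several times on one file; returns elapsed seconds.
time_reads <- function(reader, path, times = 2L) {
  vapply(seq_len(times), function(i) {
    system.time(reader(path))[["elapsed"]]
  }, numeric(1))
}

# Small generated file (hypothetical data) so the example runs anywhere.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = seq_len(1e5), b = seq_len(1e5) / 7),
          path, row.names = FALSE)

# Use fread when data.table is installed; otherwise fall back to read.csv.
reader <- if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::fread
} else {
  read.csv
}

timings <- time_reads(reader, path)
print(timings)
unlink(path)
```

For a per-step breakdown like the one shown earlier, call fread with verbose=TRUE on each run and compare the memory-map line.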