





So, comparing fread to read.csv(filename, colClasses=, nrows=, etc)...




> fread("test.csv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.486 GB
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 6 columns
First row with 6 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 10000001
Subtracted 1 for last eol and any trailing empty lines, leaving 10000000 data rows
Type codes (   first 5 rows): 113431
Type codes (+ middle 5 rows): 113431
Type codes (+   last 5 rows): 113431
Type codes: 113431 (after applying colClasses and integer64)
Type codes: 113431 (after applying drop or select (if supplied)
Allocating 6 column slots (6 - 0 dropped)
Read 10000000 rows and 6 (of 6) columns from 0.486 GB file in 00:00:44
  13.420s ( 31%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   3.210s (  7%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   1.310s (  3%) Allocation of 10000000x6 result (xMB) in RAM
  25.580s ( 59%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.040s (  0%) Changing na.strings to NA
  43.560s        Total

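Note: these timings were taken on my very slow netbook, which has no SSD. Both the absolute and the relative time of each step will vary widely from machine to machine. For example, if you rerun fread a second time you may notice that the memory map step takes much less time, because your OS has cached the file from the previous run.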

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            20
Model:                 2
Stepping:              0
CPU MHz:               800.000         # i.e. my slow netbook
BogoMIPS:              1995.01
Virtualisation:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
NUMA node0 CPU(s):     0,1
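
For reference, the tuned read.csv baseline mentioned above (colClasses= and nrows= supplied up front) might look roughly like the sketch below. It uses base R only. The file, its four columns and the 5-row sample size are assumptions made for illustration, not the original test.csv. Note that guessing classes from only the first rows can fail if later rows have a different type, which is presumably why the verbose output above shows fread sampling the first, middle and last rows.

```r
# Hypothetical small stand-in for test.csv (the original file is not available).
path <- tempfile(fileext = ".csv")
n <- 1000L
df <- data.frame(
  a = seq_len(n),                      # integer
  b = seq_len(n) * 2L,                 # integer
  c = seq_len(n) / 4,                  # numeric
  d = sprintf("id%04d", seq_len(n)),   # character
  stringsAsFactors = FALSE
)
write.csv(df, path, row.names = FALSE)

# Step 1: guess the column classes from a small sample of rows.
sample_rows <- read.csv(path, nrows = 5, stringsAsFactors = FALSE)
classes <- unname(vapply(sample_rows, function(col) class(col)[1], character(1)))

# Step 2: read the whole file with the classes fixed up front, plus a row
# count hint, so read.csv does not have to re-guess types over the full data.
tuned <- read.csv(path, colClasses = classes, nrows = n)
str(tuned)
```

Timing this with system.time() against fread on the same file would give a like-for-like comparison on your own machine.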

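read.csv() reads everything into one big character matrix and then modifies that. Does fread() do the same? In fastread we guess the column types and then coerce as we go, to avoid making a complete copy of the data frame. - hadley
@hadley No. Why would you think it might? What's fastread? - Matt Dowle
@hadley OK, I see your repo. Why are you doing that? - Matt Dowle
There seemed to be an obvious way (to us!) to reduce memory usage: instead of creating complete character vectors and then coercing them to numeric vectors, do the coercion as you go. - hadley
@hadley Why did you write "(to us!)"? What creation of complete character vectors are you talking about? Are you claiming that fread doesn't do the coercion as it goes? - Matt Dowle
Obvious to us doesn't mean it's obvious to everyone, or that it's correct. I'm making no claims about fread(). - hadley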
