data.table fread and ISO8601

Question

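I feel a bit foolish, because I can't see what is wrong...

According to the NEWS file, fread correctly recognizes ISO 8601 timestamps such as 2020-07-24T10:11:12.134Z (since version 1.13.0). In practice, it does not: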

fread(text=c("now","2020-07-24T10:11:12.134Z"), colClasses="POSIXct", sep=",")
#           now
#        <POSc>
# 1: 2020-07-24

However, if I change the T to a space, the correct timestamp is returned:

fread(text=c("now","2020-07-24 10:11:12.134Z"), colClasses="POSIXct", sep=",")
#                    now
#                 <POSc>
# 1: 2020-07-24 10:11:12

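The problem persists whether I use tz = "" or tz = "UTC". (Unsurprisingly, if I omit colClasses=, it does not even attempt the conversion.)

What am I doing wrong that keeps the internal, faster POSIXct converter from working? I know how to do the conversion after reading if I have to, but for large files post-processing with as.POSIXct is very slow.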
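For reference, the post-read fallback mentioned above might look like the following minimal sketch. It uses only base R. The vector x is a hypothetical stand-in for a column that fread returned as character, and the format string assumes every value ends in a literal Z (that is, UTC).

```r
# Hypothetical character column, as returned when automatic detection fails.
x <- c("2020-07-24T10:11:12.134Z", "2020-07-24T10:11:13.500Z")

# On input, %OS accepts fractional seconds; the trailing "Z" is matched
# literally, so the time zone is supplied explicitly as UTC.
now <- as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# Show fractional seconds (the display may truncate the last digit).
format(now, "%Y-%m-%d %H:%M:%OS3")
```

This is the slower path the question is trying to avoid; it is shown only to make the comparison concrete.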

(Windows-11, R-4.1.2, data.table-1.14.2)

In case it is of interest, verbose = TRUE does not seem to offer much insight:

fread(text=c("now","2020-07-24T10:11:12.134Z"), colClasses="POSIXct", sep=",", verbose=TRUE)
#   OpenMP version (_OPENMP)       201511
#   omp_get_num_procs()            16
#   R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
#   R_DATATABLE_NUM_THREADS        unset
#   R_DATATABLE_THROTTLE           unset (default 1024)
#   omp_get_thread_limit()         2147483647
#   omp_get_max_threads()          16
#   OMP_THREAD_LIMIT               unset
#   OMP_NUM_THREADS                unset
#   RestoreAfterFork               true
#   data.table is using 8 threads with throttle==1024. See ?setDTthreads.
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
#   Using 8 threads (omp_get_max_threads()=16, nth=8)
#   NAstrings = [<<NA>>]
#   None of the NAstrings look like numbers.
#   show progress = 1
#   0/1 column will be read as integer
# [02] Opening the file
#   Opening file C:\Users\r2\AppData\Local\Temp\Rtmpao7n9S\file49384a01388a
#   File opened, size = 31 bytes.
#   Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
#   \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
# [05] Skipping initial rows if needed
#   Positioned on line 1 starting: <<now>>
# [06] Detect separator, quoting rule, and ncolumns
#   Using supplied sep ','
#   No sep and quote rule found a block of 2x2 or greater. Single column input.
#   Detected 1 columns on line 1. This line is either column names or first data row. Line starts as: <<now>>
#   Quote rule picked = 0
#   fill=false and the most number of columns found is 1
# [07] Detect column types, good nrow estimate and whether first row is column names
#   Number of sampling jump points = 1 because (29 bytes from row 1 to eof) / (2 * 29 jump0size) == 0
#   Type codes (jump 000)    : C  Quote rule 0
#   'header' determined to be true because all columns are type string and a better guess is not possible
#   All rows were sampled since file is small so we know nrow=1 exactly
# [08] Assign column names
# [09] Apply user overrides on column types
#   After 0 type and 0 drop user overrides : C
# [10] Allocate memory for the datatable
#   Allocating 1 column slots (1 - 0 dropped) with 1 rows
# [11] Read the data
#   jumps=[0..1), chunk_size=1048576, total_size=24
# Read 1 rows x 1 columns from 31 bytes file in 00:00.000 wall clock time
# [12] Finalizing the datatable
#   Type counts:
#          1 : string    'C'
# =============================
#    0.000s (  0%) Memory map 0.000GB file
#    0.000s (  0%) sep='' ncol=1 and header detection
#    0.000s (  0%) Column type detection using 1 sample rows
#    0.000s (  0%) Allocation of 1 rows x 1 cols (0.000GB) of which 1 (100%) rows used
#    0.000s (  0%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 1 rows) using 1 threads
#    +    0.000s (  0%) Parse to row-major thread buffers (grown 0 times)
#    +    0.000s (  0%) Transpose
#    +    0.000s (  0%) Waiting
#    0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
#    0.000s        Total
#           now
#        <POSc>
# 1: 2020-07-24

fread(text=c("now","2020-07-24 10:11:12.134Z"), colClasses="POSIXct", sep=",", verbose=TRUE)
#   OpenMP version (_OPENMP)       201511
#   omp_get_num_procs()            16
#   R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
#   R_DATATABLE_NUM_THREADS        unset
#   R_DATATABLE_THROTTLE           unset (default 1024)
#   omp_get_thread_limit()         2147483647
#   omp_get_max_threads()          16
#   OMP_THREAD_LIMIT               unset
#   OMP_NUM_THREADS                unset
#   RestoreAfterFork               true
#   data.table is using 8 threads with throttle==1024. See ?setDTthreads.
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
#   Using 8 threads (omp_get_max_threads()=16, nth=8)
#   NAstrings = [<<NA>>]
#   None of the NAstrings look like numbers.
#   show progress = 1
#   0/1 column will be read as integer
# [02] Opening the file
#   Opening file C:\Users\r2\AppData\Local\Temp\Rtmpao7n9S\file493817cf4117
#   File opened, size = 31 bytes.
#   Memory mapped ok
# [03] Detect and skip BOM
# [04] Arrange mmap to be \0 terminated
#   \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
# [05] Skipping initial rows if needed
#   Positioned on line 1 starting: <<now>>
# [06] Detect separator, quoting rule, and ncolumns
#   Using supplied sep ','
#   No sep and quote rule found a block of 2x2 or greater. Single column input.
#   Detected 1 columns on line 1. This line is either column names or first data row. Line starts as: <<now>>
#   Quote rule picked = 0
#   fill=false and the most number of columns found is 1
# [07] Detect column types, good nrow estimate and whether first row is column names
#   Number of sampling jump points = 1 because (29 bytes from row 1 to eof) / (2 * 29 jump0size) == 0
#   Type codes (jump 000)    : C  Quote rule 0
#   'header' determined to be true because all columns are type string and a better guess is not possible
#   All rows were sampled since file is small so we know nrow=1 exactly
# [08] Assign column names
# [09] Apply user overrides on column types
#   After 0 type and 0 drop user overrides : C
# [10] Allocate memory for the datatable
#   Allocating 1 column slots (1 - 0 dropped) with 1 rows
# [11] Read the data
#   jumps=[0..1), chunk_size=1048576, total_size=24
# Read 1 rows x 1 columns from 31 bytes file in 00:00.000 wall clock time
# [12] Finalizing the datatable
#   Type counts:
#          1 : string    'C'
# =============================
#    0.000s (  0%) Memory map 0.000GB file
#    0.000s (  0%) sep='' ncol=1 and header detection
#    0.000s (  0%) Column type detection using 1 sample rows
#    0.000s (  0%) Allocation of 1 rows x 1 cols (0.000GB) of which 1 (100%) rows used
#    0.000s (  0%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 1 rows) using 1 threads
#    +    0.000s (  0%) Parse to row-major thread buffers (grown 0 times)
#    +    0.000s (  0%) Transpose
#    +    0.000s (  0%) Waiting
#    0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
#    0.000s        Total
#                        now
#                     <POSc>
# 1: 2020-07-24 10:11:12.134

This behavior does not change when using file= instead of text=.

- r2evans

That is strange, because it works fine for me. See the code below:

library(data.table)
fread(text=c("now","2020-07-24T10:11:12.134Z"), colClasses="POSIXct", sep=",")
#>                    now
#> 1: 2020-07-24 10:11:12
#>
#>  data.table  * 1.14.2  2021-09-27 [1] CRAN (R 4.1.2)

Edit: sorry for the poor formatting. Edit 2: it works whether or not the colClasses argument is specified. - Daniel Molitor

I tried it too, but it only worked after I removed the colClasses argument (i.e. fread(text=c("now","2020-07-24T10:11:12.134Z"), sep=",")). - langtang

Sorry! It works for me both with and without colClasses. My mistake! - langtang

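Oh, now that is just frustrating... OK, thanks to both of you. Evidently something is wrong with my Windows setup, because I just started a rocker/shiny-verse:4.1.2 Docker instance and cannot reproduce the problem there... - r2evans

I was hoping others would be able to reproduce this, but I think I will file it as a bug report... - r2evans

@langtang, found it, thanks for your help. - r2evans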

1 Answer

- r2evans · Accepted Answer

I am not sure whether this is by design, but the culprit is keepLeadingZeros=TRUE, an option I had set for other reasons.

withr::with_options(
  list(datatable.keepLeadingZeros=FALSE), 
  fread(text=c("now","2020-07-24T10:11:12.134Z"), sep=",")
)
#                        now
#                     <POSc>
# 1: 2020-07-24 10:11:12.134

withr::with_options(
  list(datatable.keepLeadingZeros=TRUE), 
  fread(text=c("now","2020-07-24T10:11:12.134Z"), sep=",")
)
#                         now
#                      <char>
# 1: 2020-07-24T10:11:12.134Z

In hindsight, this duplicates an existing issue, https://github.com/Rdatatable/data.table/issues/4869, "keepLeadingZeros interferes with date recognition".
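If the global option has to stay on for other reasons, one possible workaround is to override it for a single call through fread's own keepLeadingZeros argument, which defaults to the value of the datatable.keepLeadingZeros option. The following is a minimal sketch, assuming a data.table version (1.13.0 or later) that parses ISO 8601 timestamps; the options() call only simulates the problematic session setting.

```r
library(data.table)

# Simulate the session-wide setting that triggered the problem.
old <- options(datatable.keepLeadingZeros = TRUE)

# An explicit argument takes precedence over the option, for this call only.
dt <- fread(text = c("now", "2020-07-24T10:11:12.134Z"), sep = ",",
            keepLeadingZeros = FALSE)
class(dt$now)

# Restore whatever value the option had before.
options(old)
```

Columns that genuinely need their leading zeros preserved would then have to be read as character by some other means, for example through colClasses.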

As a reminder to others and to my future self, the way I found this was to start from R --vanilla --no-init --no-save, install data.table, and then start testing:

### in "failing" environment:
opts <- options()
opts <- opts[ !sapply(opts, inherits, c("list", "function")) ]
dput(opts) # paste into the fresh R instance as opts2

### in the "fresh" environment:
# opts2 <- structure(...) # 'opts' from above
opts <- options()
opts <- opts[ !sapply(opts, inherits, c("list", "function")) ]
# show options that are set in the failing session but absent here
str(opts2[ setdiff(names(opts2), names(opts)) ])

Then enable the options one at a time until the automatic conversion fails.
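The one-at-a-time search can also be scripted. The sketch below is an illustration under stated assumptions, not the exact procedure used above: opts2 is a small hypothetical stand-in for the options copied from the failing session, converts() is a helper introduced here, and each option is tested in isolation rather than cumulatively.

```r
library(data.table)

# Hypothetical stand-in for the options pasted from the failing session.
opts2 <- list(datatable.keepLeadingZeros = TRUE, digits = 10L)

# TRUE when fread auto-detects the ISO 8601 column as POSIXct.
converts <- function() {
  dt <- fread(text = c("now", "2020-07-24T10:11:12.134Z"), sep = ",")
  inherits(dt$now, "POSIXct")
}

culprits <- character(0)
for (nm in names(opts2)) {
  old <- options(opts2[nm])  # apply a single option from the failing session
  ok <- converts()
  options(old)               # restore the previous value before the next test
  if (!ok) culprits <- c(culprits, nm)
}
culprits  # names of options that, on their own, break the conversion
```

Testing options in isolation will miss a failure that needs two options together; in that case they would have to be applied cumulatively, as described above.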