我尝试了一种使用
embed
和
data.table
的方法。使用提供的示例数据进行测试时,它与其他
data.table
方法相比具有竞争力(请参见下面的基准测试),但仍然稍慢一些。当扩展到其他滞后时,
embed
方法可能会更快,但我不确定这是否相关。
无论如何,我整理了答案(截至目前),并比较了时间和输出。我不知道精确的输出对您有多大影响(例如,我在基准测试中失去了一些时间,因为我必须转储RET列),但请注意,不同的答案在输出格式/内容上略有不同。所有方法都提供类似于所需输出格式的结果。
我想知道不同的方法是否针对不同大小的数据框按比例缩放...如果您测试了这些方法,我很想知道哪个对您和您的数据最快! :)
数据和库
library("data.table")
library("sqldf")
library("microbenchmark")
DATE <- c("01.01.2000", "02.01.2000", "03.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000", "01.01.2000", "02.01.2000", "04.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df0 <- data.frame(DATE, RET, COMP)
DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event0 <- data.frame(DATE, ARTICLE, COMP)
rbatt(本回答)
m2l <- function(x) split(x, rep(1:ncol(x), each = nrow(x)))
e2 <- function(x, d=1) m2l(rbind(matrix(NA, ncol=d, nrow=d-1), embed(x,d)))
testRB <- function(df=df0, event=event0){
dt1 <- as.data.table(df)
dt1[,DATE:=as.character(DATE)]
dt1[,c("DATEp1","DATE","DATEm1"):=e2(DATE,3),by=COMP]
dt1[,RET:=NULL]
setkey(dt1, COMP, DATE, DATEp1, DATEm1)
dt2 <- as.data.table(event)
dt2[,DATE:=as.character(DATE)]
setkey(dt2,COMP,DATE)
dt <- dt1[dt2]
dt
}
rbatt 输出:
编辑 - DA优化 #1 (旧代码已注释)
编辑 - DA优化 #2 (旧代码已注释,版本标记)
# 2) DATE、DATEm1、DATEp1 的格式
testDA <- function(df=df0, event=event0){
tempdf <- data.table(df, key = c("COMP", "DATE"))$DATE
setDT(event)[order(COMP, DATE), `:=` (
DATEm1 = tempdf[match(DATE, tempdf) - 1],
DATEp1 = tempdf[match(DATE, tempdf) + 1]
)]
event
}
David Arenburg输出:
经过DA优化#1进行编辑(#2可能存在错误)
注意第7行列“DATEm1”中的错误内容,月份应为04
testGG <- function(df=df0, event=event0){
df2 <- transform(df, DATE = format(as.Date(DATE, "%m.%d.%Y")))
event2 <- transform(event, DATE = format(as.Date(DATE, "%m.%d.%Y")))
result <- sqldf(c("create index i on df2(COMP, DATE)",
"select
event.DATE,
max(A.DATE) DATEm1,
min(B.DATE) DATEp1,
event.ARTICLE,
event.COMP
from event2 event, main.df2 A, main.df2 B
on event.COMP = A.COMP and event.COMP = B.COMP
and event.DATE > A.DATE and event.DATE < B.DATE
group by event.DATE, event.COMP
order by event.COMP, event.DATE"))
result
}
GG输出:
testAR <- function(df=df0, event=event0){
dt1 = as.data.table(df)
dt2 = as.data.table(event)
key_cols = c("COMP", "DATE")
setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
setkeyv(dt1, key_cols)
idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
idx2 = dt1[dt2, which=TRUE, mult="last"]+1L
idx1[idx1 == 0L] = NA
dt2[, `:=`(DATEm1 = dt1$DATE[idx1],
DATEp1 = dt1$DATE[idx2]
)]
dt2
}
Arun的输出:
基准测试
编辑 - 请注意,这是原始基准测试(原始代码,原始 OP 数据集)
microbenchmark(testAR(), testDA(), testRB(), testGG())
修改:更大数据集的基准测试
请注意,我在此基准测试中删除了testGG()
,因为它要慢得多(我对几个中间数据集进行了一些测试,并且tetGG()
比其他三种方法的扩展性差)。
mos <- c("01","02","03","06","07","09","10", "01", "02", "04", "06", "07", "09", "10")
yrs <- 1920:2020
DATE <- paste(mos, "01", rep(yrs, each=length(mos)), sep=".")
RET <- rep(c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12), length(yrs))
COMP <- rep(c("A","A","A","A","A","A","A","B","B","B","B","B","B","B"), length(yrs))
df0 <- data.frame(DATE, RET, COMP)
mos2 <- c("02","03","06","09","06","07","09")
DATE <- paste(mos2, "01", rep(yrs, each=length(mos2)), sep=".")
ARTICLE <- rep(c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23"), length(yrs))
COMP <- rep(c("A","A","A","A","B","B","B"), length(yrs))
event0 <- data.frame(DATE, ARTICLE, COMP)
编辑 - 大型数据集的原始基准测试:
编辑 - DA 优化后大数据集的基准测试#1:
编辑- DA 优化后大数据集的基准测试 #2:
注 - 来自 testDA() 的更新 #2 警告:
对大数据集进行内存和时间分析,每次迭代50次
代码分析
Rprof("testAR.out", memory.profiling=TRUE)
for(i in 1:50){
arAns <- testAR()
}
Rprof(NULL)
Rprof("testDA.out", memory.profiling=TRUE)
for(i in 1:50){
daAns <- testDA()
}
Rprof(NULL)
Rprof("testRB.out", memory.profiling=TRUE)
for(i in 1:50){
rbAns <- testRB()
}
Rprof(NULL)
testAR() 个人资料结果
testDA() 的配置结果
testRB()配置文件结果
结论
就我所知,Arun的答案是最快且最节省内存的。与DA的答案相比,rbatt的答案在数据集大小方面更具可扩展性-我的初步猜测是使用POSIX或Date类的方法可能不会很好地扩展,但我不确定这个直觉是否得到了剖析结果的支持。如果有人认为提供完整的剖析结果而不仅仅是$by.self
部分可能有帮助,我可以提供。
值得注意的是,各种方法所花费的时间和所使用的内存之间存在负相关关系-最快的方法使用的内存最少。