Aggregating (counting) occurrences of values over an arbitrary timeframe

Question


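I have a CSV file with timestamps and the event types that occurred at those times. What I want is to count the occurrences of certain event types in 6-minute intervals.

The input data looks like this: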

date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"

I load and cure the data with this piece of code:

> raw_data <- read.csv('input.csv')
> cured_dates <- c(strptime(raw_data$date, '%b %d, %Y %H:%M:%S', tz="CEST"))
> cured_data <- data.frame(cured_dates, c(raw_data$type))
> colnames(cured_data) <- c('date', 'type')

The cured data looks like this:

> head(cured_data)
                 date type
1 2011-09-22 14:54:53    2
2 2011-09-22 14:54:53    2
3 2011-09-22 14:54:53    2
4 2011-09-22 14:54:53    2
5 2011-09-22 14:54:53    1
6 2011-09-22 14:54:53    1

I have read a lot of examples for xts and zoo, but somehow I still can't get the hang of it. The output data should look something like this:

date                       type   count
2011-09-22 14:54:00 CEST   1      11
2011-09-22 14:54:00 CEST   2      19
2011-09-22 15:00:00 CEST   1      9
2011-09-22 15:00:00 CEST   2      12
2011-09-22 15:06:00 CEST   1      23
2011-09-22 15:06:00 CEST   2      18

zoo's aggregate function looks promising, and I found this code snippet:

# aggregate POSIXct seconds data every 10 minutes
tt <- seq(10, 2000, 10)
x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
aggregate(x, time(x) - as.numeric(time(x)) %% 600, mean)

Now I'm just wondering how I could apply this to my use case.

Naive as I am, I tried:

> zoo_data <- zoo(cured_data$type, structure(cured_data$time, class = c("POSIXt", "POSIXct")))
> aggr_data = aggregate(zoo_data$type, time(zoo_data$time), - as.numeric(time(zoo_data$time)) %% 360, count)
Error in `$.zoo`(zoo_data, type) : not possible for univariate zoo series

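I must admit that I'm not very confident in R, but I try. :-)

I'm kind of lost. Could anyone point me in the right direction?

Thanks a lot! Cheers, Alex.

Here is the dput output for a small subset of my data. The data itself is about 80 million rows.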

structure(list(date = structure(c(1316697885, 1316697885, 1316697885, 
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 
1316697885, 1316697885), class = c("POSIXct", "POSIXt"), tzone = ""), 
    type = c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L)), .Names = c("date", 
"type"), row.names = c(NA, -23L), class = "data.frame")

- Alexander Janssen

First of all, thanks a lot for all the replies so far! I'll now go through the different hints you gave me and let you know how far I got. - Alexander Janssen

2 Answers

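You are almost there. All you need to do now is create a zoo version of your data and map it onto the aggregate.zoo code. Because you want to classify by both time and type, the second argument to aggregate.zoo has to be a bit more complex, and because you want counts rather than means, you should use length(). I don't think count is a base R or zoo function; the only count function I see in my workspace comes from pkg:plyr, so I don't know how well it would play with aggregate.zoo. length works the way most people expect on vectors, but it often surprises people when applied to data frames. If length doesn't give you what you want, see whether NROW works instead (with your data layout, both succeed). With the new data object, the type argument has to come first. It also turns out that aggregate/zoo only handles single-category classifiers, so you need as.vector to strip the zoo-ness: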

with(cured_data, 
     aggregate(as.vector(x), list(type = type, 
                                   interval=as.factor(time(x) - as.numeric(time(x)) %% 360)),
                             FUN=NROW) 
 )

#  interval            x 
#1 2011-09-22 09:24:00 12
#2 2011-09-22 09:24:00 11
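
For readers who want to try the same idea without zoo, here is a minimal base R sketch. It is not the code from this answer: the events data frame, its four timestamps, and the UTC time zone are all hypothetical, chosen only so that the result is easy to check by hand. It floors each timestamp to a 360-second bin and counts rows per bin and type.

```r
# Hypothetical sample: three events in the 14:54 bin, one in the 15:00 bin.
# tz = "UTC" is an assumption made only to keep the result reproducible.
events <- data.frame(
  date = as.POSIXct(c("2011-09-22 14:54:53", "2011-09-22 14:55:10",
                      "2011-09-22 14:59:59", "2011-09-22 15:00:00"), tz = "UTC"),
  type = c(2L, 1L, 2L, 1L)
)

# Floor each timestamp to the start of its 360-second bin.
secs <- as.numeric(events$date)
events$interval <- as.POSIXct(secs - secs %% 360, origin = "1970-01-01", tz = "UTC")

# Count rows per (interval, type); n is a dummy column whose length is the count.
events$n <- 1L
counts <- aggregate(n ~ interval + type, data = events, FUN = length)
print(counts)
```

The bins are aligned to multiples of 360 seconds since the epoch, which is why 14:54:00 and 15:00:00 come out as bin starts. Combinations that never occur (here, type 2 in the 15:00 bin) simply do not appear in the result.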

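This is an example modified from the one your code came from (an example on SO by WizaRd Dirk): "Aggregating (counting) occurrences of values over an arbitrary timeframe"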

tt <- seq(10, 2000, 10)
x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
aggregate(as.vector(x), by=list(cat=as.factor(x), 
     tms = as.factor(index(x) - as.numeric(index(x)) %% 600)), length)

   cat                 tms  x
1    1 1969-12-31 19:00:00 26
2    2 1969-12-31 19:00:00 22
3    3 1969-12-31 19:00:00 11
4    1 1969-12-31 19:10:00 17
5    2 1969-12-31 19:10:00 28
6    3 1969-12-31 19:10:00 15
7    1 1969-12-31 19:20:00 17
8    2 1969-12-31 19:20:00 16
9    3 1969-12-31 19:20:00 27
10   1 1969-12-31 19:30:00  8
11   2 1969-12-31 19:30:00  4
12   3 1969-12-31 19:30:00  9

- IRTFM

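Hey, this looks good so far, but it only shows me the aggregated data for type=1: https://gist.github.com/8049f54780cf0f18147b Hmmm! I'll dig deeper into it. - Alexander Janssen

Present your data better and you will get faster, better, tested answers. Take a look at the function dput. - IRTFM

Sorry for not being precise enough; I appreciate your effort. I added the dput output to my original post. - Alexander Janssen

All of those "dates" are identical. I thought you wanted some sort of series calculation? I have been taking you at your word that the aggregate code worked as you wanted, but you never cited where it came from. - IRTFM

Yes, but only in this part of the series. The whole time range is about 45 minutes (it's login/logout statistics from an ISP), with about 200-400 rows per second, each of which is either type 1 (RADIUS Accounting-Request Start) or type 2 (RADIUS Accounting-Request Stop). I want to know how many starts and stops there are per 6 minutes. The original data has timestamps with microsecond precision. - Alexander Janssen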

- G. Grothendieck · Accepted Answer

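We can read it using read.csv, convert the first column to a date-time binned into 6-minute intervals, and add a dummy column of 1's. Then re-read it using read.zoo, splitting on the type and aggregating over the dummy column: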

# test data

Lines <- 'date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"
"Sep 22, 2011 12:54:53.081240000","3"
"Sep 22, 2011 12:54:53.083493000","3"
"Sep 22, 2011 12:54:53.084025000","3"
"Sep 22, 2011 12:54:53.086493000","4"'

library(zoo)
library(chron)

# convert to chron and bin into 6 minute bins using trunc
# Also add a dummy column of 1's 
# and remove any leading space (removing space not needed if there is none)

DF <- read.csv(textConnection(Lines), as.is = TRUE)
fmt <- '%b %d, %Y %H:%M:%S'
DF <- transform(DF, dummy = 1,
         date = trunc(as.chron(sub("^ *", "", date), format = fmt), "00:06:00"))

# split and aggregate

z <- read.zoo(DF, split = 2, aggregate = length)

With the test data above, the result looks like this:

> z
                    2 3 4
(09/22/11 12:54:00) 4 3 1

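Note that the above was done in wide format, since that form constitutes a time series whereas the long form does not. There is one column per type. Our test data has types 2, 3 and 4, so there are three columns.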
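
If the long layout from the question (one row per interval and type) is wanted after all, a wide table of counts can be stacked with base R. The sketch below is only an illustration: the matrix wide is a hypothetical stand-in for the counts, its first row mirrors the result above, and its second row is made up to show how empty cells are handled.

```r
# Hypothetical wide table of counts: one row per 6-minute bin, one column per type.
# The first row mirrors the result above; the second row is invented for illustration.
wide <- matrix(c(4L, 3L, 1L,
                 0L, 2L, 5L),
               nrow = 2, byrow = TRUE,
               dimnames = list(interval = c("2011-09-22 12:54:00",
                                            "2011-09-22 13:00:00"),
                               type = c("2", "3", "4")))

# as.table + as.data.frame stacks the cells into (interval, type, count) rows.
long <- as.data.frame(as.table(wide), responseName = "count",
                      stringsAsFactors = FALSE)

# Drop empty cells and sort by interval, then type.
long <- long[long$count > 0, ]
long <- long[order(long$interval, long$type), ]
print(long)
```

Dropping the zero cells is a design choice: keep them if a missing combination should show up explicitly as a count of 0.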

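We use chron because its trunc method fits well with binning into 6-minute groups. chron does not support time zones, but that can be an advantage, since it avoids the many time-zone errors that are otherwise possible. If you need POSIXct, convert at the end, e.g. time(z) <- as.POSIXct(paste(as.Date.dates(time(z)), times(time(z)) %% 1)) . This expression is shown in a table in one of the R News 4/1 articles, except that we used as.Date.dates instead of as.Date to work around a bug that seems to have been introduced since then. We could also use time(z) <- as.POSIXct(time(z)), but that would result in a different time zone.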
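
To illustrate why the time zone matters once POSIXct is involved, here is a small base R sketch. It takes the epoch value from the question's dput output, floors it to its 6-minute bin, and prints the same instant under two time zones; the choice of "UTC" and "Europe/Berlin" is an assumption made for illustration only.

```r
# Epoch seconds taken from the dput output in the question.
stamp <- 1316697885

# Floor to the start of the 360-second bin.
bin_start <- as.POSIXct(stamp - stamp %% 360, origin = "1970-01-01", tz = "UTC")

# The same instant, labelled in two different time zones (both assumed here).
label_utc    <- format(bin_start, "%Y-%m-%d %H:%M:%S", tz = "UTC")
label_berlin <- format(bin_start, "%Y-%m-%d %H:%M:%S", tz = "Europe/Berlin")

print(label_utc)
print(label_berlin)
```

The bin itself does not change; only its printed label does, which is why bin labels can look shifted when the session's time zone differs from the one the data was recorded in.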

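EDIT:

The original solution binned the data into dates, but I later noticed that you wanted 6-minute bins, so the solution was revised.

EDIT:

Revised based on a comment.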