Count instances of multiple variables in R

Question

4

I have a large data table, Divvy (more than 2.4 million records), that looks like this (some columns removed):

X   trip_id     from_station_id.x   to_station_id.x 
 1  1109420     94                  69
 2  1109421     69                  216
 3  1109427     240                 245
 4  1109431     113                 94
 5  1109433     127                 332
 3  1109429     240                 245

I would like to find the number of trips from each station to each other station. For example,

From X     To Y     Sum
94         69       1
240        245      2

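and so on, and then use dplyr to join it back to the initial table to produce something like the following. I would then limit it to distinct from/to station combinations, which I will use to map routes (each station has a latitude/longitude):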

X   trip_id     from_station_id.x   to_station_id.x   Sum 
 1  1109420     94                  69                1
 2  1109421     69                  216               1
 3  1109427     240                 245               2
 4  1109431     113                 94                1
 5  1109433     127                 332               1
 3  1109429     240                 245               2

I have successfully used count to get some of this, for example:

count(Divvy$from_station_id.x==94 & Divvy$to_station_id.x == 69)
  x    freq
1 FALSE 2454553
2  TRUE      81

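But this obviously takes a lot of manual work, since there are 300 unique stations and so there could be more than 44k combinations. I created a helper table, thinking I could loop over it.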

n <- select(Divvy, from_station_id.x)

  from_station_id.x 
1                94                
2                69                
3               240               
4               113               
5               113               
6               127               

   count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1])

      x    freq
1 FALSE 2454553
2  TRUE      81

I feel like a loop, something like this:

output <- matrix(ncol=variables, nrow=iterations)


output <- matrix()
for(i in 1:n)(output[i, count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1]))

should work, but thinking about it, that would still only return 300 rows rather than 44k, so it would have to loop back and do n[2] & n[1], and so on...

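I also feel there may be a quicker dplyr solution that would return the count for each combination and append it directly, without the extra steps/table creation, but I haven't found it.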

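I am fairly new to R. I have searched around and think I am close, but I can't quite connect that result back to Divvy. Any help would be appreciated.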

- ike

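I tried all three solutions, and I have to say they all produced the sums correctly and worked great. I chose the dplyr option as the "best" one because it gives me the limited number of rows I wanted, but I think the data.table option may be the most elegant. - ike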

Also, if anyone else wants to view/use the original dataset, it can be found here: http://www.divvybikes.com/data - ike

3 Answers

4

Since you said "limit it to distinct from/to station combinations", the following code seems to give you what you need. Your data is called mydf.

library(dplyr)
# count() groups by the listed columns itself, so a separate group_by() is not needed
mydf %>%
  count(from_station_id.x, to_station_id.x)

#  from_station_id.x to_station_id.x n
#1                69             216 1
#2                94              69 1
#3               113              94 1
#4               127             332 1
#5               240             245 2

- jazzurro
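
The question also asks for the count to be attached to every trip in the original table. One possible way to do that, sketched here on the six sample rows from the question, is to join the per-pair counts back with left_join(). The object names pair_counts and with_sum are made up for this illustration, and the name argument of count() assumes a reasonably recent dplyr (0.8.0 or later).

```r
library(dplyr)

# Sample rows copied from the question; `mydf` is the name used in the answer above
mydf <- data.frame(
  X = c(1, 2, 3, 4, 5, 3),
  trip_id = c(1109420, 1109421, 1109427, 1109431, 1109433, 1109429),
  from_station_id.x = c(94, 69, 240, 113, 127, 240),
  to_station_id.x = c(69, 216, 245, 94, 332, 245)
)

# One row per origin/destination pair, with its trip count in a column called Sum
pair_counts <- mydf %>%
  count(from_station_id.x, to_station_id.x, name = "Sum")

# Attach the pair count to every trip; left_join() keeps the row order of mydf
with_sum <- mydf %>%
  left_join(pair_counts, by = c("from_station_id.x", "to_station_id.x"))
```

Here pair_counts is the de-duplicated table that could feed the route map, while with_sum keeps one row per trip. Recent dplyr versions also provide add_count(), which should give the per-trip version in a single step.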

I ended up using the following code: counts4 <- group_by(divvydata, trip_id, from_station_id.x, to_station_id.x) %>% count(from_station_id.x, to_station_id.x, From_Station_Lat, From_Station_Long, End_Station_Lat, End_Station_Long) - ike

1

@ike, I'm glad you found your own solution based on this suggestion. :) - jazzurro

3

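I'm not sure this is exactly the result you expect, but it counts the number of trips that share the same origin and destination. If it isn't quite what you want, feel free to comment and let me know.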

dat <- read.table(text="X   trip_id     from_station_id.x   to_station_id.x 
 1  1109420     94                  69
 2  1109421     69                  216
 3  1109427     240                 245
 4  1109431     113                 94
 5  1109433     127                 332
 3  1109429     240                 245", header=TRUE)

dat$from.to <- paste(dat$from_station_id.x, dat$to_station_id.x, sep="-")
freqs <- as.data.frame(table(dat$from.to))
names(freqs) <- c("from.to", "sum")
dat2 <- merge(dat, freqs, by="from.to")
dat2 <- dat2[order(dat2$trip_id),-1]

Result

dat2

#   X trip_id from_station_id.x to_station_id.x sum
# 6 1 1109420                94              69   1
# 5 2 1109421                69             216   1
# 3 3 1109427               240             245   2
# 4 3 1109429               240             245   2
# 1 4 1109431               113              94   1
# 2 5 1109433               127             332   1

- Dominic Comtois
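
As a side note to the paste/table/merge approach above, base R's ave() may get the same per-row count without building a key column or merging. This is only a sketch of that alternative on the sample rows, not the answer author's method. The first argument to ave() is just a vector of the right length, since FUN = length ignores its values.

```r
# Sample rows copied from the question
dat <- data.frame(
  X = c(1, 2, 3, 4, 5, 3),
  trip_id = c(1109420, 1109421, 1109427, 1109431, 1109433, 1109429),
  from_station_id.x = c(94, 69, 240, 113, 127, 240),
  to_station_id.x = c(69, 216, 245, 94, 332, 245)
)

# For every row, the number of rows sharing the same origin/destination pair.
# Row order is left unchanged, so no re-sorting by trip_id is needed afterwards.
dat$sum <- ave(seq_len(nrow(dat)),
               dat$from_station_id.x, dat$to_station_id.x,
               FUN = length)
```

On a table with millions of rows and hundreds of stations this may be slower than the dplyr or data.table versions, so it is probably best suited to smaller data.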

This works really well, thanks. I treated it as a read.csv, though, so that I could import the file directly and skip some of the other steps. Thanks. - ike

- Metrics · Accepted Answer

#Here is the data.table solution, which is useful if you are working with large data: 
library(data.table)
setDT(DF)[,sum:=.N,by=.(from_station_id.x,to_station_id.x)][] #DF is your dataframe

   X trip_id from_station_id.x to_station_id.x sum
1: 1 1109420                94              69   1
2: 2 1109421                69             216   1
3: 3 1109427               240             245   2
4: 4 1109431               113              94   1
5: 5 1109433               127             332   1
6: 3 1109429               240             245   2
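
If only the distinct from/to combinations are needed (one row per pair, as the question asks for the route map), a summarising call in j rather than := might look like the sketch below. It assumes the data.table package is installed; DT and pair_counts are names made up for this illustration.

```r
library(data.table)

# Sample rows copied from the question
DT <- data.table(
  X = c(1, 2, 3, 4, 5, 3),
  trip_id = c(1109420, 1109421, 1109427, 1109431, 1109433, 1109429),
  from_station_id.x = c(94, 69, 240, 113, 127, 240),
  to_station_id.x = c(69, 216, 245, 94, 332, 245)
)

# .N is the number of rows in each group; using .() in j returns one row per
# group instead of adding a column to the original table by reference
pair_counts <- DT[, .(sum = .N), by = .(from_station_id.x, to_station_id.x)]
```

Unlike the := form above, this leaves DT untouched and returns a new, smaller table with groups in order of first appearance.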