How to generate auto-incrementing IDs in R

Question

12

I'm looking for an efficient way to create unique numeric IDs for some synthetic data I'm generating.

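Right now, I simply have a function that emits and increments a value held in a global variable (see the demo code below). This is messy because I have to initialize the idCounter variable, and I'd rather not use global variables if possible.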

# Emit SSN
idCounter = 0
emitID = function(){
  # Turn into a formatted string
  id = formatC(idCounter,width=9,flag=0,format="d")

  # Increment id counter
  idCounter <<- idCounter+1

  return(id)
}
record$id = emitID()

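The uuid package provides the functionality I need, but I need the IDs to be integers only. Any suggestions? Perhaps there is a way to convert a UUID value into some kind of numeric value? Obviously some collisions would occur, but that would probably be OK. I think I would need at most 1 billion values.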

Thanks for any suggestions!

- Rob

2

Contradictory requirements: "unique" and "obviously some collisions would occur, but that would probably be OK". - Mitch Wheat

2 Answers

5

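I like to use the proto package for small-scale object-oriented programming. Under the hood, it uses environments in much the same way as Martin Morgan's answer does.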

# this defines your class
library(proto)
Counter <- proto(idCounter = 0L)
Counter$emitID <- function(self = .) {
   id <- formatC(self$idCounter, width = 9, flag = 0, format = "d")
   self$idCounter <- self$idCounter + 1L
   return(id)
}

# This creates an instance (or you can use `Counter` directly as a singleton)
mycounter <- Counter$proto()

# use it:
mycounter$emitID()
# [1] "000000000"
mycounter$emitID()
# [1] "000000001"
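
Since proto builds on environments, the same idea can be sketched in base R without the package. This is only an illustration under that assumption: `new_counter` is a hypothetical helper name, not part of proto or of the original answer.

```r
# Hypothetical helper: a counter object backed by a plain environment.
new_counter <- function(start = 0L) {
  self <- new.env()
  self$idCounter <- start
  self$emitID <- function() {
    # Format the current value as a zero-padded, 9-character string
    id <- formatC(self$idCounter, width = 9, flag = "0", format = "d")
    # Environments have reference semantics, so this updates the shared state
    self$idCounter <- self$idCounter + 1L
    id
  }
  self
}

mycounter <- new_counter()
first  <- mycounter$emitID()   # "000000000"
second <- mycounter$emitID()   # "000000001"
```

Each call to `new_counter()` returns a fresh environment, so separate counters do not share state.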

- flodel

- Martin Morgan · Accepted Answer

A non-global version of the counter uses lexical scope to encapsulate idCounter together with the increment function.

emitID <- local({
    idCounter <- -1L
    function(){
        idCounter <<- idCounter + 1L                     # increment
        formatC(idCounter, width=9, flag=0, format="d")  # format & return
    }
})

Then:

> emitID()
[1] "000000000"
> emitID()
[1] "000000001"
> idCounter <- 123   ## global variable, not locally scoped idCounter
> emitID()
[1] "000000002"
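
To make the scoping visible, the counter's private state can be read from the closure's enclosing environment with `environment()`. The following is a minimal sketch that repeats the definition above so it runs on its own; the names `private_value` and `visible_value` are introduced here for illustration only.

```r
emitID <- local({
    idCounter <- -1L
    function() {
        idCounter <<- idCounter + 1L                         # updates the enclosed variable
        formatC(idCounter, width = 9, flag = "0", format = "d")
    }
})

invisible(emitID())   # emits "000000000"
invisible(emitID())   # emits "000000001"

idCounter <- 123L     # an unrelated variable outside the closure

private_value <- environment(emitID)$idCounter   # the counter's own state
visible_value <- idCounter                       # the outer variable, untouched
next_id <- emitID()                              # continues from the private state
```

The outer `idCounter` and the one captured by `local()` live in different environments, which is why assigning to one never disturbs the other.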

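An interesting alternative is to use a "factory" pattern to create independent counters. Your question implies that you'll call this function a billion times (hmm, not sure where I got that impression...), so it may make sense to vectorize the call to formatC by filling a buffer of IDs in advance.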

idFactory <- function(buf_n=1000000) {
    curr <- 0L
    last <- -1L
    val <- NULL
    function() {
        if ((curr %% buf_n) == 0L) {
            val <<- formatC(last + seq_len(buf_n), width=9, flag=0, format="d")
            last <<- last + buf_n
            curr <<- 0L
        }
        val[curr <<- curr + 1L]
    }
}
emitID2 <- idFactory()
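
As a rough check of the two properties claimed for the factory, the sketch below repeats the definition with a deliberately tiny buffer (`buf_n = 3L`, chosen only for illustration) so the refill step and the independence of two counters are easy to see. The names `a`, `b`, `ids_a` and `first_b` are hypothetical.

```r
idFactory <- function(buf_n = 1000000) {
    curr <- 0L
    last <- -1L
    val <- NULL
    function() {
        if ((curr %% buf_n) == 0L) {
            # Buffer exhausted (or first call): format the next buf_n IDs at once
            val <<- formatC(last + seq_len(buf_n), width = 9, flag = "0", format = "d")
            last <<- last + buf_n
            curr <<- 0L
        }
        val[curr <<- curr + 1L]
    }
}

a <- idFactory(buf_n = 3L)
b <- idFactory(buf_n = 3L)

# Four calls cross the buffer boundary, forcing one refill
ids_a <- vapply(1:4, function(i) a(), character(1))
# b has its own curr/last/val, so it starts from zero
first_b <- b()
```

Here `ids_a` runs from "000000000" to "000000003" without a gap at the refill, while `first_b` is "000000000" because each closure keeps its own state.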

Then (emitID1 is an instance of the local-variable version above):

> library(microbenchmark)
> microbenchmark(emitID1(), emitID2(), times=100000)
Unit: microseconds
      expr    min     lq median     uq      max neval
 emitID1() 66.363 70.614 72.310 73.603 13753.96 1e+05
 emitID2()  2.240  2.982  4.138  4.676 49593.03 1e+05
> emitID1()
[1] "000100000"
> emitID2()
[1] "000100000"

(The proto solution is about 3 times slower than emitID1, though speed isn't everything.)
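
If the number of records is known up front, the counter can be skipped and all IDs formatted in one vectorized call. This is a sketch under that assumption, which the question does not confirm; `n` and `ids` are illustrative names.

```r
n <- 5L                        # assumed number of records, for illustration only
ids <- formatC(seq_len(n) - 1L, width = 9, flag = "0", format = "d")

# sprintf() with a zero-padded integer format gives the same strings
ids_sprintf <- sprintf("%09d", seq_len(n) - 1L)
```

The resulting vector can be assigned to a whole column at once, for example `records$id <- ids` for a hypothetical data frame `records` with `n` rows.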