

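I need to perform a common task: computing age (in years, months, or weeks) from a date of birth and an arbitrary given date. The catch is that I usually have to do this for a very large number of records (> 300 million), so performance is a key concern.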


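The alternatives I have found so far:

  • Using a common arithmetic procedure (/365.25) (link)
  • Using the functions new_interval() and duration() from the package lubridate (link)
  • Using the function age_calc() from the package eeptools (link, link, link)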


# Some toy birthdates
birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01", 
                       "1962-12-30", "1962-12-31", "1963-01-01", 
                       "2000-06-16", "2000-06-17", "2000-06-18", 
                       "2007-03-18", "2007-03-19", "2007-03-20", 
                       "1968-02-29", "1968-02-29", "1968-02-29"))

# Given dates to calculate the age
givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31", 
                       "2015-12-31", "2015-12-31", "2015-12-31", 
                       "2050-06-17", "2050-06-17", "2050-06-17",
                       "2008-03-19", "2008-03-19", "2008-03-19", 
                       "2015-02-28", "2015-03-01", "2015-03-02"))

# Using a common arithmetic procedure ("Time differences in days"/365.25)
(givendate - birthdate) / 365.25

# Use the package lubridate
library(lubridate)
new_interval(start = birthdate, end = givendate) / 
                     duration(num = 1, units = "years")

# Use the package eeptools
library(eeptools)
age_calc(dob = birthdate, enddate = givendate, units = "years")


# Now let's compare the performance of the alternatives using microbenchmark
library(microbenchmark)
mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = new_interval(start = birthdate, end = givendate) /
                                     duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    times = 1000
)

# And examine the results
mbm


[Figures: microbenchmark results (table); microbenchmark results (plot)]

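
"Because of the way the modern Gregorian calendar is constructed, there is no straightforward arithmetic method that produces a person's age, stated according to common usage (common usage meaning that a person's age should always be an integer that increases exactly on a birthday)." (link)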

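The quoted point can be checked on a single pair from the toy data. This is a minimal sketch; the variable names birth, given, days and approx_age are only for illustration. The given date is exactly the 37th birthday, yet dividing the elapsed days by 365.25 lands just below 37:

```r
# One pair from the toy data: the given date is exactly the 37th birthday.
birth <- as.Date("1978-12-31")
given <- as.Date("2015-12-31")

# Elapsed days: 37 * 365 plus the 9 leap days in between
days <- as.numeric(given - birth)

# Dividing by the average year length undershoots the whole number
approx_age <- days / 365.25

days               # 13514
floor(approx_age)  # 36, although the person turns 37 on this day
```

The shortfall comes from the span containing 9 leap days where the 365.25 average would imply 9.25.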




toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = new_interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years")
)
toy_df[, 3:5] <- floor(toy_df[, 3:5])

    birthdate  givendate arithmetic lubridate eeptools
1  1978-12-30 2015-12-31         37        37       37
2  1978-12-31 2015-12-31         36        37       37
3  1979-01-01 2015-12-31         36        37       36
4  1962-12-30 2015-12-31         53        53       53
5  1962-12-31 2015-12-31         52        53       53
6  1963-01-01 2015-12-31         52        53       52
7  2000-06-16 2050-06-17         50        50       50
8  2000-06-17 2050-06-17         49        50       50
9  2000-06-18 2050-06-17         49        50       49
10 2007-03-18 2008-03-19          1         1        1
11 2007-03-19 2008-03-19          1         1        1
12 2007-03-20 2008-03-19          0         1        0
13 1968-02-29 2015-02-28         46        47       46
14 1968-02-29 2015-03-01         47        47       47
15 1968-02-29 2015-03-02         47        47       47

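I'd be surprised if there were anything faster/simpler than lubridate. If you really need more performance, my suggestion is to use the arithmetic method first and then reprocess all the "close calls" with the lubridate method (e.g., if abs(floor(age) - age) < 0.01, then use lubridate). - Señor O
Thanks. Are you an experienced lubridate user? As I mention in the edited question, I found that it makes mistakes (perhaps more than the arithmetic method), yet I have read in several posts that lubridate is one of the R packages that can compute age accurately. So now I wonder whether I am doing something wrong. (I don't think so; I am basically following the examples, which are straightforward, but I am double-checking.) - Hernando Casas
Using difftime(givendate, birthdate) / 365.25 instead of (givendate - birthdate) / 365.25 is about 5% faster, but that is not a solution. It may be useful if you need to do the arithmetic anyway. - Molx
@Molx good call! -.Date is a more robust wrapper that calls difftime. Faster still should be (unclass(givendate) - unclass(birthdate)) / 365.25, since it skips the overhead of difftime. - MichaelChirico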
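The two-pass idea from the first comment can be sketched as follows. This is a rough illustration rather than anyone's posted code: exact_age and hybrid_age are hypothetical names, the exact step uses base POSIXlt fields instead of lubridate to keep the sketch self-contained, and the closeness test uses round() rather than floor() so that values just below a whole number are also rechecked.

```r
# Exact integer age from calendar fields (hypothetical helper for this sketch).
exact_age <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt <- as.POSIXlt(to)
  # TRUE when the birthday has not yet been reached in the final year
  before_bday <- to_lt$mon < from_lt$mon |
    (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday)
  (to_lt$year - from_lt$year) - before_bday
}

# Fast approximation everywhere, exact recomputation only near whole years.
hybrid_age <- function(from, to, tol = 0.01) {
  approx <- (unclass(to) - unclass(from)) / 365.25
  out <- floor(approx)
  close <- abs(approx - round(approx)) < tol
  out[close] <- exact_age(from[close], to[close])
  out
}

birthdate <- as.Date(c("1978-12-31", "1979-01-01", "2007-03-20", "1968-02-29"))
givendate <- as.Date(c("2015-12-31", "2015-12-31", "2008-03-19", "2015-02-28"))
hybrid_age(birthdate, givendate)
```

How much this saves depends on how many records fall within the tolerance. With tol = 0.01 years (about 3.7 days on either side of a birthday), roughly 2% of uniformly spread records would be rechecked.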

as.period(interval(start = birthdate, end = givendate))

 "37y 0m 1d 0H 0M 0S"   
 "37y 0m 0d 0H 0M 0S"   
 "36y 11m 30d 0H 0M 0S" 
 "46y 11m 30d 1H 0M 0S" 
 "47y 0m 0d 1H 0M 0S"   
 "47y 0m 1d 1H 0M 0S" 

as.period(interval(start = birthdate, end = givendate))$year
lubridate::year(as.period(interval(start = birthdate, end = givendate)))

 [1] 37 37 36 53 53 52 50 50 49  1  1  0 46 47 47 

> mbm
Unit: microseconds
       expr       min        lq       mean    median         uq        max neval cld
 arithmetic   116.595   138.149   181.7547   184.335   196.8565   5556.306  1000  a 
  lubridate 16807.683 17406.255 20388.1410 18053.274 21378.8875 157965.935  1000   b

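new_interval() has been deprecated in the lubridate package, so use interval() instead. Also, the "not meaningful for factors" warning caused by as.period() has since been resolved and no longer appears. - wjchulme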



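age <- function(from, to) {
    # POSIXlt exposes the calendar fields: year, mon (0-11), mday (1-31)
    from_lt = as.POSIXlt(from)
    to_lt = as.POSIXlt(to)

    # Difference in calendar years, before checking for the birthday
    age = to_lt$year - from_lt$year

    # Subtract one year if the birthday has not yet occurred in the final year
    ifelse(to_lt$mon < from_lt$mon |
               (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday),
           age - 1, age)
}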



mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    age = age(from = birthdate, to = givendate),
    times = 1000
)


On my examples it makes no mistakes (and it should not on any example; it is a fairly straightforward function that uses ifelse).

toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years"),
    age = age(from = birthdate, to = givendate)
)
toy_df[, 3:6] <- floor(toy_df[, 3:6])

    birthdate  givendate arithmetic lubridate eeptools age
1  1978-12-30 2015-12-31         37        37       37  37
2  1978-12-31 2015-12-31         36        37       37  37
3  1979-01-01 2015-12-31         36        37       36  36
4  1962-12-30 2015-12-31         53        53       53  53
5  1962-12-31 2015-12-31         52        53       53  53
6  1963-01-01 2015-12-31         52        53       52  52
7  2000-06-16 2050-06-17         50        50       50  50
8  2000-06-17 2050-06-17         49        50       50  50
9  2000-06-18 2050-06-17         49        50       49  49
10 2007-03-18 2008-03-19          1         1        1   1
11 2007-03-19 2008-03-19          1         1        1   1
12 2007-03-20 2008-03-19          0         1        0   0
13 1968-02-29 2015-02-28         46        47       46  46
14 1968-02-29 2015-03-01         47        47       47  47
15 1968-02-29 2015-03-02         47        47       47  47
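The same field-by-field comparison could plausibly be carried over to months and weeks, which the question also asks about. The sketch below is not part of the original answer: age_months and age_weeks are hypothetical names, and the month convention (a month is complete only once the day of month is reached) is an assumption that may not suit every use case.

```r
# Completed months: calendar-month difference, minus one if the
# day of month has not yet been reached (assumed convention).
age_months <- function(from, to) {
  from_lt <- as.POSIXlt(from)
  to_lt <- as.POSIXlt(to)
  12L * (to_lt$year - from_lt$year) + (to_lt$mon - from_lt$mon) -
    (to_lt$mday < from_lt$mday)
}

# Completed weeks: weeks have a fixed length, so integer division
# of the elapsed days is exact.
age_weeks <- function(from, to) {
  (unclass(to) - unclass(from)) %/% 7
}

age_months(as.Date("2007-03-19"), as.Date("2008-03-19"))  # 12
age_months(as.Date("2007-03-20"), as.Date("2008-03-19"))  # 11
age_weeks(as.Date("2007-03-19"), as.Date("2008-03-19"))   # 52 (366 days)
```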


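  1. I want to wait for @Jim to post it as an answer.
  2. I will wait to see whether anyone else comes up with a complete solution (efficient, accurate, and able to produce age in years, months, or weeks as needed).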



return(difftime(e1, e2, units = "days"))

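So the "arithmetic" approach on objects of class Date is really a wrapper around the difftime function. What about difftime? It too carries a lot of overhead if what you want is raw speed.
The key is that Date objects are stored as the number of days since/until January 1, 1970 (though they are not actually stored as integer, hence the birth of the IDate class in data.table), so we can simply subtract these day counts. To avoid dispatching to the -.Date method, however, we have to unclass the inputs:
(unclass(givendate) - unclass(birthdate)) / 365.25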
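A small check that the unclassed subtraction gives the same numbers as the Date method; the variable names via_method and via_unclass are only for illustration:

```r
birthdate <- as.Date(c("1978-12-30", "2000-06-17"))
givendate <- as.Date(c("2015-12-31", "2050-06-17"))

# Dispatches to -.Date, which returns a difftime object
via_method <- as.numeric(givendate - birthdate) / 365.25

# Plain numeric subtraction on the underlying day counts
via_unclass <- (unclass(givendate) - unclass(birthdate)) / 365.25

all.equal(via_method, via_unclass)
```

Only the method dispatch and the difftime bookkeeping are skipped; the arithmetic itself, and therefore its inaccuracy around birthdays, is unchanged.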



NN <- 1e6
birthdate <- as.Date(sprintf('%d-%02d-%02d',
                             sample(1901:2030, NN, TRUE),
                             sample(12, NN, TRUE),
                             sample(28, NN, TRUE)))

#average 30 years, most data between 20 and 40 years
givendate <- birthdate + as.integer(rnorm(NN, mean = 10950, sd = 1000))


library(microbenchmark)
microbenchmark(
  arithmetic = (givendate - birthdate) / 365.25,
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  age = age(from = birthdate, to = givendate),
  fastar = (unclass(givendate) - unclass(birthdate)) / 365.25,
  overlaps = get_age(birthdate, givendate),
  times = 50)
# Unit: milliseconds
#        expr        min         lq      mean     median         uq      max neval  cld
#  arithmetic  28.153465  30.384639  62.96118  31.492764  34.052991 180.9556    50  b  
#   lubridate  94.327968  97.233009 157.30420 102.751351 240.717065 265.0283    50   c 
#         age 338.347756 479.598513 483.84529 483.580981 488.090832 770.1149    50    d
#      fastar   7.740098   7.831528  11.02521   7.913146   8.090902 153.3645    50 a   
#    overlaps 316.408920 458.734073 459.58974 463.806255 470.320072 769.0929    50    d


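I have been digging into this problem and finally arrived at a solution that is a) completely accurate* (in contrast to all the other options proposed so far) and b) reasonably fast (see the benchmark in my other answer). It relies on a bunch of arithmetic I did by hand and on the wonderful foverlaps function from the data.table package.
The essence of the approach is to work from the integer representation of Dates and to recognize that every birth date falls into one of four 1461-day (= 365 * 4 + 1) cycles, depending on when the next year arrives in which it takes 366 days to reach your birthday.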
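The cycle structure can be seen directly in base R. This is an illustrative sketch (the names bday, anniv and year_lengths are not from the answer), using one of the toy birth dates:

```r
bday <- as.Date("1978-12-30")

# Every fourth anniversary of the birth date
anniv <- seq(bday, by = "4 years", length.out = 5)

# Days between consecutive fourth anniversaries: always 1461 in this range
diff(unclass(anniv))

# Lengths of the first four individual years after the birth date:
# exactly one of them has 366 days, and its position is the cycle type
year_lengths <- diff(unclass(seq(bday, by = "year", length.out = 5)))
year_lengths  # 365 366 365 365
```

For this birth date the second year is the long one, because it contains Feb. 29, 1980.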
library(data.table)

get_age <- function(birthdays, ref_dates){
  #bday and ref are assigned inline so they can be reused further down
  x <- data.table(bday <- unclass(birthdays),
                  #rem: how many days has it been since the lapse of the
                  #  most recent quadrennium since your birth?
                  rem = ((ref <- unclass(ref_dates)) - bday) %% 1461)
  #cycle_type: which of the four years following your birthday
  #  was the one that had 366 days? 
  x[ , cycle_type := 
       foverlaps(data.table(start = bdr <- bday %% 1461L, end = bdr),
                 #these intervals were calculated by hand;
                 #  e.g., day 58 is Feb. 28, 1970. I made the judgment
                 #  call to say that those born on Feb. 29 don't
                 #  have their "birthday" until the following March 1st.
                 data.table(start = c(0L, 59L, 424L, 790L, 1155L), 
                            end = c(58L, 423L, 789L, 1154L, 1460L), 
                            val = c(3L, 2L, 1L, 4L, 3L),
                            key = "start,end"))$val]
  I4 <- diag(4L)[ , -4L] #for conciseness below
  #The `by` approach might seem a little abstruse for those
  #  not familiar with `data.table`; see the edit history
  #  for a more palatable version (which is also slightly slower)
  x[ , extra := 
       foverlaps(data.table(start = rem, end = rem),
                 #row cycle_type of I4 adds the extra day to whichever
                 #  of the first three years has 366 days (none for type 4)
                 data.table(start = st <- cumsum(c(0L, rep(365L, 3L) +
                                                     I4[.BY[[1L]], ])),
                            end = c(st[-1L] - 1L, 1461L),
                            int_yrs = 0:3, key = "start,end")
       )[ , int_yrs + (i.start - start) / (end + 1L - start)], by = cycle_type]
  #grand finale -- 4 years for every quadrennium, plus the fraction:
  4L * ((ref - bday) %/% 1461L) + x$extra
}


toy_df <- data.frame(
  birthdate = birthdate,
  givendate = givendate,
  arithmetic = as.numeric((givendate - birthdate) / 365.25),
  lubridate = interval(start = birthdate, end = givendate) /
    duration(num = 1, units = "years"),
  eeptools = age_calc(dob = birthdate, enddate = givendate,
                      units = "years"),
  mine = get_age(birthdate, givendate)
)

#     birthdate  givendate arithmetic lubridate   eeptools       mine
# 1  1978-12-30 2015-12-31 37.0020534 37.027397 37.0027397 37.0027322 #eeptools wrong: will be 366 days until 12/31/16, so fraction is 1/366
# 2  1978-12-31 2015-12-31 36.9993155 37.024658 37.0000000 37.0000000
# 3  1979-01-01 2015-12-31 36.9965777 37.021918 36.9972603 36.9972603
# 4  1962-12-30 2015-12-31 53.0020534 53.038356 53.0027397 53.0027322 #same problem
# 5  1962-12-31 2015-12-31 52.9993155 53.035616 53.0000000 53.0000000
# 6  1963-01-01 2015-12-31 52.9965777 53.032877 52.9972603 52.9972603
# 7  2000-06-16 2050-06-17 50.0013689 50.035616 50.0000000 50.0027397 #eeptools wrong: not exactly the birthday
# 8  2000-06-17 2050-06-17 49.9986311 50.032877 50.9972603 50.0000000 #eeptools wrong: _is_ exactly the birthday
# 9  2000-06-18 2050-06-17 49.9958932 50.030137 49.9945205 49.9972603 #eeptools wrong: fraction should be 364/365
# 10 2007-03-18 2008-03-19  1.0047912  1.005479  1.0027322  1.0027397 #eeptools wrong: 2/29 already passed, only 365 days until 3/19/2009
# 11 2007-03-19 2008-03-19  1.0020534  1.002740  1.0000000  1.0000000
# 12 2007-03-20 2008-03-19  0.9993155  1.000000  0.9966839  0.9972678 #eeptools wrong: we passed 2/29, so should be 365/366
# 13 1968-02-29 2015-02-28 46.9979466 47.030137 46.9977019 46.9972603 #my judgment: birthday occurs on 3/1 for 2/29 babies, so 364/365 the way there
# 14 1968-02-29 2015-03-01 47.0006845 47.032877 47.0000000 47.0000000
# 15 1968-02-29 2015-03-02 47.0034223 47.035616 47.0027397 47.0027322


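I also made a lot of progress using only base functions, but a) it was rather ugly (it needed a non-linear transformation of 0-1460 to avoid nested ifelse statements, etc.) and b) in the end I could not avoid a for loop (to apply it over the whole list of dates), so I decided that would slow things down too much. (For posterity, the transformation is x1 = (unclass(birthdays) - 59) %% 1461; x2 = x1 * (729 - x1) / 402232 + x1.)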


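* (for date ranges that do not include a non-leap century year; I believe extending the approach to handle such dates should not be too onerous)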
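One way to find the records affected by this caveat is to flag spans that touch a century year that is not a leap year (1900, 2100, ...). The helper below is a hypothetical sketch, not part of the answer. It is deliberately conservative: it flags any span whose calendar years include such a year, even if the span does not actually cross the end of February.

```r
# Hypothetical helper: TRUE when the calendar years from `from` to `to`
# include a century year that is not a leap year (e.g. 1900 or 2100).
spans_nonleap_century <- function(from, to) {
  y1 <- as.POSIXlt(from)$year + 1900L
  y2 <- as.POSIXlt(to)$year + 1900L
  # Number of non-leap century years in 1..y
  n_nonleap <- function(y) y %/% 100L - y %/% 400L
  n_nonleap(y2) - n_nonleap(y1 - 1L) > 0L
}

# The 1461-day cycle breaks here: 1900 had no Feb. 29
as.numeric(as.Date("1900-03-01") - as.Date("1896-03-01"))  # 1460

spans_nonleap_century(as.Date("1896-03-01"), as.Date("1900-03-01"))  # TRUE
spans_nonleap_century(as.Date("1978-12-30"), as.Date("2015-12-31"))  # FALSE
```

Flagged records could then be routed to a slower exact method, while the rest go through the fast path.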

To me this seems to be the only solution that works. Well done! - Evan O.
