Changing a column from birth date to age in R

21

I am using data.table for the first time.

My table has about 400,000 records, and I need to convert them from birth dates to ages.

What is the best way to do this?


For anyone else who lands here, I found the most helpful answer in another thread: https://dev59.com/XHA65IYBdhLWcg3w1SRC#25450756 - Jaccar
11 Answers

39

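I've been thinking about this and have been dissatisfied with the two answers given so far. I like using lubridate, as @KFB did, but I also want everything wrapped up in a function, as in my answer using the eeptools package. So here is a wrapper function that uses the lubridate interval method and has some nice options: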

#' Calculate age
#' 
#' By default, calculates the typical "age in years", with a
#' \code{floor} applied so that you are, e.g., 5 years old from
#' 5th birthday through the day before your 6th birthday. Set
#' \code{floor = FALSE} to return decimal ages, and change \code{units}
#' for units other than years.
#' @param dob date-of-birth, the day to start calculating age.
#' @param age.day the date on which age is to be calculated.
#' @param units unit to measure age in. Defaults to \code{"years"}. Passed to \code{\link[lubridate]{duration}}.
#' @param floor boolean for whether or not to floor the result. Defaults to \code{TRUE}.
#' @return Age in \code{units}. Will be an integer if \code{floor = TRUE}.
#' @examples
#' my.dob <- as.Date('1983-10-20')
#' age(my.dob)
#' age(my.dob, units = "minutes")
#' age(my.dob, floor = FALSE)
age <- function(dob, age.day = lubridate::today(), units = "years", floor = TRUE) {
    calc.age = lubridate::interval(dob, age.day) / lubridate::duration(num = 1, units = units)
    if (floor) return(as.integer(floor(calc.age)))
    return(calc.age)
}

Usage examples:

> my.dob <- as.Date('1983-10-20')

> age(my.dob)
[1] 31

> age(my.dob, floor = FALSE)
[1] 31.15616

> age(my.dob, units = "minutes")
[1] 16375680

> age(seq(my.dob, length.out = 6, by = "years"))
[1] 31 30 29 28 27 26

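This is the answer I was looking for. (We meet again) - Ben
Warning message: 'new_interval' is deprecated; use 'interval' instead. Deprecated in version '1.5.0'. - malajisi
3
This has a problem around birthdays. For example, age(dob = as.Date("1970-06-01"), age.day = as.Date("2018-05-31")) (the day before someone's 48th birthday) should return 47, but it returns 48 (48.03014 with floor = FALSE). There may be a neater way, but as.numeric(as.period(interval(as.Date("1970-06-01"), as.Date("2018-05-31"))), "years") seems better (it returns 47.9988). - Hobo
Does this account for the extra days in leap years? It appears to divide the interval by a fixed 365 days, but not every year has 365 days. - Bravoking
1
Could you add @import lubridate, or prefix the calls in the function with lubridate::, so that people can reuse it easily? - Brandon Rose MD MPH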
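The birthday case raised in the comments can be checked with base R alone. The following is only a sketch of the arithmetic, not the answer's function: the variable names are made up, and the calendar comparison is one possible way to get whole years.

```r
dob <- as.Date("1970-06-01")
on  <- as.Date("2018-05-31")   # the day before the 48th birthday

# Dividing elapsed days by a fixed 365-day year overcounts,
# because the leap days in between are treated as extra age.
days_elapsed <- as.numeric(on - dob)    # 17531
naive_years  <- days_elapsed / 365      # about 48.03

# Calendar comparison: year difference, minus one if this year's
# birthday has not happened yet (POSIXlt months are 0-based).
d <- as.POSIXlt(dob)
o <- as.POSIXlt(on)
before_birthday <- o$mon < d$mon || (o$mon == d$mon && o$mday < d$mday)
whole_years <- (o$year - d$year) - as.integer(before_birthday)   # 47
```

The gap between the two results is the 12 leap days between the dates, which a fixed 365-day year counts as additional age.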

31

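In the comments on this blog post, I found the age_calc function in the eeptools package. It handles edge cases (leap years, etc.), checks its inputs, and looks quite robust.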

library(eeptools)
x <- as.Date(c("2011-01-01", "1996-02-29"))
# enddate defaults to the current date, so the values depend on when this is run
age_calc(x) # default is age in months
[1] 46.73333 224.83118
age_calc(x, units = "years") # but you can set it to years
[1] 3.893151 18.731507
floor(age_calc(x, units = "years"))
[1] 3 18

For your data:

yourdata$age <- floor(age_calc(yourdata$birthdate, units = "years"))

assuming you want age in whole years.


7
Assuming you have a data.table, you could do the following:
library(data.table)
library(lubridate)
# toy data
X = data.table(birth=seq(from=as.Date("1970-01-01"), to=as.Date("1980-12-31"), by="year"))
Sys.Date()

Option 1: use as.period from the lubridate package.
X[, age := as.period(Sys.Date() - birth)][]
         birth                   age
 1: 1970-01-01  44y 0m 327d 0H 0M 0S
 2: 1971-01-01  43y 0m 327d 6H 0M 0S
 3: 1972-01-01 42y 0m 327d 12H 0M 0S
 4: 1973-01-01 41y 0m 326d 18H 0M 0S
 5: 1974-01-01  40y 0m 327d 0H 0M 0S
 6: 1975-01-01  39y 0m 327d 6H 0M 0S
 7: 1976-01-01 38y 0m 327d 12H 0M 0S
 8: 1977-01-01 37y 0m 326d 18H 0M 0S
 9: 1978-01-01  36y 0m 327d 0H 0M 0S
10: 1979-01-01  35y 0m 327d 6H 0M 0S
11: 1980-01-01 34y 0m 327d 12H 0M 0S

Option 2: if you do not like the format of Option 1, you can do the following:
yr = duration(num = 1, units = "years")
X[, age := interval(birth, Sys.Date())/yr][]
# you get
         birth      age
 1: 1970-01-01 44.92603
 2: 1971-01-01 43.92603
 3: 1972-01-01 42.92603
 4: 1973-01-01 41.92329
 5: 1974-01-01 40.92329
 6: 1975-01-01 39.92329
 7: 1976-01-01 38.92329
 8: 1977-01-01 37.92055
 9: 1978-01-01 36.92055
10: 1979-01-01 35.92055
11: 1980-01-01 34.92055

I think Option 2 is preferable.

1
Option 2 has a problem around birthdays; see my comment on @Gregor's answer. As a concrete example, yr = duration(num = 1, units = "years"); birth <- as.Date("1970-06-01"); age_as_at <- as.Date("2018-05-31"); interval(birth, age_as_at)/yr should be less than 48. - Hobo

3
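I wanted an implementation that adds no dependencies beyond data.table, which is usually my only dependency. data.table is needed only for mday, the day of the month.

Development function

This function follows how I logically think about someone's age. I start with [current year] - [birth year] - 1, then add 1 if they have already had their birthday in the current year. To check for that offset, I first consider the month, then (if necessary) the day.

Here is the step-by-step implementation: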

agecalc <- function(origin, current){
    require(data.table)
    y <- year(current) - year(origin) - 1
    offset <- 0
    if(month(current) > month(origin)) offset <- 1
    if(month(current) == month(origin) & 
       mday(current) >= mday(origin)) offset <- 1
    age <- y + offset
    return(age)
}

Production function

This is the same logic, refactored and vectorized:

agecalc <- function(origin, current){
    require(data.table)
    age <- year(current) - year(origin) - 1
    ii <- (month(current) > month(origin)) | (month(current) == month(origin) & 
                                                  mday(current) >= mday(origin))
    age[ii] <- age[ii] + 1
    return(age)
}
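Since the question is about a data.table column, the vectorized logic above might be applied with := roughly as follows. This is a sketch only: the table people, the column birth, the reference date, and the helper name age_on are all made up for illustration.

```r
library(data.table)

# Hypothetical helper mirroring the vectorized logic above:
# start one year short, then add a year where the birthday has passed.
age_on <- function(origin, current) {
  age <- year(current) - year(origin) - 1L
  had_birthday <- (month(current) > month(origin)) |
    (month(current) == month(origin) & mday(current) >= mday(origin))
  age + as.integer(had_birthday)
}

# Toy data; a real table would already hold the birth dates as Date or IDate
people <- data.table(birth = as.Date(c("1970-06-01", "1985-08-13", "2000-02-29")))
ref <- as.Date("2018-05-31")

# Add the age column by reference, without copying the table
people[, age := age_on(birth, ref)]
people$age   # 47 32 18
```

Because := modifies the table in place, this pattern should scale to a few hundred thousand rows without copying the data.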

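Experimental function using strings

You can also do a string comparison on the month/day portion. There may be cases where this is more efficient, for example if the year is stored as a number and the birth date as a string.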

agecalc_strings <- function(origin, current){
    origin <- as.character(origin)
    current <- as.character(current)
    
    age <- as.numeric(substr(current, 1, 4)) - as.numeric(substr(origin, 1, 4)) - 1
    if(substr(current, 6, 10) >= substr(origin, 6, 10)){
        age <- age + 1
    }
    return(age)
}

Some tests of the vectorized "production" version:

## Examples for specific dates to test the calculation with things like 
## beginning and end of months, and leap years:
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1985-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-12"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-08-13"))
agecalc(as.IDate("1985-08-13"), as.IDate("1986-09-12"))

agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2000-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2001-03-01"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-28"))
agecalc(as.IDate("2000-02-29"), as.IDate("2004-02-29"))
agecalc(as.IDate("2000-02-29"), as.IDate("2011-03-01"))

## Testing every age for every day over several years
## This test requires vectorized version:
d <- data.table(d=as.IDate("2000-01-01") + 0:10000)
d[ , b1 := as.IDate("2000-08-15")]
d[ , b2 := as.IDate("2000-02-29")]
d[ , age1_num := (d - b1) / 365]
d[ , age2_num := (d - b2) / 365]
d[ , age1 := agecalc(b1, d)]
d[ , age2 := agecalc(b2, d)]
d

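Below is a trivial plot of the ages as numeric and as integer values. As you can see, the integer ages form a stair-step pattern that touches (but stays below) the straight line of the numeric ages.
plot(as.numeric(age1_num) ~ d, data = d, type = "l",
     ylab = "ages", main = "ages plotted")
lines(age1 ~ d, data = d, col = "blue")

[Figure: "ages plotted", numeric age as a line with integer age overlaid in blue]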


1

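I prefer to use the lubridate package for this, borrowing syntax I first encountered in another post.

To standardize the input dates as R date objects, it is best to use lubridate::mdy(), lubridate::ymd(), or a similar function, as applicable. You can use the interval() function to generate an interval describing the time elapsed between the two dates, and then use the duration() function to define how that interval should be "diced".

The simplest case, calculating age between two dates using current syntax in R, is outlined below.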

df$DOB <- mdy(df$DOB)
df$EndDate <- mdy(df$EndDate)
df$Calc_Age <- interval(start = df$DOB, end = df$EndDate) /
                     duration(num = 1, units = "years")

The age can be rounded down to the nearest whole number with the base R floor() function.

df$Calc_AgeF <- floor(df$Calc_Age)

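Alternatively, the digits= argument of the base R round() function can be used to round up or down to a specified number of decimal places in the returned value, like so: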

df$Calc_Age2 <- round(df$Calc_Age, digits = 2) ## 2 decimals
df$Calc_Age0 <- round(df$Calc_Age, digits = 0) ## nearest integer

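It is worth noting that once the input dates have passed through the calculation step above (i.e., the interval() and duration() functions), the returned value is numeric and no longer a date object in R. This matters because lubridate::floor_date() is strictly limited to date-time objects.
The syntax above works whether the input dates are in a data.table or a data.frame object.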
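To make the difference between floor() and round() concrete, here is a small base R sketch. The decimal ages are made-up values for illustration, not output from the code above.

```r
# Made-up decimal ages standing in for a Calc_Age column
calc_age <- c(31.15616, 31.6, 47.9988)

age_floor <- floor(calc_age)               # 31 31 47: completed years only
age_round <- round(calc_age, digits = 0)   # 31 32 48: nearest integer
age_2dp   <- round(calc_age, digits = 2)   # two decimal places
```

Rounding to the nearest integer reports 32 and 48 for people who have not yet reached those birthdays, which is why floor() is usually the one wanted for age in whole years.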

This has the same problem around birthdays as the other answers. - Hobo

0
(Sys.Date() - yourDate) / 365.25

1
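Nice, but not 100% robust. - s_baldur
I don't think the .25 part matters, but this fails for leap-year birthdays. Also, you would want trunc to get integer ages. - geneorama
1
You have only addressed one isolated case and have not attempted to answer the programming question that was asked. Please consider revising. - OTStats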
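A quick base R check shows one way the 365.25 shortcut can be off by a year. The dates are chosen purely for illustration.

```r
dob <- as.Date("2001-01-01")
on  <- as.Date("2002-01-01")     # exactly the first birthday

elapsed    <- as.numeric(on - dob)   # 365 days, no leap day in between
approx_age <- elapsed / 365.25       # just under 1
trunc(approx_age)                    # 0, although the person turns 1 today
```

Any birthday that follows a run of non-leap years lands slightly below the integer, so truncating the result reports the previous age on the birthday itself.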

0

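I was not happy with any of the answers when it comes to calculating age in months or years while handling leap years, so this is my function using the lubridate package.

Basically, it slices the interval between from and to into (at most) yearly chunks, and adjusts each chunk according to whether it falls in a leap year. The total interval is the sum of the ages of the chunks.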

library(lubridate)

#' Get Age of Date relative to Another Date
#'
#' @param from,to the date or dates to consider
#' @param units the units to consider
#' @param floor logical as to whether to floor the result
#' @param simple logical as to whether to do a simple calculation, a simple calculation doesn't account for leap year.
#' @author Nicholas Hamilton
#' @export
age <- function(from, to = today(), units = "years", floor = FALSE, simple = FALSE) {

  #Account for Leap Year if Working in Months and Years
  if(!simple && length(grep("^(month|year)",units)) > 0){
    df = data.frame(from,to)
    calc = sapply(1:nrow(df),function(r){

      #Start and Finish Points
      st = df[r,1]; fn = df[r,2]

      #If there is no difference, age is zero
      if(st == fn){ return(0) }

      #If there is a difference, age is not zero and needs to be calculated
      sign = +1 #Age Direction
      if(st > fn){ tmp = st; st = fn; fn = tmp; sign = -1 } #Swap and Change sign

      #Determine the slice-points
      mid   = ceiling_date(seq(st,fn,by='year'),'year')

      #Build the sequence
      dates = unique( c(st,mid,fn) )
      dates = dates[which(dates >= st & dates <= fn)]

      #Determine the age of the chunks
      chunks = sapply(head(seq_along(dates),-1),function(ix){
        k = 365/( 365 + leap_year(dates[ix]) )
        k*interval( dates[ix], dates[ix+1] ) / duration(num = 1, units = units)
      })

      #Sum the Chunks, and account for direction
      sign*sum(chunks)
    })

  #If Simple Calculation or Not Months or Not years
  }else{
    calc = interval(from,to) / duration(num = 1, units = units)
  }

  if (floor) calc = as.integer(floor(calc))
  calc
}

0
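Note that age_calc from the eeptools package fails in some cases, particularly around birthdays when the year 2000 is involved.
Here are some examples that do not work in age_calc: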
library(lubridate)
library(eeptools)
age_calc(ymd("1997-04-21"), ymd("2000-04-21"), units = "years")
age_calc(ymd("2000-04-21"), ymd("2019-04-21"), units = "years")
age_calc(ymd("2000-04-21"), ymd("2016-04-21"), units = "years")

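Some of the other solutions also produce results that are not intuitive for the decimal age I wanted when leap years are involved. I like @James_D's solution, which is precise and concise, but I wanted the decimal age calculated as the number of complete years plus the fraction of the year completed between the last birthday and the next one (which is out of 365 or 366 days, depending on the year). For leap-year birthdays, I use lubridate's rollback function so that March 1st is used in non-leap years following a February 29th. I used some of @geneorama's test cases and added some of my own, and the output matches what I expect.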
library(lubridate)

# Calculate precise age from birthdate in ymd format
age_calculation <- function(birth_date, later_year) {
  if (birth_date > later_year)
  {
    stop("Birth date is after the desired date!")
  }
  # Calculate the most recent birthday of the person based on the desired year
  latest_bday <- ymd(add_with_rollback(birth_date, years((year(later_year) - year(birth_date))), roll_to_first = TRUE))
  # Get amount of days between the desired date and the latest birthday
  days_between <- as.numeric(days(later_year - latest_bday), units = "days")
  # Get how many days are in the year between their most recent and next bdays
  year_length <- as.numeric(days((add_with_rollback(latest_bday, years(1), roll_to_first = TRUE)) - latest_bday), units = "days")
  # Get the year fraction (amount of year completed before next birthday)
  fraction_year <- days_between/year_length
  # Sum the difference of years with the year fraction
  age_sum <- (year(later_year) - year(birth_date)) + fraction_year
  return(age_sum)
}

test_list <- list(c("1985-08-13", "1986-08-12"),
                    c("1985-08-13", "1985-08-13"),
                    c("1985-08-13", "1986-08-13"),
                    c("1985-08-13", "1986-09-12"),
                    c("2000-02-29", "2000-02-29"),
                    c("2000-02-29", "2000-03-01"),
                    c("2000-02-29", "2001-02-28"),
                    c("2000-02-29", "2004-02-29"), 
                    c("2000-02-29", "2011-03-01"),
                    c("1997-04-21", "2000-04-21"),
                    c("2000-04-21", "2016-04-21"),
                    c("2000-04-21", "2019-04-21"),
                    c("2017-06-15", "2018-04-30"),
                    c("2019-04-20", "2019-08-24"),
                    c("2020-05-25", "2021-11-25"),
                    c("2020-11-25", "2021-11-24"),
                    c("2020-11-24", "2020-11-25"),
                    c("2020-02-28", "2020-02-29"),
                    c("2020-02-29", "2020-02-28"))
  
for (i in 1:length(test_list))
{
  print(paste0("Dates from ", test_list[[i]][1], " to ", test_list[[i]][2]))
  result <- age_calculation(ymd(test_list[[i]][1]), ymd(test_list[[i]][2]))
  print(result)
}

Output:

[1] "Dates from 1985-08-13 to 1986-08-12"
[1] 0.9972603
[1] "Dates from 1985-08-13 to 1985-08-13"
[1] 0
[1] "Dates from 1985-08-13 to 1986-08-13"
[1] 1
[1] "Dates from 1985-08-13 to 1986-09-12"
[1] 1.082192
[1] "Dates from 2000-02-29 to 2000-02-29"
[1] 0
[1] "Dates from 2000-02-29 to 2000-03-01"
[1] 0.00273224
[1] "Dates from 2000-02-29 to 2001-02-28"
[1] 0.9972603
[1] "Dates from 2000-02-29 to 2004-02-29"
[1] 4
[1] "Dates from 2000-02-29 to 2011-03-01"
[1] 11
[1] "Dates from 1997-04-21 to 2000-04-21"
[1] 3
[1] "Dates from 2000-04-21 to 2016-04-21"
[1] 16
[1] "Dates from 2000-04-21 to 2019-04-21"
[1] 19
[1] "Dates from 2017-06-15 to 2018-04-30"
[1] 0.8739726
[1] "Dates from 2019-04-20 to 2019-08-24"
[1] 0.3442623
[1] "Dates from 2020-05-25 to 2021-11-25"
[1] 1.50411
[1] "Dates from 2020-11-25 to 2021-11-24"
[1] 0.9972603
[1] "Dates from 2020-11-24 to 2020-11-25"
[1] 0.002739726
[1] "Dates from 2020-02-28 to 2020-02-29"
[1] 0.00273224
[1] "Dates from 2020-02-29 to 2020-02-28"
Error in age_calculation(ymd(test_list[[i]][1]), ymd(test_list[[i]][2])) : 
  Birth date is after the desired date!

As others have said, the trunc function works well for getting integer ages.


0
Here is a (I think simpler) solution using lubridate:
library(lubridate)

age <- function(dob, on.day=today()) {
    intvl <- interval(dob, on.day)
    prd <- as.period(intvl)
    return(prd@year)
}

0

A very simple way of calculating the age between two dates without using any additional packages could be:

df$age = with(df, as.Date(date_2, "%Y-%m-%d") - as.Date(date_1, "%Y-%m-%d"))
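Note that the subtraction above yields elapsed days rather than years. If whole years are wanted with base R only, something along these lines might work. It is a sketch under stated assumptions: the helper name age_base and the sample rows are made up, and the month/day comparison means someone born on February 29 is counted as a year older from March 1 in non-leap years.

```r
# Hypothetical helper: whole-year age using only base R, vectorized
age_base <- function(date_1, date_2) {
  from <- as.POSIXlt(as.Date(date_1))
  to   <- as.POSIXlt(as.Date(date_2))
  # TRUE where the birthday has not yet occurred in the year of date_2
  before_birthday <- (to$mon < from$mon) |
    (to$mon == from$mon & to$mday < from$mday)
  (to$year - from$year) - as.integer(before_birthday)
}

# Made-up rows covering a pre-birthday date, a Feb 29 birth, and an exact birthday
df <- data.frame(date_1 = c("1970-06-01", "2000-02-29", "2001-01-01"),
                 date_2 = c("2018-05-31", "2001-02-28", "2002-01-01"))

df$days <- as.numeric(as.Date(df$date_2) - as.Date(df$date_1))  # 17531 365 365
df$age  <- age_base(df$date_1, df$date_2)                       # 47 0 1
```

The second and third rows are both 365 days apart yet give different ages, which is why elapsed days alone cannot be converted to whole years by division.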
