根据直接和间接关系对个人进行分组。

3

我希望基于地址和地块所有权将个人分组到不同的家庭中。如果人们住在相同的地址并且直接或间接拥有至少一个地块的所有权,则他们属于同一户人家。

个人之间的链接可以是直接的,即两个人共享一个地块,但链接也可以是间接的,通过交叉连成链 - 两个人共享一个地块,并且其中一个人与其他人之一也共享一个地块,并且所有人都住在同一地址。

以下是一些例子:

  • 如果一个人(9)独自居住在他的地址(C),即使另一个人(6)也拥有他或她的地块(s),他也会独自拥有家庭。
  • 如果两个人(12和13)住在同一地址(F)并拥有相同的地块(w),则他们属于同一户人家。但是,如果三个人住在同一地址(B),但只有两个人(7和8)拥有相同的地块(r),第三个人(6)住在这个地址(B)但拥有另一个地块(m),那么只有拥有相同地块的两个人来自同一户人家。
  • 如果在同一地址(A),有4个人住(1、2、3和4),如果人们(1、2和3)通过拥有几个地块(m、n和o)相连,则他们属于同一户人家,而同时也住在这个地址的人(4)没有拥有这3个地块之一而是另一个(p)的所有权,因此不属于同一户人家。

我有三个变量:地址ID、所有者ID和地块ID。我想得到一个家庭号。以下是一个示例表格:

 id_address id_owner id_parcel id_household
          A        1         m            1
          A        1         n            1
          A        2         n            1
          A        2         o            1
          A        3         o            1
          A        4         p            2
          A        5         q            3
          B        6         s            4
          B        7         r            5
          B        8         r            5
          C        9         s            6
          D       10         t            7
          E       11         u            8
          E       11         v            8
          F       12         w            9
          F       13         w            9

我的第一反应是循环,但是我有 800,000 行数据,这可能会花费太长时间。

以下是示例数据,其中“id_household”是我想创建的变量:

structure(list(id_address = c("A", "A", "A", "A", "A", "A", "A", 
"B", "B", "B", "C", "D", "E", "E", "F", "F"), id_owner = c(1L, 
1L, 2L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 11L, 12L, 13L
), id_parcel = c("m", "n", "n", "o", "o", "p", "q", "s", "r", 
"r", "s", "t", "u", "v", "w", "w"), id_household = c(1L, 1L, 
1L, 1L, 1L, 2L, 3L, 4L, 5L, 5L, 6L, 7L, 8L, 8L, 9L, 9L)), class = "data.frame", row.names = c(NA, 
-16L))

我认为可以在这里使用 igraph::components。例如,请参见识别链接的一组连续剧集,以及其中的链接。但是,您需要按组(即“id_address”)进行操作,并将成员资格加入原始数据。 - Henrik
另一个类似的问题:跟踪字段更改的函数。我敢打赌有人能找到一个合适的规范答案 ;) - Henrik
这里提出了什么问题/问题? - TylerH
2个回答

3
您的问题可以看作是一个图形问题,igraph包提供了出色的工具。列'id_owner'和'id_parcel'可以被视为边缘列表。函数components提供了“每个顶点所属的簇标识”。我使用data.table进行通用数据处理。
library(data.table)
library(igraph)

setDT(d)
d2 = d[ , {
  # create graph. columns id_owner and id_parcel are treated as an edge list.
  g = graph_from_data_frame(.SD)

  # get components of the graph that are directly or indirectly connected
  mem = components(g)$membership

  # grab the memberships and their names (i.e. the vertices) 
  .(id_parcel = names(mem), mem = mem)
 
  # do the above for each id_address
}, by = id_address]

# join the memberships to the original data
# paste with id_address for uniqueness
d[d2, on = .(id_address, id_parcel), id := paste0(id_address, mem)]

# if you want a consecutive integer as 'id', to make it agree with 'id_household'
d[ , id2 := as.integer(as.factor(id))]

输出:

d
#     id_address id_owner id_parcel id_household id id2
#  1:          A        1         m            1 A1   1
#  2:          A        1         n            1 A1   1
#  3:          A        2         n            1 A1   1
#  4:          A        2         o            1 A1   1
#  5:          A        3         o            1 A1   1
#  6:          A        4         p            2 A2   2
#  7:          A        5         q            3 A3   3
#  8:          B        6         s            4 B1   4
#  9:          B        7         r            5 B2   5
# 10:          B        8         r            5 B2   5
# 11:          C        9         s            6 C1   6
# 12:          D       10         t            7 D1   7
# 13:          E       11         u            8 E1   8
# 14:          E       11         v            8 E1   8
# 15:          F       12         w            9 F1   9
# 16:          F       13         w            9 F1   9

一种避免使用 by 操作的替代方法。但是,添加了其他一些步骤,因此其效率取决于数据的结构。

首先,创建“组合变量”,其中地址与包裹和所有者分别连接在一起。创建membership。通过拆分名称 (tstrsplit(names(mem), "_", fixed = TRUE)) 检索原始列。将成员资格加入到原始数据中。

d[ , `:=`(
  address_parcel = paste(id_address, id_parcel, sep = "_"),
  address_owner = paste(id_address, id_owner, sep = "_"))]

d2 = d[ , {
  g = graph_from_data_frame(.SD[ , .(address_owner, address_parcel)])
  mem = components(g)$membership
  c(tstrsplit(names(mem), "_", fixed = TRUE), .(mem = mem))
}]

d[d2, on = c(id_address = "V1", id_parcel = "V2"), id_hh := mem]

输出:

d
#     id_address id_owner id_parcel id_household id_hh
#  1:          A        1         m            1     1
#  2:          A        1         n            1     1
#  3:          A        2         n            1     1
#  4:          A        2         o            1     1
#  5:          A        3         o            1     1
#  6:          A        4         p            2     2
#  7:          A        5         q            3     3
#  8:          B        6         s            4     4
#  9:          B        7         r            5     5
# 10:          B        8         r            5     5
# 11:          C        9         s            6     6
# 12:          D       10         t            7     7
# 13:          E       11         u            8     8
# 14:          E       11         v            8     8
# 15:          F       12         w            9     9
# 16:          F       13         w            9     9

在更大的数据上测量两种方案的性能(原始数据重复1e4次并在每个块内创建新的ID)。对于这个特定的数据,第二个方案避免使用by,速度约快100倍。

# prepare toy data
d1 = as.data.table(d)
n = 1e4
dL = d1[rep(1:.N, n)]

# make unique id within the repeated data frames
dL[ , `:=`(
  id_address = paste(rep(1:n, each = nrow(d1)), sep = ".", id_address),
  id_owner = paste(rep(1:n, each = nrow(d1)), sep = ".", id_owner),
  id_parcel = paste(rep(1:n, each = nrow(d1)), sep = ".", id_parcel)
)]

替代方案1:按地址 by

dL1 = copy(dL) 

system.time({
d2 = dL1[ , {
  g = graph_from_data_frame(.SD)
  mem = components(g)$membership
  .(id_parcel = names(mem), mem = mem)
}, by = id_address]

dL1[d2, on = .(id_address, id_parcel), id_hh := paste(id_address, mem, sep = "_")]
})

#  user  system elapsed 
# 59.46    7.68   67.11

备选方案2. 组合变量和拆分:

dL2 = copy(dL)

system.time({
dL2[ , `:=`(
  address_parcel = paste(id_address, id_parcel, sep = "_"),
  address_owner = paste(id_address, id_owner, sep = "_"))]

d3 = dL2[ , {
  g = graph_from_data_frame(.SD[ , .(address_owner, address_parcel)])
  mem = components(g)$membership
  c(tstrsplit(names(mem), "_", fixed = TRUE), .(mem = mem))
}]

dL2[d3, on = c(id_address = "V1", id_parcel = "V2"), id_hh := mem]
})

# user  system elapsed 
# 0.47    0.24    0.57

测试相等性:

all.equal(as.integer(as.factor(stringi::stri_pad_left(dL1$id_hh, 9, "0"))),
          dL2$id_hh) 
# TRUE

你的代码在我的大约 800,000 行数据集上运行了约 7 分钟。 - Hugo Périlleux Sanchez
@HugoPérilleuxSanchez 请注意,我更新了一个非“by”替代方案,它在由重复您的原始数据组成的玩具数据上至少更快。 - Henrik
如果您能在真实数据上测试新的替代方案,那将非常有趣。它是否有效?您是否发现了类似的速度提升?干杯! - Henrik
1
在我的数据集上,大约5秒钟就完成了。谢谢!我现在会尝试理解你的代码! - Hugo Périlleux Sanchez
1
感谢您的反馈!如果您需要澄清或发现任何缺陷,请不要犹豫,随时提问。祝愉快! - Henrik

0
一位朋友同事用Fortran解决了这个问题,非常快(在排序文件和编译程序后不到一秒钟):
      program Nohousehold_
      character*1 Passer
      character*13 id_address,id_address1,id_address0,id_addressBL
      character*15 id_owner,id_owner1
      character*23 id_parcel
      character*15 L_owner(1000)
      character*28 L_owner_parcel(1000,1000)
      Integer No_owner, Nbparcel(1000)
      Integer Nohouseholdowner(1000)
      Integer NohouseholdTT
      id_address0='0 0          '
      id_addressBL='             '

      NohouseholdTT=0

      oownern(1,file='data_a.txt')
      oownern(2,file='RESULT3.TXT')
      Read(1,19)Passer             
 19   format(a1)

 81   read(1,11)id_address,id_owner,id_parcel
      if(id_address.eq.id_addressBL)go to 81
      id_owner1='ZZZZZZZZZZZZZZZ'
 82   continue
      if(id_address.eq.id_address0)then
        if(id_owner.ne.id_owner1)then
          NohouseholdTT=NohouseholdTT+1
          write(2,22)id_owner,NohouseholdTT
          id_owner1=id_owner
        endif
        read(1,11)id_address,id_owner,id_parcel
        go to 82
       else
        go to 83
      endif
     
 83   continue

      n=1
 11   format(a13,1x,a15,1x,a28)
 22   format(a15,1x,i10,1x,a28)

      id_address1=id_address
      id_owner1=id_owner
      No_owner=1
      L_owner(1)=id_owner
      Nbparcel(1)=1
      L_owner_parcel(1,1)=id_parcel
      Nohouseholdowner(1)=0
      n=1
      nFini=0
 14   continue
      n=n+1
      read(1,11,end=90)id_address,id_owner,id_parcel
      if(id_address.eq.id_address1)then
         if(id_owner.eq.id_owner1)then
            Nbparcel(No_owner)=Nbparcel(No_owner)+1
            L_owner_parcel(No_owner,Nbparcel(No_owner))=id_parcel
          else
            No_owner=No_owner+1
            L_owner(No_owner)=id_owner
            Nbparcel(No_owner)=1
            L_owner_parcel(No_owner,1)=id_parcel
            Nohouseholdowner(No_owner)=0
            id_owner1=id_owner
         endif
       ELSE
 556     continue
         do 1 Noowner1=1,No_owner 
          if(Nohouseholdowner(Noowner1).eq.0)then
            NohouseholdTT=NohouseholdTT+1
            Nohouseholdowner(Noowner1)=NohouseholdTT
            write(2,22)L_owner(Noowner1),NohouseholdTT
          endif
          do 2 Noowner2=(Noowner1+1),No_owner   
            do 201 Nbparcel1=1,Nbparcel(Noowner1)
              do 202 Nbparcel2=1,Nbparcel(Noowner2)
               if(L_owner_parcel(Noowner1,Nbparcel1).eq.L_owner_parcel(Noowner2,Nbparcel2))then
               if(Nohouseholdowner(Noowner2).eq.0)then
                Nohouseholdowner(Noowner2)=Nohouseholdowner(Noowner1)
                write(2,22)L_owner(Noowner2),NohouseholdTT
                go to 203
               endif
               endif
 202          continue
 201        continue 
 203        continue
 2        continue
 1       continue

         if(NFini.eq.1)go to 91

         id_address1=id_address
         id_owner1=id_owner
         No_owner=1
         L_owner(1)=id_owner
         Nbparcel(1)=1
         L_owner_parcel(1,1)=id_parcel
         Nohouseholdowner(1)=0

      endif
      go to 14
 90   continue
      NFini=1
      go to 556
 91   continue

      close(1)
      close(2)

      end


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接