Finding rows with duplicated values across a small set of columns in R

Question

3

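Suppose I have a data.table with an id column and four other columns of integer values. How can I efficiently find the rows in which at least two of those four columns hold the same value?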

fooTbl = data.table(id = c('a', 'b'), ind1=c(1,2), ind2=c(3,4), ind3=c(2,3), ind4=c(2,1))
fooTbl
#    id ind1 ind2 ind3 ind4
# 1:  a    1    3    2    2
# 2:  b    2    4    3    1

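I already have two solutions. The first is much faster than the second, but it requires hard-coding every pairwise combination and checking each pair for inequality. That seems undesirable, and it becomes hard to maintain as the number of columns grows: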

fooTbl[, uniq := (ind1 != ind2 & ind1 != ind3 & ind1 != ind4 & ind2 != ind3 & ind2 != ind4 & ind3 != ind4)]
fooTbl
#    id ind1 ind2 ind3 ind4  uniq
# 1:  a    1    3    2    2 FALSE
# 2:  b    2    4    3    1  TRUE
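
To make the maintenance concern concrete, here is a small sketch that lists the pairs a hard-coded check has to cover and counts them. The variable name cols is only illustrative; combn and choose are base R.

```r
cols <- c('ind1', 'ind2', 'ind3', 'ind4')

# Each column of this character matrix is one pair that must be compared
pairs <- combn(cols, 2)
pairs

# Number of pairwise comparisons for 4 columns, and for 10 columns
nPairs4 <- choose(length(cols), 2)   # 6
nPairs10 <- choose(10, 2)            # 45
```

The count grows quadratically with the number of columns, which is why writing the comparisons out by hand stops being practical quickly.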

The second approach melts the data.table and operates on the long form of the table. It is easier to maintain (no hard-coded combinations), but slower:

fooTbl[, uniq := NULL]
fooTbl
#    id ind1 ind2 ind3 ind4
# 1:  a    1    3    2    2
# 2:  b    2    4    3    1
fooTbl = melt(fooTbl, measure=c('ind1', 'ind2', 'ind3', 'ind4'))
fooTbl
#    id variable value
# 1:  a     ind1     1
# 2:  b     ind1     2
# 3:  a     ind2     3
# 4:  b     ind2     4
# 5:  a     ind3     2
# 6:  b     ind3     3
# 7:  a     ind4     2
# 8:  b     ind4     1
fooTbl[, N := length(unique(value)), by=id]
fooTbl[, uniq := N == 4][, N := NULL]
fooTbl
#    id variable value  uniq
# 1:  a     ind1     1 FALSE
# 2:  b     ind1     2  TRUE
# 3:  a     ind2     3 FALSE
# 4:  b     ind2     4  TRUE
# 5:  a     ind3     2 FALSE
# 6:  b     ind3     3  TRUE
# 7:  a     ind4     2 FALSE
# 8:  b     ind4     1  TRUE
fooTbl = dcast(fooTbl, id + uniq ~ variable, value.var='value')
fooTbl
#   id  uniq ind1 ind2 ind3 ind4
# 1  a FALSE    1    3    2    2
# 2  b  TRUE    2    4    3    1
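
A possible variant of this long-form approach, shown only as a sketch: compute the flag on a melted copy and join it back, so the wide table never has to be rebuilt with dcast. The names wideTbl, longTbl and flags are illustrative, and like the code above it assumes id is unique per row. It still pays for the melt and the grouping, so it addresses tidiness rather than speed; I have not timed it.

```r
library(data.table)

wideTbl <- data.table(id = c('a', 'b'), ind1 = c(1, 2), ind2 = c(3, 4),
                      ind3 = c(2, 3), ind4 = c(2, 1))
ind <- c('ind1', 'ind2', 'ind3', 'ind4')

# Long copy, used only to compute one flag per id
longTbl <- melt(wideTbl, id.vars = 'id', measure.vars = ind)
flags <- longTbl[, .(uniq = uniqueN(value) == length(ind)), by = id]

# Update join: copy the flag onto the untouched wide table
wideTbl[flags, on = 'id', uniq := i.uniq]
wideTbl
```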

Is there a way to get the speed of the first (wide) solution without hard-coding every combination to check?

For my actual table, N is manageable (~3M), but large enough that the operations in the second solution become a noticeable burden.

- Clayton Stanley

You're right that you shouldn't melt or aggregate here. How about just constructing an expression and eval-ing it? - Arun

That is probably the right approach. Honestly, I had assumed I was just missing some basic R function for this. - Clayton Stanley

3 Answers

1

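I ended up going with @Arun's suggestion of building the expression programmatically and evaluating it. Here is a data.table-specific implementation. I had to resort to string manipulation (rather than working only with symbols via bquote), so it is not very clean, but it works.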

allColUniqExpr <- function(colNames, resColName) {
    makeExpr = function(x) sprintf('%s != %s', x[1], x[2])
    expr = apply(combn(colNames, 2), 2, makeExpr)
    expr = paste(expr, sep='', collapse=' & ')
    expr = sprintf('%s := %s', resColName, expr)
    expr = parse(text=expr)
    expr
}

Usage:

fooTbl[, eval(allColUniqExpr(c('ind1', 'ind2', 'ind3', 'ind4'), 'uniq'))]
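
As an alternative to parsing a string, the same pairwise comparisons can be sketched with combn, lapply and Reduce, all base R. The helper name allColsDistinct is hypothetical and not part of data.table; this is one way to do it under the assumption that the columns contain no NA values (a comparison involving NA yields NA rather than TRUE or FALSE).

```r
library(data.table)

# Hypothetical helper: TRUE for rows where all of `cols` hold distinct values
allColsDistinct <- function(dt, cols) {
    pairs <- combn(cols, 2, simplify = FALSE)          # list of column-name pairs
    checks <- lapply(pairs, function(p) dt[[p[1]]] != dt[[p[2]]])
    Reduce("&", checks)                                # elementwise AND of all checks
}

fooTbl <- data.table(id = c('a', 'b'), ind1 = c(1, 2), ind2 = c(3, 4),
                     ind3 = c(2, 3), ind4 = c(2, 1))
fooTbl[, uniq := allColsDistinct(fooTbl, c('ind1', 'ind2', 'ind3', 'ind4'))]
fooTbl$uniq
# [1] FALSE  TRUE
```

Each comparison is a single vectorised operation over whole columns, so the work done should resemble the hard-coded version, with the pairs generated instead of typed out.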

- Clayton Stanley

0

Here is another possibility:

fooTbl$uniq = apply(fooTbl[,2:ncol(fooTbl)],1,function(x) {any(duplicated(x))})

- BigFinger

- sds · Accepted Answer

This assumes that id uniquely identifies each row:

> (ind <- paste0("ind",1:4))
[1] "ind1" "ind2" "ind3" "ind4"
> fooTbl[,u := length(ind) == length(unique(unlist(.SD))),by="id", .SDcols = ind]

or

> fooTbl[,u := !any(duplicated(unlist(.SD))),by="id", .SDcols = ind]

or, without by:

> fooTbl[, u := apply(.SD,1,function(x) !any(duplicated(x))), .SDcols = ind]

Now:

> fooTbl
   id ind1 ind2 ind3 ind4     u
1:  a    1    3    2    2 FALSE
2:  b    2    4    3    1  TRUE
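
Since the question is about speed on a table with millions of rows, it may be worth checking on synthetic data that a row-wise version and a vectorised pairwise version agree, and timing both on your own machine. The sketch below is one way to do that; bigTbl, n, pairwise and rowwise are illustrative names, the data is random, and no timings are claimed here.

```r
library(data.table)

set.seed(1)
n <- 1e5
ind <- paste0('ind', 1:4)

# Synthetic table: random integers in 1..10 so that duplicates are common
bigTbl <- data.table(id = seq_len(n))
for (col in ind) set(bigTbl, j = col, value = sample.int(10L, n, replace = TRUE))

# Vectorised pairwise comparison of whole columns
tPairwise <- system.time(
    pairwise <- Reduce("&", lapply(combn(ind, 2, simplify = FALSE),
                                   function(p) bigTbl[[p[1]]] != bigTbl[[p[2]]]))
)

# Row-wise apply over the same columns
tRowwise <- system.time(
    rowwise <- bigTbl[, apply(.SD, 1, function(x) !any(duplicated(x))), .SDcols = ind]
)

# Both approaches should flag exactly the same rows
stopifnot(identical(pairwise, unname(rowwise)))
print(tPairwise)
print(tRowwise)
```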