高效的Jaccard相似度DocumentTermMatrix

9
我希望有一种方法可以高效地计算tm :: DocumentTermMatrix文件之间的Jaccard相似度。我可以使用slam包中的类似方法来计算余弦相似度,如此答案所示。我在CrossValidated上遇到了一个与R相关但关于矩阵代数而不是最有效路线的另一个问题和回答。我尝试使用更有效的slam函数实现该解决方案,但与将DTM强制转换为矩阵并使用proxy :: dist时获得的解决方案不同。 我该如何在R中高效地计算大型DocumentTermMatrix文件之间的Jaccard相似度?
#数据和软件包
library(Matrix);library(proxy);library(tm);library(slam);library(Matrix)

mat <- structure(list(i = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 
    2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), j = c(1L, 1L, 2L, 2L, 3L, 3L, 
    4L, 4L, 4L, 5L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), v = c(1, 
    1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1), nrow = 4L, 
        ncol = 12L, dimnames = structure(list(Docs = c("1", "2", 
        "3", "4"), Terms = c("computer", "is", "fun", "not", "too", 
        "no", "it's", "dumb", "what", "should", "we", "do")), .Names = c("Docs", 
        "Terms"))), .Names = c("i", "j", "v", "nrow", "ncol", "dimnames"
    ), class = c("DocumentTermMatrix", "simple_triplet_matrix"), weighting = c("term frequency", 
    "tf"))

#低效计算(期望输出)

proxy::dist(as.matrix(mat), method = 'jaccard')

##       1     2     3
## 2 0.000            
## 3 0.875 0.875      
## 4 1.000 1.000 1.000

#我的尝试

A <- slam::tcrossprod_simple_triplet_matrix(mat)
im <- which(A > 0, arr.ind=TRUE)
b <- slam::row_sums(mat)
Aim <- A[im]

stats::as.dist(Matrix::sparseMatrix(
      i = im[,1],
      j = im[,2],
      x = Aim / (b[im[,1]] + b[im[,2]] - Aim),
      dims = dim(A)
))

##     1   2   3
## 2 2.0        
## 3 0.1 0.1    
## 4 0.0 0.0 0.0

输出结果不匹配。

以下是原始文本供参考:

c("Computer is fun. Not too fun.", "Computer is fun. Not too fun.", 
    "No it's not, it's dumb.", "What should we do?")

我预期元素1和2之间的距离为0,元素3比元素1和4更接近元素1(我预期最远距离为0,因为没有共享单词),这在proxy::dist解决方案中可以看到。
编辑
请注意,即使在中等大小的DTM上,矩阵也会变得非常庞大。以下是使用vegan包的示例。注意需要4分钟来解决,而余弦相似性只需要约5秒钟。
library(qdap); library(quanteda);library(vegan);library(slam)
x <- quanteda::convert(quanteda::dfm(rep(pres_debates2012$dialogue), stem = FALSE, 
        verbose = FALSE, removeNumbers = FALSE), to = 'tm')


## <<DocumentTermMatrix (documents: 2912, terms: 3368)>>
## Non-/sparse entries: 37836/9769780
## Sparsity           : 100%
## Maximal term length: 16
## Weighting          : term frequency (tf)

tic <- Sys.time()
jaccard_dist_mat <- vegan::vegdist(as.matrix(x), method = 'jaccard')
Sys.time() - tic #Time difference of 4.01837 mins

tic <- Sys.time()
tdm <- t(x)
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
Sys.time() - tic #Time difference of 5.024992 secs
3个回答

4

Jaccard测量是用于衡量集合之间的测量标准,输入矩阵应为二进制第一行中提到:

## common values:
A = tcrossprod(m)

对于词袋模型的 DTM,这不是共同值的数量!

library(text2vec)
library(magrittr)
library(Matrix)

jaccard_similarity <- function(m) {
  A <- tcrossprod(m)
  im <- which(A > 0, arr.ind=TRUE, useNames = F)
  b <- rowSums(m)
  Aim <- A[im]
  sparseMatrix(
    i = im[,1],
    j = im[,2],
    x = Aim / (b[im[,1]] + b[im[,2]] - Aim),
    dims = dim(A)
  )
}

jaccard_distance <- function(m) {
  1 - jaccard_similarity(m)
}

cosine <- function(m) {
  m_normalized <- m / sqrt(rowSums(m ^ 2))
  tcrossprod(m_normalized)
}

基准测试:

data("movie_review")
tokens <- movie_review$review %>% tolower %>% word_tokenizer

dtm <- create_dtm(itoken(tokens), hash_vectorizer(hash_size = 2**16))
dim(dtm)
# 5000 65536

system.time(dmt_cos <- cosine(dtm))
# user  system elapsed 
#  2.524   0.169   2.693 

system.time( {
  dtm_binary <- transform_binary(dtm)
  # or simply
  # dtm_binary <- sign(dtm)
  dtm_jac <- jaccard_similarity(dtm_binary)  
})
#   user  system elapsed 
# 11.398   1.599  12.996
max(dtm_jac)
# 1
dim(dtm_jac)
# 5000 5000

EDIT 2016-07-01:

请参见来自text2vec 0.4的更快版本(如果不需要从dgCMatrix转换为dgTMatrix,速度提升约2.85倍,如果需要列主要的dgCMatrix,则提升约1.75倍)。

jaccard_dist_text2vec_04 <- function(x, y = NULL, format = 'dgCMatrix') {
  if (!inherits(x, 'sparseMatrix'))
    stop("at the moment jaccard distance defined only for sparse matrices")
  # union x
  rs_x = rowSums(x)
  if (is.null(y)) {
    # intersect x
    RESULT = tcrossprod(x)
    rs_y = rs_x
  } else {
    if (!inherits(y, 'sparseMatrix'))
      stop("at the moment jaccard distance defined only for sparse matrices")
    # intersect x y
    RESULT = tcrossprod(x, y)
    # union y
    rs_y = rowSums(y)
  }
  RESULT = as(RESULT, 'dgTMatrix')
  # add 1 to indices because of zero-based indices in sparse matrices
  # 1 - (...) because we calculate distance, not similarity
  RESULT@x <- 1 - RESULT@x / (rs_x[RESULT@i + 1L] + rs_y[RESULT@j + 1L] - RESULT@x)
  if (!inherits(RESULT, format))
    RESULT = as(RESULT, format)
  RESULT
}
system.time( {
   dtm_binary <- transform_binary(dtm)
   dtm_jac <-jaccard_dist(dtm_binary, format = 'dgTMatrix')
 })
 #  user  system elapsed 
 # 4.075   0.517   4.593  
system.time( {
   dtm_binary <- transform_binary(dtm)
   dtm_jac <-jaccard_dist(dtm_binary, format = 'dgCMatrix')
 })
 #  user  system elapsed 
 # 6.571   0.939   7.516

我不确定我理解了你的评论。我的回答到底有什么问题呢?它可以产生正确的Jaccard相似度,并且运行非常快。 - Dmitriy Selivanov
抱歉如果我的评论看起来太粗鲁了。已经调整了答案。 - Dmitriy Selivanov
非常感谢,PS:喜欢text2vec的新添加功能。 - Tyler Rinker
1
查看我的编辑。从即将发布的text2vec 0.4中获得更快的Jaccard距离。 - Dmitriy Selivanov

3

那么,vegan包中的vegdist()怎么样呢?它使用C代码编写,并且大约比proxy快10倍。

library(vegan)
vegdist(as.matrix(mat), method = 'jaccard')
##    1   2   3
## 2 0.0        
## 3 0.9 0.9    
## 4 1.0 1.0 1.0

library(microbenchmark)
matt <- as.matrix(mat)
microbenchmark(proxy::dist(matt, method = 'jaccard'),
               vegdist(matt, method = 'jaccard'))

## Unit: microseconds
##                                   expr      min        lq      mean
##  proxy::dist(matt, method = "jaccard") 4879.338 4995.2755 5133.9305
##      vegdist(matt, method = "jaccard")  587.935  633.2625  703.8335
##    median       uq      max neval
##  5069.203 5157.520 7549.346   100
##   671.466  723.569 1305.357   100

这是一个玩具示例。如果你将它扩展到更大的TermDocumentMatrix,素食主义者在速度上也会受到影响。请参见我在OP中的时间测量。 - Tyler Rinker

1

使用 stringdist 包中的 stringdistmatrix 函数,并使用 nthread 选项以并行方式运行,可以大大提高速度。平均比使用余弦相似性测试慢六秒。

library(qdap)
library(slam)
library(stringdist)
data(pres_debates2012)

x <- quanteda::convert(quanteda::dfm(rep(pres_debates2012$dialogue), stem = FALSE, 
                                     verbose = FALSE, removeNumbers = FALSE), to = 'tm')

tic <- Sys.time()
tdm <- t(x)
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
Sys.time() - tic #Time difference of 4.069233 secs

tic <- Sys.time()
t <- stringdistmatrix(pres_debates2012$dialogue, method = "jaccard", nthread = 4)
Sys.time() - tic #Time difference of 10.18158 secs

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接