在R中使用余弦距离进行层次聚类

Question

在R中使用余弦距离进行层次聚类

rtmhierarchical-clustering

3

我想使用R编程语言对文档语料库进行余弦相似度聚类，但是我遇到了以下错误：

Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") : missing value where TRUE/FALSE needed

我该怎么办？

为了重现这个错误，这里有一个例子：

library(tm)
doc <- c( "The sky is blue.", "The sun is bright today.", "The sun in the sky is bright.", "We can see the shining sun, the bright sun." )
doc_corpus <- Corpus( VectorSource(doc) )
control_list <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
tdm <- TermDocumentMatrix(doc_corpus, control = control_list)



tf <- as.matrix(tdm)
( idf <- log( ncol(tf) / ( 1 + rowSums(tf != 0) ) ) )
( idf <- diag(idf) )
tf_idf <- crossprod(tf, idf)
colnames(tf_idf) <- rownames(tf)

tf_idf

cosine_dist = 1-crossprod(tf_idf) /(sqrt(colSums(tf_idf^2)%*%t(colSums(tf_idf^2))))
cluster1 <- hclust(cosine_dist, method = "ward.D")

然后我收到了这个错误：

如果（is.na（n）|| n > 65536L），则出错：停止（“大小不能为NA且不超过65536”）：需要TRUE / FALSE的缺失值

- noura salah

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- phiver · Accepted Answer

有两个问题：

1: cosine_dist = 1-crossprod(tf_idf) /(sqrt(colSums(tf_idf^2)%*%t(colSums(tf_idf^2)))) 会产生 NaN，因为你除以了0。

2: hclust 需要一个距离对象，而不仅仅是矩阵。详见?hclust。

这两个问题可以通过以下代码解决：

.....
cosine_dist = 1-crossprod(tf_idf) /(sqrt(colSums(tf_idf^2)%*%t(colSums(tf_idf^2))))

# remove NaN's by 0
cosine_dist[is.na(cosine_dist)] <- 0

# create dist object
cosine_dist <- as.dist(cosine_dist)

cluster1 <- hclust(cosine_dist, method = "ward.D")

plot(cluster1)