Spark MLlib / K-Means Intuition

Question

scala apache-spark machine-learning k-means apache-spark-mllib

4

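I am very new to machine learning algorithms and to Spark. I am following the Twitter streaming language classifier found here: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html, specifically this code: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala. However, I am trying to run it in batch mode against some tweets pulled out of Cassandra, 200 tweets in total in this case. As the example shows, I am using this object to "vectorize" a set of tweets: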

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

object Utils {
  val numFeatures = 1000
  val tf = new HashingTF(numFeatures)

  /**
   * Create feature vectors by turning each tweet into bigrams of
   * characters (an n-gram model) and then hashing those to a
   * length-1000 feature vector that we can pass to MLlib.
   * This is a common way to decrease the number of features in a
   * model while still getting excellent accuracy (otherwise every
   * pair of Unicode characters would potentially be a feature).
   */
  def featurize(s: String): Vector = {
    tf.transform(s.sliding(2).toSeq)
  }
}
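
To make the comment above concrete, here is a minimal plain-Scala sketch of the same idea: split a string into character bigrams, then hash each bigram into a fixed-length count vector. The object name `HashingSketch` and the bucket function (`floorMod` of `hashCode`) are assumptions made for illustration; MLlib's `HashingTF` may hash terms differently and returns a sparse `Vector` rather than an array.

```scala
object HashingSketch {
  val numFeatures = 1000

  // Every window of two adjacent characters, e.g. "spark" -> sp, pa, ar, rk.
  def charBigrams(s: String): Seq[String] = s.sliding(2).toSeq

  // Simplified hashing trick: map each term to a bucket in [0, numFeatures)
  // and count how many terms land in each bucket.
  // Assumption: floorMod of hashCode as the bucket function (illustrative only).
  def hashedCounts(terms: Seq[String]): Array[Double] = {
    val counts = new Array[Double](numFeatures)
    terms.foreach { term =>
      counts(java.lang.Math.floorMod(term.hashCode, numFeatures)) += 1.0
    }
    counts
  }

  def main(args: Array[String]): Unit = {
    val bigrams = charBigrams("spark")
    println(bigrams.mkString(", "))
    val vec = hashedCounts(bigrams)
    println(s"vector length = ${vec.length}, total count = ${vec.sum}")
  }
}
```

Whatever the length of the input text, the resulting vector always has `numFeatures` entries. Distinct bigrams can collide in the same bucket, which is the trade-off the comment alludes to.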

This is my code, modified from ExamineAndTrain.scala:

 val noSets = rawTweets.map(set => set.mkString("\n"))

val vectors = noSets.map(Utils.featurize).cache()
vectors.count()

val numClusters = 5
val numIterations = 30

val model = KMeans.train(vectors, numClusters, numIterations)

  for (i <- 0 until numClusters) {
    println(s"\nCLUSTER $i")
    noSets.foreach {
        t => if (model.predict(Utils.featurize(t)) == 1) {
          println(t)
        }
      }
    }

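After this code runs, it prints "CLUSTER 0", "CLUSTER 1", and so on for each cluster, with nothing printed beneath any of them. If I flip

model.predict(Utils.featurize(t)) == 1

to

model.predict(Utils.featurize(t)) == 0

the same thing happens, except that every tweet is printed beneath every cluster.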

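Here is what I intuitively think is happening (please correct my thinking if it is wrong): this code turns each tweet into a vector, randomly picks some clusters, then runs kmeans to group the tweets (at a very high level, I assume the clusters would be common "topics"). As such, when it checks each tweet to see whether model.predict == 1, a different set of tweets should appear under each cluster (and because it is checking the training set against itself, every tweet should be in a cluster). Why isn't it doing this? Either my understanding of what kmeans does is wrong, my training set is too small, or I am missing a step.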

Any help is greatly appreciated.

- plamb

1 Answer


- uberwach · Accepted Answer

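First of all, KMeans is a clustering algorithm and as such unsupervised. So there is no "checking of the training set against itself" (well, you could do that manually). Your understanding is actually quite good; you just missed one point: model.predict(Utils.featurize(t)) gives you the cluster that KMeans assigned t to. I think you want to check model.predict(Utils.featurize(t)) == i in your code, since i iterates over all the cluster labels.

Also a small remark: the feature vector is built from a 2-gram model of the characters of each tweet. This intermediate step is important ;)

A 2-gram (for words) means: "A bear shouts at a bear" => {(A, bear), (bear, shouts), (shouts, at), (at, a), (a, bear)}, i.e. "a bear" is counted twice. With characters it would be (A, [space]), ([space], b), (b, e), and so on.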
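
The fix suggested above, comparing the predicted label against the loop variable `i` rather than a constant, can be sketched in plain Scala as follows. The names `ClusterReport`, `byCluster`, and `toyPredict` are hypothetical and only for illustration; `toyPredict` is a stand-in for `t => model.predict(Utils.featurize(t))` and is not a real K-Means model.

```scala
object ClusterReport {
  /** Groups items by the cluster index that `predict` assigns to each one. */
  def byCluster[A](items: Seq[A], numClusters: Int)(predict: A => Int): Seq[(Int, Seq[A])] = {
    // Predict once per item instead of once per (item, cluster) pair.
    val assigned = items.map(item => (predict(item), item))
    // Compare against the loop variable i, not a fixed label such as 1.
    (0 until numClusters).map { i =>
      i -> assigned.collect { case (cluster, item) if cluster == i => item }
    }
  }

  def main(args: Array[String]): Unit = {
    val tweets = Seq("hello world", "bonjour le monde", "hola mundo", "hello there")
    // Toy stand-in for model.predict(Utils.featurize(t)); NOT a real K-Means model.
    val toyPredict: String => Int = t => if (t.startsWith("h")) 0 else 1
    for ((i, members) <- byCluster(tweets, 2)(toyPredict)) {
      println(s"\nCLUSTER $i")
      members.foreach(println)
    }
  }
}
```

In the question's own loop, the smallest change with the same effect is replacing `== 1` with `== i`. With a trained model and a small data set such as the 200 tweets here, one option would be to collect the tweets to the driver and pass `t => model.predict(Utils.featurize(t))` as `predict`.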