How to get the words corresponding to the highest tf-idf using PySpark?

3

I have seen similar posts, but never a complete answer, so I am posting here.

I am using TF-IDF in Spark to find the word in a document with the highest tf-idf value. I use the following code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover

tokenizer = Tokenizer(inputCol="doc_cln", outputCol="tokens")

remover1 = StopWordsRemover(inputCol="tokens",
                            outputCol="stopWordsRemovedTokens")

stopwordList = ["word1", "word2", "word3"]

remover2 = StopWordsRemover(inputCol="stopWordsRemovedTokens",
                            outputCol="filtered", stopWords=stopwordList)

hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=2000)

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)

pipeline = Pipeline(stages=[tokenizer, remover1, remover2, hashingTF, idf])

model = pipeline.fit(df)
results = model.transform(df)
results.cache()

The results I get look like

|[a8g4i9g5y, hwcdn] |(2000,[905,1104],[7.34977707433047,7.076179741760428]) 

where

filtered: array (nullable = true)
features: vector (nullable = true)

How can I extract the array from "features"? Ideally, I want to get the word corresponding to the highest tf-idf, like this:
|a8g4i9g5y|7.34977707433047

Thanks in advance!


1
Correct me if I'm wrong: you cannot assume that the word a8g4i9g5y is associated with feature 905 and therefore has the tf-idf value 7.34977707433047. The hashing process does not necessarily preserve the order of the words in that particular sentence. You can only be sure that one of a8g4i9g5y and hwcdn is represented by column 905 and the other by 1104. - ldavid
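
To illustrate the comment's point: HashingTF keeps no reverse mapping from vector index to word. If recovering the word matters, one option is to swap HashingTF for CountVectorizer (already imported in the question), whose fitted model does keep an index-to-word vocabulary. A minimal, untested sketch reusing the question's stage variables (the pipeline_cv and model_cv names are made up here):

from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer

# CountVectorizer in place of HashingTF; vocabSize mirrors numFeatures=2000
cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures", vocabSize=2000)
pipeline_cv = Pipeline(stages=[tokenizer, remover1, remover2, cv, idf])
model_cv = pipeline_cv.fit(df)

# The fitted CountVectorizerModel is stage 3 of the PipelineModel;
# vocabulary[i] is the word stored at vector index i
vocab = model_cv.stages[3].vocabulary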
1 Answer

7
  1. Your features column has type Vector from the pyspark.ml.linalg package. It can be either

    1. pyspark.ml.linalg.DenseVector (source), e.g. DenseVector([1., 2.])
    2. pyspark.ml.linalg.SparseVector (source), e.g. SparseVector(4, [1, 3], [3.0, 4.0])
  2. Based on the data you have, (2000,[905,1104],[7.34977707433047,7.076179741760428]), it is apparently a SparseVector, which breaks down into 3 main components (a runnable illustration follows the code below):

    • size: 2000
    • indices: [905,1104]
    • values: [7.34977707433047,7.076179741760428]
  3. What you're looking for is the values property of that vector (a second sketch after the code goes one step further and extracts the maximum).

  4. With 'literal' PySpark SQL types such as StringType or IntegerType, you can access their properties (and aggregation functions) using the SQL functions package (docs). However, Vector is not a literal SQL type, and the only way to access its properties is through a UDF, like so:

    # Important: `vector.values` returns an ndarray from numpy.
    # PySpark doesn't understand ndarray, so convert it to a normal
    # Python list using `tolist()`
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    def extract_values_from_vector(vector):
        return vector.values.tolist()

    # A regular UDF returning an array of doubles
    extract_values_from_vector_udf = udf(extract_values_from_vector, ArrayType(DoubleType()))

    # And use that UDF to get your values
    results.select(extract_values_from_vector_udf('features'), 'features')
    
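As a quick illustration of the SparseVector components in point 2, the vector from the question can be rebuilt locally; this part only needs pyspark installed, not a running Spark session:

from pyspark.ml.linalg import SparseVector

v = SparseVector(2000, [905, 1104], [7.34977707433047, 7.076179741760428])
v.size     # 2000
v.indices  # array([ 905, 1104], dtype=int32)
v.values   # array([7.34977707, 7.07617974])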
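And since the question ultimately asks for the highest tf-idf per document, here is a hedged follow-up sketch along the same UDF lines; the max_tfidf name and the struct fields are invented for illustration. As the comment under the question points out, with HashingTF the returned index alone does not identify a word, but with the CountVectorizer variant sketched earlier it would simply be vocabulary[index]:

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

def max_tfidf(vector):
    # Empty vectors have no maximum
    if vector.values.size == 0:
        return None
    i = int(vector.values.argmax())
    # Pair the winning vector index with its tf-idf value
    return (int(vector.indices[i]), float(vector.values[i]))

max_tfidf_udf = udf(max_tfidf, StructType([
    StructField("index", IntegerType()),
    StructField("value", DoubleType()),
]))

results.select(max_tfidf_udf('features').alias('max_tfidf'), 'features')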
