How is term frequency calculated in TfidfVectorizer?

3
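I have tried hard to understand this, but I am stuck. I know that TfidfVectorizer applies l2 normalization to the term frequencies by default. This article explains the equations for it. I am using TfidfVectorizer on text written in Gujarati; details of the output follow.
My two documents are: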
ખુબ વખાણ કરે છે

ખુબ વધારે છે

The code I used is:
vectorizer = TfidfVectorizer(tokenizer=tokenize_words, sublinear_tf=True, use_idf=True, smooth_idf=False)

Here, tokenize_words is the function I use to tokenize the text. The TF-IDF matrix for my data is:

[[ 0.6088451   0.35959372  0.35959372  0.6088451   0.        ]
 [ 0.          0.45329466  0.45329466  0.          0.76749457]]

The feature list is:
['કરે', 'ખુબ', 'છે.', 'વખાણ', 'વધારે']

The idf values:

{'વખાણ': 1.6931471805599454, 'છે.': 1.0, 'કરે': 1.6931471805599454, 'વધારે': 1.6931471805599454, 'ખુબ': 1.0}

Could you explain what the term frequency of each word in each document should be in this example?

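You can refer to http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting. - Vivek Kumar
I have looked at that too. I still cannot reproduce the values after normalization. - Himadri
Please post the raw data for which you showed the TF-IDF here... it has two documents. - Vivek Kumar
@VivekKumar Thanks for the quick reply. I have updated my question with the text of the two documents. - Himadri
1 Answer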

4

OK, let's go step by step through the documentation I linked in the comments.

Documents:

`ખુબ વખાણ કરે છે
 ખુબ વધારે છે`
  1. Get all unique terms (features): ['કરે', 'ખુબ', 'છે.', 'વખાણ', 'વધારે']
  2. Calculate the frequency of each term in the documents:

    a. Each term present in document1 [ખુબ વખાણ કરે છે] is present once, and વધારે is not present.

    b. So the term frequency vector (sorted according to features): [1 1 1 1 0]

    c. Applying steps a and b on document2, we get [0 1 1 0 1]

    d. So our final term-frequency vector is [[1 1 1 1 0], [0 1 1 0 1]]

    Note: This is the term frequency you want

  3. Now find IDF (This is based on features, not on document basis):

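    idf(term) = log(number of documents / number of documents with this term) + 1   (natural log)

    The 1 is added so that terms occurring in every document, whose log term is 0, are not ignored entirely. Preventing zero divisions is the job of the "smooth_idf" parameter (True by default), which instead uses idf(term) = log((1 + number of documents) / (1 + number of documents with this term)) + 1. You set smooth_idf=False, so the unsmoothed formula above applies.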

    idf('કરે') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
    
    idf('ખુબ') = log(2/2)+1 = 0 + 1 = 1
    
    idf('છે.') = log(2/2)+1 = 0 + 1 = 1
    
    idf('વખાણ') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
    
    idf('વધારે') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
    

    Note: This corresponds to the idf values you showed in the question.

  4. Now calculate TF-IDF (This again is calculated document-wise, calculated according to sorting of features):

    a. For document1:

     For 'કરે', tf-idf = tf(કરે) x idf(કરે) = 1 x 1.69314 = 1.69314
    
     For 'ખુબ', tf-idf = tf(ખુબ) x idf(ખુબ) = 1 x 1 = 1
    
     For 'છે.', tf-idf = tf(છે.) x idf(છે.) = 1 x 1 = 1
    
     For 'વખાણ', tf-idf = tf(વખાણ) x idf(વખાણ) = 1 x 1.69314 = 1.69314
    
     For 'વધારે', tf-idf = tf(વધારે) x idf(વધારે) = 0 x 1.69314 = 0
    

    So for document1, the final tf-idf vector is [1.69314 1 1 1.69314 0]

    b. Now normalization is done (l2 Euclidean):

    divisor = sqrt(sqr(1.69314)+sqr(1)+sqr(1)+sqr(1.69314)+sqr(0))
             = sqrt(2.8667230596 + 1 + 1 + 2.8667230596 + 0)
             = sqrt(7.7334461192)
             = 2.7809074272977876...
    

    Dividing each element of the tf-idf array by the divisor, we get:

    [0.6088445 0.3595948 0.3595948 0.6088445 0]

    Note: Up to the rounding of idf to 1.69314, this is the tf-idf of the first document you posted in the question.

    c. Repeating steps a and b for document 2, we get:

    [ 0. 0.453294 0.453294 0. 0.767494]
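The four steps above can be reproduced with a short plain-Python sketch. It is not scikit-learn's implementation, only the same arithmetic written out by hand. Whitespace splitting is an assumption standing in for the asker's `tokenize_words`, which is not shown in the question, so the token 'છે' appears here without the trailing period seen in the posted feature list.

```python
import math

# The two documents from the question.
docs = ["ખુબ વખાણ કરે છે", "ખુબ વધારે છે"]

# Assumption: whitespace splitting stands in for the asker's tokenize_words.
tokenized = [doc.split() for doc in docs]

# Step 1: unique terms, sorted (this is the column order of the matrix).
features = sorted({term for doc in tokenized for term in doc})

# Step 2: raw term counts per document.
tf = [[doc.count(term) for term in features] for doc in tokenized]

# Step 3: unsmoothed idf, i.e. ln(n_docs / doc_freq) + 1 (smooth_idf=False).
n_docs = len(tokenized)
doc_freq = [sum(1 for doc in tokenized if term in doc) for term in features]
idf = [math.log(n_docs / df) + 1 for df in doc_freq]


def l2_normalize(vec):
    """Divide a vector by its Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


# Step 4: multiply tf by idf term by term, then l2-normalize each row.
tfidf = [l2_normalize([t * i for t, i in zip(row, idf)]) for row in tf]

print(features)
print(tf)
print(idf)
for row in tfidf:
    print(row)
```

Because the idf values are kept at full precision here instead of being truncated to 1.69314, the rows should agree with the matrix posted in the question to about six decimal places.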

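Update: regarding sublinear_tf = True or False

Your original term frequency vector is [[1 1 1 1 0], [0 1 1 0 1]], and you are right that sublinear_tf = True changes the term frequency vector:

new_tf = 1 + log(tf)

This formula applies only to the non-zero elements of the term frequency vector, because log(0) is undefined. Here every non-zero entry is 1, and log(1) is 0, so 1 + log(1) = 1 + 0 = 1: elements equal to 1 keep their value. So new_tf = [[1 1 1 1 0], [0 1 1 0 1]] = tf (original). sublinear_tf is applied to your term frequencies, but the values stay the same. Hence all the subsequent calculations, and the output, are identical whether you use sublinear_tf=True or sublinear_tf=False. If you change the documents so that the term frequency vector contains elements other than 1 and 0, sublinear_tf will make a difference. I hope that clears up your doubt.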
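A minimal sketch of that scaling, assuming the 1 + ln(tf) rule described above applied to non-zero counts only. The helper name `sublinear` and the count of 3 are hypothetical, chosen only to show a case where the value does change.

```python
import math


def sublinear(count):
    """Return 1 + ln(count) for a non-zero count; zero counts stay zero."""
    return 1 + math.log(count) if count > 0 else 0.0


# Term counts from the walkthrough: only 0s and 1s.
counts = [[1, 1, 1, 1, 0], [0, 1, 1, 0, 1]]
scaled = [[sublinear(c) for c in row] for row in counts]
print(scaled)  # same values as counts, since 1 + ln(1) = 1

# Hypothetical count of 3: here the scaling does change the value.
print(sublinear(3))
```

A term that occurred three times in a document would get a weight of 1 + ln(3), roughly 2.1, instead of 3, which is the damping effect that sublinear_tf is meant to provide.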

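Thanks. But I have one doubt. I have set sublinear_tf = True. That means tf will be computed as 1 + log(tf). Is that correct? - Himadri
Please post the entire code you used to find the tfidf. - Vivek Kumar
Did you use smooth_idf=False? - Vivek Kumar
Yes, smooth_idf=False. That part is clear. I am only confused about sublinear_tf: whether it is True or False, the values do not change. I have added a line of code to my question. - Himadri
Oh, how silly of me. It is obvious. My mistake. Thanks, Vivek. That resolves my doubt. - Himadri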
