Python - calculating a co-occurrence matrix

Question

9

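I'm working on an NLP task and need to compute the co-occurrence matrix over documents. The basic formulation is as follows:

I have a matrix of shape (n, length), where each row represents a sentence made up of length words. All sentences have the same length, so there are n sentences in total. Given a defined context size (e.g. window_size=5), I want to compute the co-occurrence matrix D, where the entry in row c and column w is #(w,c), the number of times the context word c appears in the context of w.

An example can be found here: How to calculate the co-occurrence between two words in a window of text? I know it can be computed with nested loops, but I want to know whether a simple way or a ready-made function exists. I have found some answers, but they do not slide a window across the sentence. For example: word-word co-occurrence matrix. So is there a function in Python that handles this problem concisely? I think this task is quite common in NLP.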

- GEORGE GUO

2 Answers

0

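I computed the co-occurrence matrix with a window size of 2.

First, write a function that returns the neighbouring words correctly (I use GetContext here).
Then create the matrix and add 1 whenever a given word appears among the neighbours.

Here is the Python code: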

import numpy as np

CORPUS = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]

# Words to count. To use every word in the corpus instead:
# top2000 = list(set(' '.join(CORPUS).split(' ')))
top2000 = ["abc", "pqr", "def"]
a = np.zeros((len(top2000), len(top2000)), np.int32)

# GetContext is defined further down; define it before running this loop.
for sentence in CORPUS:
    for index, word in enumerate(sentence.split(' ')):
        if word in top2000:
            context = GetContext(sentence, index)
            for word2 in context:
                if word2 in top2000:
                    # Row = target word, column = neighbouring word
                    a[top2000.index(word)][top2000.index(word2)] += 1
print(a)

The context function:

def GetContext(sentence, index, window=2):
    """Return up to `window` words on each side of words[index]."""
    words = sentence.split(' ')
    # Slicing clips at the sentence boundaries, so no special cases
    # are needed for the first or last positions.
    left = words[max(index - window, 0):index]
    right = words[index + 1:index + 1 + window]
    return left + right

Here is the result:

array([[0, 3, 3],
       [3, 0, 2],
       [3, 2, 0]])
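
The same idea can be extended to any window size without a fixed word list. The following is only a sketch, assuming whitespace tokenization; the name cooccurrence_counts and its vocab parameter are made up for illustration and do not come from any library. It stores counts in a Counter keyed by (word, context word) pairs rather than in a dense array.

```python
from collections import Counter

def cooccurrence_counts(corpus, window=2, vocab=None):
    """Count (word, context_word) pairs within `window` tokens on each side.

    Hypothetical helper: `vocab` optionally restricts which words are counted.
    """
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, word in enumerate(words):
            if vocab is not None and word not in vocab:
                continue
            # Neighbours to the left and right, clipped at the sentence edges.
            context = words[max(i - window, 0):i] + words[i + 1:i + 1 + window]
            for other in context:
                if vocab is None or other in vocab:
                    counts[(word, other)] += 1
    return counts

corpus = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]
counts = cooccurrence_counts(corpus, window=2, vocab={"abc", "pqr", "def"})
print(counts[("abc", "pqr")], counts[("pqr", "def")])
```

A Counter returns 0 for pairs that never occur, so it can be read like a sparse matrix; converting it to a NumPy array only needs a word-to-index mapping.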

- Shrinivas Ambiger


- Zealseeker · Accepted Answer

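I don't think this is very complicated. Why not write a function yourself? First, get the matrix X by following this tutorial: http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage (the code below treats each row of X as one sentence of integer word indices). Then, for each sentence, compute the co-occurrences and add them to an accumulator.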

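import numpy as np

window = 5  # context size on each side; set this to your window_size
# length must be the number of distinct words; X holds sentences of integer word indices
m = np.zeros([length, length])
def cal_occ(sentence, m):
    for i, word in enumerate(sentence):
        # +1 lets the window reach i + window; clip at the end of the sentence
        for j in range(max(i - window, 0), min(i + window + 1, len(sentence))):
            if j != i:  # do not count a word as its own context
                m[word, sentence[j]] += 1
for sentence in X:
    cal_occ(sentence, m)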
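
Since the question describes the input as an (n, length) array, the inner loops can also be replaced by array slicing. This is a sketch under assumptions, not part of the original answer: the function name cooccurrence_matrix and the parameter vocab_size are invented here, and X is assumed to contain integer word indices in the range 0 to vocab_size - 1. For each offset up to the window size, it pairs every position with the position that many steps to its right and counts the pair in both directions.

```python
import numpy as np

def cooccurrence_matrix(X, vocab_size, window=5):
    """X: integer array of shape (n, length) holding word indices (assumed)."""
    X = np.asarray(X)
    D = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    length = X.shape[1]
    for offset in range(1, min(window, length - 1) + 1):
        left = X[:, :-offset].ravel()   # word at position i
        right = X[:, offset:].ravel()   # word at position i + offset
        # np.add.at accumulates correctly when the same index pair repeats,
        # which plain fancy-index assignment (D[left, right] += 1) does not.
        np.add.at(D, (left, right), 1)
        np.add.at(D, (right, left), 1)
    return D

X = np.array([[0, 1, 2],
              [1, 1, 0]])
D = cooccurrence_matrix(X, vocab_size=3, window=1)
print(D)
```

With a symmetric window the result is a symmetric matrix, so the row and column convention (#(w,c) in row c, column w) does not change the values.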