NLP, spaCy: Strategies for improving document similarity


Background: I have a collection of automatically transcribed lecture texts, and I want to compare them on content (e.g. what they are talking about) for clustering and recommendation. I'm very new to NLP.


Data: The data I'm working with is available here. For the lazy:

git clone https://github.com/TMorville/transcribed_data

Here's a snippet to get it into a df:

import os, json
import pandas as pd


def td_to_df():

    path_to_json = '#FILL OUT PATH'
    json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('td.json')]

    tddata = pd.DataFrame(columns=['trans', 'confidence'])

    for index, js in enumerate(json_files):
        with open(os.path.join(path_to_json, js)) as json_file:
            # pd.json_normalize replaces the deprecated pandas.io.json.json_normalize import
            json_text = pd.json_normalize(json.load(json_file))

            # assign with .loc[row, column] so new rows are actually added to the frame
            tddata.loc[index, 'trans'] = str(json_text['trans'][0])
            tddata.loc[index, 'confidence'] = str(json_text['confidence'][0])

    return tddata

Approach: So far I have only used the spaCy package for its "out of the box" similarity. I simply apply the nlp model to the whole body of each transcript and compare it against all the others.

import spacy


def similarity_get():

    tddata = td_to_df()

    nlp = spacy.load('en_core_web_lg')

    # use the first transcript as the baseline and compare it to every transcript
    baseline = nlp(tddata.trans[0])

    for text in tddata.trans:
        print(baseline.similarity(nlp(text)))

Problem: Almost all of the similarities come out greater than 0.95, more or less regardless of which transcript is used as the baseline. Given the complete lack of preprocessing, that is perhaps not surprising.


Solution strategy: Following the advice in this post, I would like to do the following (using spaCy where possible): 1) Remove stop words. 2) Remove the most common words. 3) Merge word pairs. 4) Possibly use Doc2Vec outside of spaCy.


Question: Is the above strategy feasible? If not, what is missing? If it is, how much of this is already done under the hood by the pretrained model loaded with nlp = spacy.load('en_core_web_lg')?

I can't seem to find documentation that spells out exactly what these models do or how to configure them. A quick Google search turns up nothing, and even the otherwise very good API documentation doesn't seem to help. Maybe I'm looking in the wrong place?
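One way to check how much the loaded model already does is to inspect the pipeline object itself; a minimal sketch using standard spaCy attributes (the exact component names depend on the spaCy version):

import spacy

nlp = spacy.load('en_core_web_lg')

# components that run when you call nlp(text)
# (e.g. ['tagger', 'parser', 'ner'] for the v2.x models; newer versions list more)
print(nlp.pipe_names)

# the bundled word vectors: (number of vectors, vector width)
print(nlp.vocab.vectors.shape)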


Regarding your question, using a doc2vec-like approach seems reasonable. These methods usually rely on word embeddings produced by word2vec or GloVe. So the answer to your last question is that by loading nlp = spacy.load('en_core_web_lg') you load the word vectors that would be used for doc2vec. - Ali Zarezade
1 Answer

You can do most of that with SpaCY and some regexes.
So, take a look at the SpaCY API in the documentation.
The basic steps in any NLP pipeline are the following:
  1. Language detection (self explanatory: if you're working with some dataset, you know what the language is and you can adapt your pipeline to that). Once you know the language, you have to download the correct model from SpaCY. The instructions are here. Let's use English for this example. In your command line just type python -m spacy download en and then import it into the preprocessing script like this:

    import spacy
    nlp = spacy.load('en')
    
  2. Tokenization - this is the process of splitting the text into words. It's not enough to just do text.split() (e.g. there's would be treated as a single word, but it's actually two words, there and is; see the short sketch below). So here we use Tokenizers. In SpaCy you can do something like:

    doc = nlp(text)
    

where text is your dataset corpus or a sample from the dataset. You can read more about the Doc instance here.
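To make the contraction point concrete, here is a tiny sketch of what the tokenizer produces (the sentence is just an illustration; the model load is repeated so the snippet runs on its own, and the exact split can vary between model versions):

    import spacy

    nlp = spacy.load('en')

    doc = nlp("there's a transcript here")
    print([token.text for token in doc])
    # the contraction is split into separate tokens,
    # typically ['there', "'s", 'a', 'transcript', 'here']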

  3. Punctuation removal - pretty self explanatory; the tokenization in the previous step already separates punctuation into its own tokens. To remove punctuation, just type:

    import re

    # removing punctuation tokens
    text_no_punct = [token.text for token in doc if not token.is_punct]

    # remove punctuation characters left inside word strings like 'bye!' -> 'bye'
    # (text_no_punct already holds plain strings, so apply the regex to each string directly)
    REPLACE_PUNCT = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])")
    text_no_punct = [REPLACE_PUNCT.sub("", tok) for tok in text_no_punct]
    
  4. POS tagging - short for Part-Of-Speech tagging. It is the process of marking up a word in a text as corresponding to a particular part of speech. For example:

    A/DT Part-Of-Speech/NNP Tagger/NNP is/VBZ a/DT piece/NN of/IN
    software/NN that/WDT reads/VBZ text/NN in/IN some/DT
    language/NN and/CC assigns/VBZ parts/NNS of/IN speech/NN to/TO
    each/DT word/NN ,/, such/JJ as/IN noun/NN ,/, verb/NN ,/,
    adjective/NN ,/, etc./FW./.
    

The uppercase codes after the slashes are standard part-of-speech tags. A list of the tags can be found here.

In SpaCy, this is already done for you by putting the text through the nlp instance. You can get the tags with:

    for token in doc:
        print(token.text, token.tag_)

  5. Morphological processing: lemmatization - it's a process of transforming the words into a linguistically valid base form, called the lemma:

    nouns → singular nominative form
    verbs → infinitive form
    adjectives → singular, nominative, masculine, indefinitive, positive form
    
In SpaCy, this is already done for you by putting the text through the nlp instance. You can get the lemma of each word with:

    for token in doc:
        print(token.text, token.lemma_)
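As a quick illustration of the output, reusing the nlp object loaded above (the sentence is invented; exact lemmas depend on the model version):

    doc = nlp("The bats were hanging from the ceiling")
    for token in doc:
        print(token.text, token.lemma_)
    # roughly: bats -> bat, were -> be, hanging -> hang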

  6. Removing stopwords - stopwords are the words that are not bringing any new information or meaning to the sentence and can be omitted. You guessed it, this is also already done for you by the nlp instance. To filter out the stopwords just type:

    text_without_stopwords = [token.text for token in doc if not token.is_stop]
    doc = nlp(' '.join(text_without_stopwords))
    
And now you have a clean dataset. You can create word vectors from word2vec or GloVe pretrained models and feed your data into some model. Alternatively, you can use TF-IDF to create the word vectors, removing the most common words in the process. Also, contrary to the usual process, you may want to keep the most specific words, since your task is to better differentiate between two texts. I hope this is clear enough :)
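As a concrete sketch of the TF-IDF route (using scikit-learn, an extra dependency not mentioned above, and assuming the tddata frame from the question holds the cleaned transcripts; the parameter values are only examples):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # texts of the cleaned transcripts, e.g. from td_to_df()
    texts = tddata.trans.tolist()

    # max_df drops words that appear in most transcripts, i.e. the most common ones,
    # which makes the similarities more discriminative than plain averaged word vectors
    vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8)
    tfidf = vectorizer.fit_transform(texts)

    # pairwise cosine similarities between all transcripts
    sim_matrix = cosine_similarity(tfidf)
    print(sim_matrix[0])  # similarities of the first transcript to all the others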
