Merging related words in NLP

23

I would like to define a new word that incorporates the count values from two (or more) different words. For example:

    Words      Frequency
0   mom        250
1   2020       151
2   the        124
3   19         82
4   mother     81
...            ...
10  London     6
11  life       6
12  something  6

I would like to define mother as mom + mother:

    Words      Frequency
0   mother     331
1   2020       151
2   the        124
3   19         82
...            ...
9   London     6
10  life       6
11  something  6

This would be an alternative way of defining groups of words that carry some shared meaning (at least for my purposes).
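For reference, the merging step itself is straightforward once the groups are known. A minimal sketch, assuming the table above lives in a pandas DataFrame named df and the grouping is supplied as a hand-made mapping (both names are illustrative, not part of the question):

import pandas as pd

# illustrative data matching the top of the table above
df = pd.DataFrame({'Words': ['mom', '2020', 'the', '19', 'mother'],
                   'Frequency': [250, 151, 124, 82, 81]})

# hand-made mapping from each variant to its canonical word
canonical = {'mom': 'mother'}

# replace variants, then sum the counts per canonical word
df['Words'] = df['Words'].map(lambda w: canonical.get(w, w))
merged = (df.groupby('Words', as_index=False)['Frequency']
            .sum()
            .sort_values('Frequency', ascending=False, ignore_index=True))
print(merged)  # 'mother' now has 250 + 81 = 331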

Any suggestions are welcome.


1
Thanks Chandan. However, it doesn't seem very accurate. For example, if I search for teaching, teacher won't be included among the synonyms (or only as a similar word). - user13623188
3
Teacher is a noun and teaching is a verb; lecturer and teacher are synonyms, and teaching and lecturing could be considered synonyms. In any case, try http://bionlp-www.utu.fi/wv_demo/, which uses word2vec similarity to find similar words. Another option is WordNet. - Adnan S
2
I retitled this to "Merging related words in NLP", since that seems to be your intent. The actual merge operation (on a dict, Counter, or CountVectorizer) is the trivial part; as you say, the hard part is inferring which words are related, by looking them up in some knowledge base/thesaurus, using word2vec similarity, etc. - smci
3
Following up on @smci: you first need to define what you mean by "related words". Word2vec (or other word-embedding techniques) can give you "similar" words, but the word pairs it proposes may be far from what you actually need. - Jason Angel
1
@Val I posted some new content in my answer. - Life is complex
6 Answers

13

Update: October 21, 2020

I decided to develop a Python module to handle the tasks outlined in this answer. The module is named wordhoard and can be downloaded from pypi.
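For completeness, a hedged usage sketch based on my reading of the wordhoard documentation; the Synonyms class and find_synonyms method are taken from those docs and may differ between releases, so check the pypi page before relying on them:

from wordhoard import Synonyms

# look up the synonyms for a single keyword
synonym = Synonyms(search_string='mother')
results = synonym.find_synonyms()
print(results)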


I had tried to use Word2vec and WordNet in projects where I needed to determine the frequency of a keyword (e.g. "healthcare") and its synonyms (e.g. "wellness program", "preventive medicine"). I found that most NLP libraries didn't produce the results I needed, so I decided to build my own dictionary with custom keywords and synonyms. This approach has worked for analyzing and classifying text in multiple projects.

I'm sure someone well versed in NLP techniques might have a more robust solution, but the one below has worked for me time and time again.

I wrote my answer to match the word-frequency data in your question, but it can be modified to use any keyword and synonym dataset.

import string

# Python Dictionary
# I manually created these word relationships - primary_word:synonyms
word_relationship = {"father": ['dad', 'daddy', 'old man', 'pa', 'pappy', 'papa', 'pop'],
          "mother": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

# This input text is from various poems about mothers and fathers
input_text = 'The hand that rocks the cradle also makes the house a home. It is the prayers of the mother ' \
         'that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of ' \
         'her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She ' \
         'has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the ' \
         'greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend, ' \
         'This to me you have always been. Through the good times and the bad, Your understanding I have had.'

# converts the input text to lowercase and splits the words based on empty space.
wordlist = input_text.lower().split()

# remove all punctuation from the wordlist
remove_punctuation = [''.join(ch for ch in s if ch not in string.punctuation)
                      for s in wordlist]

# list for word frequencies
wordfreq = []

# count the frequencies of a word
for w in remove_punctuation:
    wordfreq.append(remove_punctuation.count(w))

word_frequencies = (dict(zip(remove_punctuation, wordfreq)))

word_matches = []

# loop through the dictionaries
for word, frequency in word_frequencies.items():
   for keyword, synonym in word_relationship.items():
      match = [x for x in synonym if word == x]
      if word == keyword or match:
        match = ' '.join(map(str, match))
        # append the keywords (mother), synonyms(mom) and frequencies to a list
        word_matches.append([keyword, match, frequency])

# used to hold the final keyword and frequencies
final_results = {}

# list comprehension to obtain the primary keyword and its frequencies
synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# iterate synonym_matches and output total frequency count for a specific keyword
for item in synonym_matches:
  if item[0] not in final_results:
    final_results[item[0]] = item[1]
  else:
    final_results[item[0]] = final_results[item[0]] + item[1]

print(final_results)
# output
{'mother': 3, 'father': 2}

Other approaches

Below are some other approaches and their out-of-the-box output.


NLTK WORDNET

In this example I looked up the synonyms for the word "mother". Note that WordNet does not associate the synonyms "mom" or "mum" with "mother", yet both words appear in my sample text. Also note that "father" is listed as a synonym for "mother".

from nltk.corpus import wordnet

synonyms = []
word = 'mother'
for synonym in wordnet.synsets(word):
   for item in synonym.lemmas():
      if word != synonym.name() and len(synonym.lemma_names()) > 1:
        synonyms.append(item.name())

print(synonyms)
['mother', 'female_parent', 'mother', 'fuss', 'overprotect', 'beget', 'get', 'engender', 'father', 'mother', 'sire', 'generate', 'bring_forth']

PyDictionary

Using PyDictionary, which queries synonym.com, I looked up the synonyms for the word "mother". In this example the synonyms include "mom" and "mum", and the example also lists additional synonyms that WordNet did not generate.

However, PyDictionary also produced a synonym list for "mum", which has nothing to do with the word "mother". It seems that PyDictionary pulled this list from the adjective section of the page instead of the noun section; it is hard for a computer to distinguish between the adjective "mum" and the noun "mum".

from PyDictionary import PyDictionary
dictionary_mother = PyDictionary('mother')

print(dictionary_mother.getSynonyms())
# output 
[{'mother': ['mother-in-law', 'female parent', 'supermom', 'mum', 'parent', 'mom', 'momma', 'para I', 'mama', 'mummy', 'quadripara', 'mommy', 'quintipara', 'ma', 'puerpera', 'surrogate mother', 'mater', 'primipara', 'mammy', 'mamma']}]

dictionary_mum = PyDictionary('mum')

print(dictionary_mum.getSynonyms())
# output 
[{'mum': ['incommunicative', 'silent', 'uncommunicative']}]
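One partial mitigation, added here as a hedged sketch of my own rather than part of the original approach, is to part-of-speech tag the surrounding sentence first and only send tokens tagged as nouns to PyDictionary; nltk's pos_tag is used below:

import nltk

# one-time downloads; resource names can vary across nltk releases
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = 'When I think about my mum, I just cannot help but smile.'
tokens = nltk.word_tokenize(sentence)

# keep only tokens whose tag starts with NN (nouns)
nouns = [word for word, tag in nltk.pos_tag(tokens) if tag.startswith('NN')]
print(nouns)  # 'mum' should be tagged as a noun here, making it safe to look up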

Other potential approaches include using the Oxford Dictionaries API or querying thesaurus.com. Both of these also have pitfalls: the Oxford Dictionaries API requires an API key and a paid subscription based on query volume, and thesaurus.com misses potential synonyms that could be useful when grouping words.

https://www.thesaurus.com/browse/mother
synonyms: mom, parent, ancestor, creator, mommy, origin, predecessor, progenitor, source, child-bearer, forebearer, procreator

Update

Creating a precise list of synonyms for every potential word in your corpus is hard and requires a multi-pronged approach. The code below uses WordNet and PyDictionary to build a superset of synonyms. Like all the other answers, this combined approach also leads to some over-counting of word frequencies. I've been trying to reduce this over-counting by merging key and value pairs within my final dictionary of synonyms; that turned out to be much harder than I expected and might require me to open a question of my own to solve. In the end, I think that based on your use case you need to determine which approach works best, and you will likely need to combine several of them.

Thank you for posting this question, because it made me look at other methods for solving a complex problem.

from string import punctuation
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from PyDictionary import PyDictionary

input_text = """The hand that rocks the cradle also makes the house a home. It is the prayers of the mother
         that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of
         her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She
         has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the
         greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend,
         This to me you have always been. Through the good times and the bad, Your understanding I have had."""


def normalize_textual_information(text):
   # split text into tokens by white space
   token = text.split()

   # remove punctuation from each token
   table = str.maketrans('', '', punctuation)
   token = [word.translate(table) for word in token]

   # remove any tokens that are not alphabetic
   token = [word.lower() for word in token if word.isalpha()]

   # filter out English stop words
   stop_words = set(stopwords.words('english'))

   # you could add additional stops like this
   stop_words.add('cannot')
   stop_words.add('could')
   stop_words.add('would')

   token = [word for word in token if word not in stop_words]

   # filter out any short tokens
   token = [word for word in token if len(word) > 1]
   return token


def generate_word_frequencies(words):
   # list to hold word frequencies
   word_frequencies = []

   # loop through the tokens and generate a word count for each token
   for word in words:
      word_frequencies.append(words.count(word))

   # aggregates the words and word_frequencies into tuples and coverts them into a dictionary
   word_frequencies = (dict(zip(words, word_frequencies)))

   # sort the frequency of the words from low to high
   sorted_frequencies = {key: value for key, value in
                         sorted(word_frequencies.items(), key=lambda item: item[1])}

   return sorted_frequencies


def get_synonyms_internet(word):
   dictionary = PyDictionary(word)
   synonym = dictionary.getSynonyms()
   return synonym

 
words = normalize_textual_information(input_text)

all_synsets_1 = {}
for word in words:
  for synonym in wordnet.synsets(word):
    if word != synonym.name() and len(synonym.lemma_names()) > 1:
      for item in synonym.lemmas():
        if word != item.name():
          all_synsets_1.setdefault(word, []).append(str(item.name()).lower())

all_synsets_2 = {}
for word in words:
  word_synonyms = get_synonyms_internet(word)
  for synonym in word_synonyms:
    if word != synonym and synonym is not None:
      all_synsets_2.update(synonym)

word_relationship = {**all_synsets_1, **all_synsets_2}

frequencies = generate_word_frequencies(words)
word_matches = []
word_set = {}
duplication_check = set()

for word, frequency in frequencies.items():
    for keyword, synonym in word_relationship.items():
        match = [x for x in synonym if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            if match not in word_set or match not in duplication_check or word not in duplication_check:
                duplication_check.add(word)
                duplication_check.add(match)
                word_matches.append([keyword, match, frequency])

# used to hold the final keyword and frequencies
final_results = {}

# list comprehension to obtain the primary keyword and its frequencies
synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# iterate synonym_matches and total the frequency count for each keyword
for item in synonym_matches:
    if item[0] not in final_results:
        final_results[item[0]] = item[1]
    else:
        final_results[item[0]] = final_results[item[0]] + item[1]

# do something with the final results
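As a hedged sketch of the key-merging idea mentioned above (my own addition: a naive single pass that is order-dependent and not transitive, so treat it as a starting point rather than a full de-duplication):

def merge_synonym_keys(relationship):
    # collapse any keyword that already appears in an earlier key's synonym list
    merged = {}
    for keyword, synonyms in relationship.items():
        canonical = next((k for k, v in merged.items() if keyword in v), keyword)
        merged.setdefault(canonical, set()).update(synonyms)
        if canonical != keyword:
            merged[canonical].add(keyword)
    return merged

# example: 'mom' folds into 'mother' because 'mom' is in mother's synonym list
print(merge_synonym_keys({'mother': ['mom', 'mum'], 'mom': ['momma', 'mommy']}))
# {'mother': {'mom', 'mum', 'momma', 'mommy'}} (set order may vary)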

4

This is a hard problem, and the best solution depends on the use case you're trying to solve. It's hard because combining words requires understanding their semantics. You can combine mom and mother together because they are semantically related.

One way to identify whether two words are semantically related is to use distributed word embeddings (vectors) such as word2vec, GloVe, fastText, etc. You can compute the cosine similarity between a word's vector and all other words' vectors, pick the top 5 closest words, and create a new merged word.

An example using word2vec:

# Load a pretrained word2vec model
import gensim.downloader as api
model = api.load('word2vec-google-news-300')

# the word list is not defined in the original snippet; this one is
# reconstructed from the output below
words = ['mom', 'mother', 'london', 'life', 'teach', 'teacher']

vectors = [model.get_vector(w) for w in words]
for i, w in enumerate(vectors):
   first_best_match = model.cosine_similarities(vectors[i], vectors).argsort()[::-1][1]
   second_best_match = model.cosine_similarities(vectors[i], vectors).argsort()[::-1][2]
   
   print (f"{words[i]} + {words[first_best_match]}")
   print (f"{words[i]} + {words[second_best_match]}")  

Output:

mom + mother
mom + teacher
mother + mom
mother + teacher
london + mom
london + life
life + mother
life + mom
teach + teacher
teach + mom
teacher + teach
teacher + mother

You can try putting a threshold on the cosine similarity and selecting only the pairs whose similarity exceeds it.
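A minimal sketch of that idea, reusing words, vectors, and model from the snippet above (the 0.5 cutoff is an illustrative value, not a tuned one):

# keep only the pairs whose cosine similarity clears the threshold
threshold = 0.5
for i, w in enumerate(words):
    sims = model.cosine_similarities(vectors[i], vectors)
    # argsort descending; skip the first entry, which is the word itself
    for j in sims.argsort()[::-1][1:]:
        if sims[j] >= threshold:
            print(f"{w} + {words[j]} (cosine={sims[j]:.2f})")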

One caveat with semantic similarity is that semantically opposite words can still come out as "similar" (man vs. woman), while on the other hand a pair like (man vs. king) is semantically similar because the words are alike.


Hi mujjiga, may I ask which texts or words I should include in the analysis? - still_learning
Find the frequently used words among the non-stop words, compute the cosine similarity between them, and check whether the top matches make sense. - mujjiga

2

Another quirky way to tackle this is to use the good old PyDictionary library. You can use

dictionary.getSynonyms()

to write a function that loops through all the words in your list and groups them. All the available synonyms will be covered and mapped into one group, letting you assign a final variable and sum up the synonyms. In your example, you would choose mother as the final word, which then shows the final count of its synonyms. A sketch of that idea follows.
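A minimal sketch of that grouping, assuming getSynonyms() returns the list-of-dicts shape shown in the PyDictionary examples earlier on this page (the word list is illustrative and the overlap rule is my own naive choice):

from PyDictionary import PyDictionary

words = ['mom', 'mother', 'mum', 'mommy', 'life']  # illustrative input
dictionary = PyDictionary(*words)

groups = []  # each group is a set of words judged to be synonyms
for entry in dictionary.getSynonyms():  # e.g. [{'mom': [...]}, {'mother': [...]}]
    if not entry:
        continue  # lookups can fail for unknown words
    word, synonyms = next(iter(entry.items()))
    related = set(synonyms) | {word}
    # merge into an existing group when any member overlaps, else start a new one
    for group in groups:
        if group & related:
            group |= related
            break
    else:
        groups.append(related)

# groups can then be used to pick one representative per set and sum frequencies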


2
What you want to achieve is "semantic textual similarity". I recommend the TensorFlow Universal Sentence Encoder, for example:
#@title Load the Universal Sentence Encoder's TF Hub module
from absl import logging

import tensorflow as tf

import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
  return model(input)

def plot_similarity(labels, features, rotation):
  corr = np.inner(features, features)
  sns.set(font_scale=1.2)
  g = sns.heatmap(
      corr,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
  g.set_xticklabels(labels, rotation=rotation)
  g.set_title("Semantic Textual Similarity")

def run_and_plot(messages_):
  message_embeddings_ = embed(messages_)
  plot_similarity(messages_, message_embeddings_, 90)

messages = [
    "Mother",
    "Mom",
    "Mama",
    "Dog",
    "Cat"
]

run_and_plot(messages)

[figure: "Semantic Textual Similarity" heatmap over the five example words]

This example is written in Python, but I have also created an example of loading the model in JVM-based languages:

https://github.com/ntedgi/universal-sentence-encoder


Hi Naor, I'm getting this error: ValueError: Must pass 2-d input. - still_learning
Please run this Colab: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb - Naor Tedgi

1
You can generate word embedding vectors and use some clustering algorithm. At the end you will need to tune the algorithm's hyperparameters to achieve high-accuracy results.
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

import spacy

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load the large english model
nlp = spacy.load("en_core_web_lg")

tokens = nlp("dog cat banana apple teaching teacher mom mother mama mommy berlin paris")

# Generate word embedding vectors
vectors = np.array([token.vector for token in tokens])
vectors.shape
# (12, 300)

Let's visualize the embeddings in three-dimensional space using the principal component analysis algorithm:
pca_vecs = PCA(n_components=3).fit_transform(vectors)
pca_vecs.shape
# (12, 3)

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
xs, ys, zs = pca_vecs[:, 0], pca_vecs[:, 1], pca_vecs[:, 2]
_ = ax.scatter(xs, ys, zs)

for x, y, z, label in zip(xs, ys, zs, tokens):
    ax.text(x + 0.3, y, z, str(label))

[figure: 3-D PCA scatter plot of the 12 word embeddings, each point labeled with its word]

Let's cluster the words using the DBSCAN algorithm:
model = DBSCAN(eps=5, min_samples=1)
model.fit(vectors)

for word, cluster in zip(tokens, model.labels_):
    print(word, '->', cluster)

Output:

dog -> 0
cat -> 0
banana -> 1
apple -> 2
teaching -> 3
teacher -> 3
mom -> 4
mother -> 4
mama -> 4
mommy -> 4
berlin -> 5
paris -> 6
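To tie this back to the question, the cluster labels can then be used to fold a frequency table into a single count per cluster. A minimal sketch, reusing tokens and model from above, with hypothetical counts standing in for the question's table:

from collections import Counter

# hypothetical frequencies for some of the words that were clustered above
freq = {'mom': 250, 'mother': 81, 'mama': 7, 'mommy': 3, 'teaching': 12, 'teacher': 9}

# map each word to its DBSCAN cluster label
label_of = {str(token): cluster for token, cluster in zip(tokens, model.labels_)}

# sum the counts of all words that landed in the same cluster
merged = Counter()
for word, count in freq.items():
    merged[label_of[word]] += count

print(merged)  # e.g. cluster 4 (mom/mother/mama/mommy) -> 341, cluster 3 -> 21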

I find your approach to my problem very interesting. I tried applying it to sentences, but it didn't work very well, so I posted a question and started a bounty. If you'd like to take a look: https://dev59.com/Ur3pa4cB1Zd3GeqPYCHc - user13623188

-1

matthewreagan/WebstersEnglishDictionary

The idea is to use this dictionary to identify similar words.

In short: run some knowledge-discovery algorithm over this dictionary that extracts knowledge based on English grammar.

Here is a thesaurus: it is 18MB.

Here is an excerpt from the thesaurus; you can try to match a word's alternatives through some algorithm.

{"word": "ma", "key": "ma_1", "pos": "noun", "synonyms": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

If you need a quick fix using an external API, the link below may help: it offers more functionality via an API, such as fetching synonyms, finding multiple definitions, finding rhyming words, and so on.

WORDAPI


Hi nikhil, thanks for your answer. I'd like to understand better what you suggest in the last step, "Here is an excerpt from the thesaurus". Are you defining a class, i.e., if I find one of those words, can I treat all the terms containing them as the same noun? - user13623188
1
Don't expect this "thesaurus" to help much: the entry for "mom" doesn't list "mother" ({"word": "mom", "key": "mom_1", "pos": "noun", "synonyms": ["mamma", "momma", "mama", "mammy", "ma", "mumm", "mommy", "mum"]}), and the entry for "mother" doesn't mention "mom" either. Of course, "mother" can also be a verb. This is a really difficult problem. - DisappointedByUnaccountableMod
Match words one-to-one and filter out multi-word phrases, i.e. avoid mappings like "in the blink of an eye" = "quickly". It is indeed far from easy, but one-to-one word grouping would be good. I don't know much about ML, but there is something called cosine similarity, which together with K-means can identify how close two words are. You can also use wordapi for a quick fix; I have updated my answer. - nikhil swami
