How to group Wikipedia categories in Python?

Question


python mediawiki wikipedia wikipedia-api mediawiki-api

21

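For each concept in my dataset I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their corresponding Wikipedia categories.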

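- hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
- enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
- bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
- perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
- climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']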

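As you can see, the first three concepts belong to the medical domain (whereas the remaining two terms are not medical).

More precisely, I want to divide my concepts into medical and non-medical. However, it is very difficult to divide the concepts using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery both belong to the medical domain, their categories are very different from each other.

Therefore, I would like to know whether there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to the medical parent category).

I am currently using pymediawiki and pywikibot. However, I am not restricted to those two libraries and am happy to have solutions using other libraries as well.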

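EDIT

As suggested by @IlmariKaronen, I am also using the categories of categories, and the results are as follows (the small text next to a category is the categories of that category).

However, I still could not find a way to use these category details to decide whether a given term is medical or non-medical.

Moreover, as pointed out by @IlmariKaronen, using WikiProject details could be promising. However, it seems that the Medicine WikiProject does not contain all the medical terms, so we would also need to check other WikiProjects.

EDIT:

My current code for extracting categories from Wikipedia concepts is as follows. This can be done using either pywikibot or pymediawiki.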

Using the library pymediawiki

from mediawiki import MediaWiki

wikipedia = MediaWiki()
p = wikipedia.page('enzyme inhibitor')
print(p.categories)

Using the library pywikibot

import pywikibot as pw

site = pw.Site('en', 'wikipedia')

print([
    cat.title()
    for cat in pw.Page(site, 'support-vector machine').categories()
    if 'hidden' not in cat.categoryinfo
])

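The categories of categories can also be obtained in the same way, as shown in @IlmariKaronen's answer.

If you are looking for a longer list of concepts for testing, I have provided more examples below.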

['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']

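For a very long list, please check the link below. https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing

NOTE: I am not expecting the solution to work 100% (it is enough for me if the proposed algorithm is able to detect many of the medical concepts).

I am happy to provide more details if needed.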

- EmJ

1

Those that are clearly medical have, for example, ICD links, but that does not cover enzymes. - tripleee

1

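These dbc: entries are the Wikipedia categories you extracted. I suggest checking whether these categories have dbc:Medicine as an ancestor category. If more than half of a concept's categories have dbc:Medicine as an ancestor, the concept can be considered "medical". - Stanislav Kralin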
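The rule in this comment can be sketched in a few lines. This is only an illustration: `has_medical_ancestor` is a hypothetical callback (it could be backed by a SPARQL ASK query or by a walk over parent categories), and the lookup set below is made up rather than real DBpedia data.

```python
from typing import Callable, Iterable


def is_medical(categories: Iterable[str],
               has_medical_ancestor: Callable[[str], bool]) -> bool:
    """Return True when more than half of the categories descend from Medicine."""
    categories = list(categories)
    if not categories:
        return False
    hits = sum(1 for cat in categories if has_medical_ancestor(cat))
    return hits > len(categories) / 2


# Toy stand-in for a real ancestor lookup (assumption, not real DBpedia data)
medical_like = {"Category:Enzyme inhibitors", "Category:Medicinal chemistry"}
concept_categories = [
    "Category:Enzyme inhibitors",
    "Category:Medicinal chemistry",
    "Category:Metabolism",
]
print(is_medical(concept_categories, lambda cat: cat in medical_like))
```

Here two of the three categories count as medical, so the concept passes the "more than half" rule; an exact tie does not pass.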

1

These {m,n} are Virtuoso-specific extensions of SPARQL 1.1 property paths. You could also try the "unbounded" skos:broader+ or skos:broader*. - Stanislav Kralin

1

" ASK { dbc:Lipid_metabolism_disorders skos:broader+ dbc:Medicine } " - Stanislav Kralin

1

That's SPARQL :-). + means one or more, * means zero or more. {1,7} means from one to seven hops, but it is only supported by the Virtuoso SPARQL endpoint. - Stanislav Kralin
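For reference, the ASK query from the comments can be generated per category with a small helper. This is a minimal sketch: `build_ask_query` is a name made up for illustration, and it only builds the query text; sending it to an endpoint is left out. It assumes the category name is already in DBpedia form (underscores instead of spaces); names containing characters such as commas or parentheses would need escaping or full IRIs.

```python
PREFIXES = (
    "PREFIX dbc: <http://dbpedia.org/resource/Category:>\n"
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
)


def build_ask_query(category: str, ancestor: str = "Medicine") -> str:
    """Build an ASK query testing whether `category` has `ancestor` above it."""
    # skos:broader+ follows one or more "broader" links (standard SPARQL 1.1)
    return f"{PREFIXES}ASK {{ dbc:{category} skos:broader+ dbc:{ancestor} }}"


print(build_ask_query("Lipid_metabolism_disorders"))
```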


6 Answers

8

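Therefore, I would like to know whether there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to the medical parent category).

MediaWiki categories are themselves wiki pages. A "parent category" is simply a category that the "child" category page belongs to. So you can get the parent categories of a category in exactly the same way as you would get the categories of any other wiki page.

For example, using pymediawiki:

from mediawiki import MediaWiki

wikipedia = MediaWiki()
p = wikipedia.page('Category:Enzyme inhibitors')
parents = p.categories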
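Repeating that lookup level by level gives a walk up the category graph. The following is a rough sketch, not part of the original answer: `has_ancestor` and `get_parents` are hypothetical names, the parent table is made up, and the depth limit is an arbitrary choice to keep the walk from wandering through all of Wikipedia.

```python
from collections import deque
from typing import Callable, Iterable


def has_ancestor(category: str, target: str,
                 get_parents: Callable[[str], Iterable[str]],
                 max_depth: int = 5) -> bool:
    """Breadth-first walk up the category graph, at most `max_depth` levels."""
    seen = {category}
    queue = deque([(category, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue
        for parent in get_parents(current):
            if parent == target:
                return True
            if parent not in seen:  # the category graph can contain cycles
                seen.add(parent)
                queue.append((parent, depth + 1))
    return False


# Toy parent table (made up); a real get_parents could return
# wikipedia.page(name).categories as in the snippet above
toy_parents = {
    "Category:Enzyme inhibitors": ["Category:Pharmacology"],
    "Category:Pharmacology": ["Category:Medicine"],
    "Category:Climate": ["Category:Earth sciences"],
}


def lookup(name: str):
    return toy_parents.get(name, [])


print(has_ancestor("Category:Enzyme inhibitors", "Category:Medicine", lookup))
print(has_ancestor("Category:Climate", "Category:Medicine", lookup))
```

With a live API, each `get_parents` call is a network request, so caching the results per category is probably worthwhile.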

- Ilmari Karonen

2

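But this does not immediately solve the OP's problem. Each category can belong to one or more categories, some of which may clearly belong to "medicine"; but I haven't worked this out to the point where any of the given examples can easily be decided yes or no. - tripleee

3

@tripleee: Indeed. At the end of the question, the OP states that they want to do this by looking at parent categories and asks how to find them, so I assumed that was their specific question. Whether this really helps solve their original problem, I cannot say. (Other possible approaches would be to look for the relevant WikiProjects, or even to try applying some kind of statistical clustering algorithm.) - Ilmari Karonen

1

@Emi: Take a look at the (meta)categories associated with WikiProjects. For example, here is the category of all WikiProject Medicine articles. Note that it contains the talk pages of the articles, because that is where the relevant template sits, but it is easy to get the article name from the talk page name (just strip the "Talk:" prefix). - Ilmari Karonen

1

Unfortunately, they don't seem to have a single big category like WikiProject Medicine's that contains all the articles, so you would need to merge all the "by quality" categories together. Or, as an alternative (no pun intended), you could directly look up the talk pages that transclude their template. - Ilmari Karonen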
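The two steps from these comments (strip the "Talk:" prefix, then merge the by-quality categories) might look like the sketch below. The function names are made up, and the category contents are invented placeholders standing in for real API results.

```python
from typing import Dict, Iterable, Set


def talk_to_article(title: str) -> str:
    """Strip the 'Talk:' namespace prefix from a talk page title."""
    prefix = "Talk:"
    return title[len(prefix):] if title.startswith(prefix) else title


def project_articles(quality_categories: Dict[str, Iterable[str]]) -> Set[str]:
    """Union of article titles across all by-quality categories."""
    articles: Set[str] = set()
    for members in quality_categories.values():
        articles.update(talk_to_article(member) for member in members)
    return articles


# Made-up category contents, standing in for real API results
sample = {
    "FA-Class medicine articles": ["Talk:Asthma"],
    "Stub-Class medicine articles": ["Talk:Sialosis", "Talk:Asthma"],
}
print(sorted(project_articles(sample)))
```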

1

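Could you provide the whole dataset? I have a few ideas, but would like to test them on your data (or a fair portion of it, including negative non-medical cases and positive medical cases). - Szymon Maszke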


6

You could try to classify the Wikipedia categories by the MediaWiki links and backlinks returned for each category.

import re
from mediawiki import MediaWiki

# TermFind checks whether any entry of a list contains the given term
def TermFind(term, termList):
    response = False
    for val in termList:
        if re.match('(.*)' + term + '(.*)', val):
            response = True
            break
    return response

# Check whether both the links and the backlinks of a page contain the given term
def BoundedTerm(wikiPage, term):
    aList = wikiPage.links
    bList = wikiPage.backlinks
    response = False
    if TermFind(term, aList) and TermFind(term, bList):
        response = True
    return response

container = []
wikipedia = MediaWiki()
# termlist: the page titles to classify (placeholder value; use your own list)
termlist = ['enzyme inhibitor']
# term: the keyword to look for, e.g. 'medicine', 'biology' or 'disease'
term = 'medicine'
for val in termlist:
    cpage = wikipedia.page(val)
    if BoundedTerm(cpage, term):
        container.append('medical')
    else:
        container.append('nonmedical')

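The idea is to try to guess a term that is shared by most of the categories; I tried biology, medicine and disease with good results. Perhaps you can try several BoundedTerm calls to make the classification, or a single call for several terms and combine the results. Hope it helps.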
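One possible way to combine several terms in a single call is sketched below. It is an assumption about what "combine the results" could mean (a page counts as medical if any one term appears in both lists), it works on plain lists so it can be tried without network access, and unlike the regex above it ignores case.

```python
from typing import Iterable, Sequence


def mentions(term: str, titles: Iterable[str]) -> bool:
    """True if `term` occurs (case-insensitively) in any title."""
    term = term.lower()
    return any(term in title.lower() for title in titles)


def bounded_by_any(links: Sequence[str], backlinks: Sequence[str],
                   terms: Iterable[str]) -> bool:
    """True if at least one term occurs in both the links and the backlinks."""
    return any(mentions(t, links) and mentions(t, backlinks) for t in terms)


# Hand-written link lists standing in for wikiPage.links / wikiPage.backlinks
links = ["Enzyme", "Drug", "Medicine"]
backlinks = ["Medicinal chemistry", "Pharmacology"]
if bounded_by_any(links, backlinks, ["biology", "medic"]):
    label = "medical"
else:
    label = "nonmedical"
print(label)
```

Requiring two or more matching terms instead of one would make the rule stricter, at the cost of missing pages with few links.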

- TavoGLC

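Hi, thanks a lot for the answer. However, when running it I get the following error: NameError: name 'wikipedia' is not defined. Could you please tell me how to fix this? :) - EmJ

1

Sorry, I edited the answer; I forgot to add wikipedia = MediaWiki(). - TavoGLC

Thanks a lot, that solved it. For the term "vasodilation" I got the error "mediawiki.exceptions.DisambiguationError:". However, "vasodilation" is a valid Wikipedia page. Do you know why this is? :) - EmJ

1

Sorry, I cannot reproduce the error, but disambiguation errors happen when a term may have several meanings, for example (https://en.wikipedia.org/wiki/Raby). Also, vasodilation does not appear in Wikipedia's index of disambiguation links (https://en.wikipedia.org/wiki/Wikipedia:Links_to_disambiguating_pages_(V)); maybe another term is causing the error. Hope this helps. - TavoGLC

@anand_v.singh Sorry, I moved the question to Open Data: https://opendata.stackexchange.com/questions/15206/how-to-identify-general-medical-terms-using-wikipedia-dbpedia-wikidata Please let me know if you know the answer. Thanks a lot :) - EmJ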


5

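The wikipedia library is also a good bet for extracting the categories of a given page: wikipedia.WikipediaPage(page).categories returns a simple list. The library also lets you search multiple pages should they all have the same title.

In medicine there seem to be a lot of key roots and suffixes, so finding keywords may be a good way of finding medical terms.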

import wikipedia

def categorySorter(targetCats, pagesToCheck, mainCategory):
    targetList = []
    nonTargetList = []
    targetCats = [i.lower() for i in targetCats]

    print('Sorting pages...')
    print('Sorted:', end=' ', flush=True)
    for page in pagesToCheck:

        e = openPage(page)

        def deepList(l):
            for item in l:
                if item[1] == 'SUBPAGE_ID':
                    deepList(item[2])
                else:
                    catComparator(item[0], item[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])

        if e[1] == 'SUBPAGE_ID':
            deepList(e[2])
        else:
            catComparator(e[0], e[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])

    print()
    print()
    print('Results:')
    print(mainCategory, ': ', targetList, sep='')
    print()
    print('Non-', mainCategory, ': ', nonTargetList, sep='')

def openPage(page):
    try:
        return [page, wikipedia.WikipediaPage(page).categories]
    except wikipedia.exceptions.PageError:
        # The page does not exist
        return [page, 'NONEXIST_ID']
    except wikipedia.exceptions.DisambiguationError as e:
        # Ambiguous title: open every suggested page instead
        pageCategories = []
        for i in e.options:
            if '(disambiguation)' not in i:
                pageCategories.append(openPage(i))
        return [page, 'SUBPAGE_ID', pageCategories]

def catComparator(pageTitle, pageCategories, targetCats, targetList, nonTargetList, lastPage):

    # unhash to view the categories of each page
    #print(pageCategories)
    pageCategories = [i.lower() for i in pageCategories]

    any_in = False
    for i in targetCats:
        if i in pageTitle:
            any_in = True
    if any_in:
        print('', end = '', flush=True)
    elif compareLists(targetCats, pageCategories):
        any_in = True

    if any_in:
        targetList.append(pageTitle)
    else:
        nonTargetList.append(pageTitle)

    # Just prints a pretty list, you can comment out until next hash if desired
    if any_in:
        print(pageTitle, '(T)', end='', flush=True)
    else:
        print(pageTitle, '(F)',end='', flush=True)

    if pageTitle != lastPage:
        print(',', end=' ')
    # No more commenting

    return any_in

def compareLists (a, b):
    for i in a:
        for j in b:
            if i in j:
                return True
    return False

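This code essentially compares a list of keywords and suffixes against the title and the categories of each page to determine whether the page is medically related. It also looks at related pages/subpages for the bigger topics and determines whether those are related as well. I am not well versed in medicine, so forgive my choice of categories; here is an example: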

medicalCategories = ['surgery', 'medic', 'disease', 'drugs', 'virus', 'bact', 'fung', 'pharma', 'cardio', 'pulmo', 'sensory', 'nerv', 'derma', 'protein', 'amino', 'unii', 'chlor', 'carcino', 'oxi', 'oxy', 'sis', 'disorder', 'enzyme', 'eine', 'sulf']
listOfPages = ['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']
categorySorter(medicalCategories, listOfPages, 'Medical')

This example list gets about 70% of what should be on the list, at least to my knowledge.

- Londala

5

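In natural language processing there is the concept of word vectors: by looking through a large amount of text, words are converted into multi-dimensional vectors, and the smaller the distance between two such vectors, the greater the similarity between the words. The good news is that many people have already generated these word vectors and made them available under very permissive licenses. In your case you are working with Wikipedia, and word vectors for it exist here: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.

Now, these would be best suited for this task since they contain most of the words from Wikipedia's corpus, but in case they don't suit you, or are removed in the future, you can use one of the others I list below. That said, there is a better way to do this: pass the texts to TensorFlow's universal language model embed module, which does most of the heavy lifting for you; you can read more about that here. The reason I put it after the Wikipedia text dump is that I have heard it is a bit hard to work with when dealing with medical samples. This paper does propose a solution to tackle that, but I have never tried it, so I cannot be sure of its accuracy.

Using the word embeddings from TensorFlow is very simple; just do: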

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(["Input Text here as"," List of strings"])
session.run(embeddings)

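Since you may not be familiar with TensorFlow, you may run into problems when trying to run just this snippet. Follow this link, where the complete usage instructions are documented, and from there you should be able to easily adapt it to your needs.

That said, I would recommend first checking out TensorFlow's embed module and their pre-trained word embeddings; if they don't work for you, check out the Wikimedia link; if that doesn't work either, proceed to the concepts of the paper I linked. Since this answer describes an NLP approach, it will not be 100% accurate, so keep that in mind before you proceed.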

Glove Vectors https://nlp.stanford.edu/projects/glove/

Facebook's fast text: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Or this http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz

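If you run into problems implementing this after following the Colab tutorial, add the problem to the question and comment below; we can proceed from there.

EDIT: added code to cluster the topics

In brief: rather than using word vectors, I encode the summary sentences.

File content.py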

def Alltopics():
    topics = []  # list all your topics here (omitted for space restrictions)
    for topic in topics:
        yield topic

File summaryGenerator.py

import wikipedia
import pickle
from content import Alltopics

summary = []
failed = []
for topic in Alltopics():
    try:
        # Store (topic, summary) pairs; SimilartiyCalculator.py reads x[0] and x[1]
        summary.append((topic, wikipedia.summary(topic)))
    except Exception as e:
        # Keep the error text so the failure list can always be pickled
        failed.append((topic, str(e)))
with open("summary.txt", "wb") as fp:
    pickle.dump(summary, fp)
with open('failed.txt', 'wb') as fp:
    pickle.dump(failed, fp)

File SimilartiyCalculator.py

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import os
import pandas as pd
import re
import pickle
import sys
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix


try:
    with open("summary.txt", "rb") as fp:   # Unpickling
        summary = pickle.load(fp)
except Exception as e:
    print ('Cannot load the summary file, Please make sure that it exists, if not run Summary Generator first', e)
    sys.exit('Read the error message')

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

tf.logging.set_verbosity(tf.logging.ERROR)
messages = [x[1] for x in summary]
labels = [x[0] for x in summary]
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(embed(messages)) # In message embeddings each vector is a second (1,512 vector) and is numpy.ndarray (noOfElemnts, 512)

X = message_embeddings
agl = AgglomerativeClustering(n_clusters=5, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', pooling_func='deprecated')
agl.fit(X)
dist_matrix = distance_matrix(X,X)
Z = hierarchy.linkage(dist_matrix, 'complete')
dendro = hierarchy.dendrogram(Z)
cluster_labels = agl.labels_

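The project is also hosted on GitHub at https://github.com/anandvsingh/WikipediaSimilarity, where you can find the similarity.txt file and the other files. I could not run it on all topics myself, so I would urge you to run it on the full list of topics (clone the repository and run SummaryGenerator.py directly) and, if you do not get the expected result, upload similarity.txt via a pull request. If possible, also upload message_embeddings as a CSV file of topics and their embeddings.

Changes after edit 2: switched the similarity generator to hierarchy-based (agglomerative) clustering. I suggest keeping the title names at the bottom of the dendrogram; for that, look at the definition of dendrogram here. I checked a few samples and the results look quite good; you can change the n_clusters value to fine-tune the model. Note: this requires you to run the summary generator again.

From here you should be able to try a few n_clusters values, see in which cluster the medical terms are grouped together, and then look up the cluster_label for that group. Since we are grouping by summary, the clusters should be more accurate. If you run into any problems or do not understand something, comment below.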
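To see which terms ended up together, the topics can be grouped by their cluster label. A small sketch, assuming `labels` and `cluster_labels` have the shapes used in the script above; the sample values here are made up and do not come from a real run.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_cluster(labels: Sequence[str],
                     cluster_labels: Sequence[int]) -> Dict[int, List[str]]:
    """Map each cluster id to the topics assigned to it."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for topic, cluster in zip(labels, cluster_labels):
        groups[int(cluster)].append(topic)
    return dict(groups)


# Made-up assignments; in the script these would be `labels` and `agl.labels_`
topics = ["menthol", "monday", "liver resection", "january"]
assignments = [0, 1, 0, 1]
for cluster, members in sorted(group_by_cluster(topics, assignments).items()):
    print(cluster, members)
```

Scanning the printed groups for a few known medical terms should show which cluster id to treat as "medical" for a given n_clusters.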

- anand_v.singh

1

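@Emi I'm facing some internet speed bottlenecks; please check the edit, and upload the results to Google Drive or directly to the GitHub repository (if possible) :). - anand_v.singh

1

@Emi It looks like you ran an earlier version of the program: summary.txt should be a list of tuples of topic and summary, but this version contains only summaries. I will run it on this dataset and, if it works, tell you to run it again to fix this. - anand_v.singh

1

@Emi Not really needed; this is enough for me to verify that it works. After that I would suggest you do it yourself and keep track of the topics of the summaries. - anand_v.singh

1

@Emi The update is now available both here and on GitHub. Read the changes in the edit section at the bottom of the answer to see what changed and how to proceed. - anand_v.singh

1

I will also attach @szymonmaszke's active learning part to it, to reduce the overhead on your side, but I would still stick with the summary text embeddings used above rather than plain word embeddings. - anand_v.singh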


4

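The question is a little unclear to me and doesn't seem to have a straightforward solution; it may require some NLP model. Also, the words "concept" and "categories" are used interchangeably. My understanding is that concepts such as enzyme inhibitor, bypass surgery and hypertriglyceridemia need to be grouped together as medical content, and the rest as non-medical. This problem needs more data than just the category names. A corpus is required to train, for example, an LDA model, where the entire text is fed to the algorithm and it returns the most likely topics for each concept. https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/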

- Meena Nagarajan


- Szymon Maszke · Accepted Answer

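Solution overview

Okay, I would approach the problem from multiple directions. There are some great suggestions here, and if I were you I would use an ensemble of those approaches (majority voting: in the binary case, predict the label that more than 50% of the classifiers agree on).

I'm thinking about the following approaches:

- Active learning (the example approach provided by me below)
- MediaWiki backlinks, provided as an answer by @TavoGLC
- SPARQL ancestral categories, provided in a comment to your question by @Stanislav Kralin, and/or parent categories provided by @Meena Nagarajan (those two could be an ensemble of their own based on their differences, but for that you would have to contact both creators and compare their results).

This way 2 out of 3 would have to agree that a certain concept is a medical one, which further reduces the chance of an error.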
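The 2-out-of-3 agreement described above boils down to a majority vote. A minimal sketch, where the three votes are hypothetical outputs of the three approaches for one concept:

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """True when strictly more than half of the classifiers say 'medical'."""
    return sum(votes) > len(votes) / 2


# Hypothetical outputs of the three approaches for one concept
votes = {"active_learning": True, "backlinks": False, "ancestor_categories": True}
print(majority_vote(list(votes.values())))
```

With an odd number of classifiers a tie cannot occur; with an even number, this version treats a tie as non-medical.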

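While we're at it, I would argue against the approach presented by @anand_v.singh in this answer, because:

- The distance metric should not be Euclidean; cosine similarity is a much better metric (used by, e.g., spaCy), as it does not take the magnitude of the vectors into account (and it shouldn't; that's how word2vec and GloVe were trained).
- If I understood correctly, many artificial clusters would be created, while we only need two: medicine and non-medicine. Furthermore, the centroid of medicine isn't centered on medicine itself. This poses additional problems: the centroid may move far away from medicine, and other words (say computer or human, or any other that doesn't fit medicine in your opinion) may get into the cluster.
- It's hard to evaluate the results, even more so because the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/TSNE/similar; for your longer dataset PCA obtains around 5% explained variance, which is really low).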
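The first point can be checked on toy numbers: scaling a vector changes its Euclidean distance to the original but not its cosine similarity. A small pure-Python sketch with made-up vectors:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; grows with differences in magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]  # same direction, ten times the magnitude
print(cosine_similarity(v, scaled))   # approximately 1: same direction
print(euclidean_distance(v, scaled))  # large, despite the identical direction
```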

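Based on the problems highlighted above, I have come up with a solution using active learning, which is a pretty forgotten approach to such problems.

In the active learning approach, when we have a hard time coming up with an exact algorithm (such as what it means for a term to be part of the medical category), we ask a human "expert" (who doesn't actually have to be an expert) to provide some answers.

As @anand_v.singh pointed out, word vectors are one of the most promising approaches, and I will use them here as well (differently though, and in my opinion in a much cleaner and easier fashion).

There is no need to use state-of-the-art contextualized word embeddings (e.g. BERT). Check how many of your concepts have no representation (e.g. are represented as a vector of zeros). This should be checked (and is checked in my code; there will be further discussion when the time comes), and you may use whichever embedding has most of them present.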
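A coverage check of this kind can be sketched without any NLP library. The embedding table below is a made-up dictionary; with spaCy one would look at each processed concept's vector instead (the answer's own code does this through `has_vector`).

```python
from typing import Dict, List, Sequence


def missing_concepts(concepts: Sequence[str],
                     vectors: Dict[str, List[float]]) -> List[str]:
    """Concepts with no vector at all, or with an all-zero vector."""
    missing = []
    for concept in concepts:
        vector = vectors.get(concept)
        if vector is None or not any(vector):
            missing.append(concept)
    return missing


# Toy embedding table (made-up numbers)
toy_vectors = {"menthol": [0.2, 0.7], "vpb": [0.0, 0.0]}
concepts = ["menthol", "vpb", "furth"]
absent = missing_concepts(concepts, toy_vectors)
print(absent, f"{len(absent) / len(concepts):.0%} missing")
```

Running the same check against each candidate embedding gives a simple basis for picking the one that covers most of the concepts.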

Measuring similarity using spaCy

This class measures the similarity between medicine, encoded as spaCy's GloVe word vector, and every other concept.

class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)

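This code returns a number for each concept, measuring how similar it is to the centroid. It also records the indices of concepts that are missing a representation. It can be called like this: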

import json
import typing

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")

centroid = nlp("medicine")

concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)

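You may substitute your own data in place of new_concepts.json.

Look at spacy.load and notice that I have used en_vectors_web_lg. It consists of 685,000 unique word vectors (which is a lot) and may work out of the box for your case. You have to download it separately after installing spaCy; more information is provided in the links above.

Additionally, you may want to use multiple centroid words, e.g. add words like disease or health and average their word vectors. I'm not sure whether that would positively affect your case, though.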
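Averaging several centroid words is an element-wise mean. A minimal sketch with made-up 3-dimensional vectors (real GloVe vectors have hundreds of dimensions, and `average_vectors` is a name chosen for illustration):

```python
from typing import List, Sequence


def average_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


# Made-up 3-dimensional vectors for "medicine", "disease" and "health"
medicine = [0.9, 0.1, 0.3]
disease = [0.7, 0.3, 0.1]
health = [0.8, 0.2, 0.5]
centroid = average_vectors([medicine, disease, health])
print([round(x, 3) for x in centroid])
```

With spaCy, a multi-word Doc such as nlp("medicine disease health") already exposes the mean of its token vectors by default, which may be the simplest way to try this.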

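Another possibility is to use multiple centroids and calculate the similarity between each concept and each of the centroids. We may have a few thresholds in that case; this is likely to remove some false positives, but may miss some terms that one could consider similar to medicine. Furthermore, it would complicate the case much more, but if your results are unsatisfactory you should consider the two options above (and only then; don't jump into this approach without prior thought).

Now we have a rough measure of each concept's similarity. But what does it mean that a certain concept has 0.1 positive similarity to medicine? Is it a concept one should classify as medical? Or is that already too far away?

Asking an expert

To get a threshold (terms below it will be considered non-medical), it's easiest to ask a human to classify some of the concepts for us (and that's what active learning is about). Yes, I know it's a really simple form of active learning, but I would consider it such anyway.

I have written a class with a sklearn-like interface that asks a human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.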

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        max_steps: int,
        samples: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.max_steps: int = max_steps
        self.samples: int = samples
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

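- The samples argument describes how many examples will be shown to an expert during each iteration (it is the maximum; fewer are returned if samples were already asked about or there are not enough of them to show).
- step represents how much the threshold drops in each iteration (we start at 1, meaning perfect similarity).
- change_multiplier: if the expert answers that the concepts are not related (or mostly unrelated, as several of them are returned), step is multiplied by this floating-point number. It is used to pinpoint the exact threshold between step changes at each iteration.
- Concepts are sorted by their similarity (the more similar a concept is, the higher it ranks).

The function below asks the expert for an opinion and finds the optimal threshold based on their answers.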

def _ask_expert(self, available_concepts_indices):
    # Get random concepts (the ones above the threshold)
    concepts_to_show = set(
        np.random.choice(
            available_concepts_indices, len(available_concepts_indices)
        ).tolist()
    )
    # Remove those already presented to an expert
    concepts_to_show = concepts_to_show - self._checked_concepts
    self._checked_concepts.update(concepts_to_show)
    # Print message for an expert and concepts to be classified
    if concepts_to_show:
        print("\nAre those concepts related to medicine?\n")
        print(
            "\n".join(
                f"{i}. {concept}"
                for i, concept in enumerate(
                    self.concepts[list(concepts_to_show)[: self.samples]]
                )
            ),
            "\n",
        )
        return input("[y]es / [n]o / [any]quit ")
    return "y"

An example question looks like this:

Are those concepts related to medicine?

0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system

[y]es / [n]o / [any]quit y

... and parsing the expert's answer:

# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
    if decision.lower() == "y":
        # You can't go higher as current threshold is related to medicine
        self._max_threshold = self.threshold_
        if self.threshold_ - self.step < self._min_threshold:
            return False
        # Lower the threshold
        self.threshold_ -= self.step
        return True
    if decision.lower() == "n":
        # You can't go lower than this, as the current threshold is already not related to medicine
        self._min_threshold = self.threshold_
        # Shrink the step to pinpoint the exact spot
        self.step *= self.change_multiplier
        if self.threshold_ + self.step < self._max_threshold:
            return False
        # Raise the threshold
        self.threshold_ += self.step
        return True
    return False

And finally, the whole code of ActiveLearner, which finds the optimal similarity threshold according to the expert:

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self

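All in all, you would have to answer some questions manually, but in my opinion this approach is way more accurate. Furthermore, you don't have to go through all of the samples, just a small subset of them. You may decide how many samples constitute a medical term (e.g. whether 40 medical samples and 10 non-medical ones shown together should still be considered medical), which lets you fine-tune this approach to your preferences. If there is an outlier (say, 1 sample out of 50 is non-medical), I would consider the threshold still valid.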
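If you would rather make the "how many samples count as medical" decision explicit than judge each batch by eye, something like the following could help. It is only a sketch: `batch_is_medical` and the 0.8 cut-off are assumptions for illustration and are not part of the ActiveLearner class above, which takes a single y/n answer per batch.

```python
from typing import Sequence


def batch_is_medical(answers: Sequence[bool], min_fraction: float = 0.8) -> bool:
    """Treat a shown batch as medical if enough of its samples are medical."""
    if not answers:
        return False
    return sum(answers) / len(answers) >= min_fraction


# 40 medical and 10 non-medical samples, as in the example above
batch = [True] * 40 + [False] * 10
print(batch_is_medical(batch))                    # 40/50 meets the 0.8 cut-off
print(batch_is_medical(batch, min_fraction=0.9))  # a stricter cut-off rejects it
```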

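Once again: this approach should be mixed with others in order to minimize the chance of wrong classification.

Classifier

Once we obtain the threshold from the expert, classification is instantaneous. Here is a simple class for classification: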

class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions

And for brevity, here is the final source code:

import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")

    centroid = nlp("medicine")

    concepts = json.load(open("concepts_new.txt"))
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )

After answering some questions, with threshold 0.1 (everything in [-1, 0.1) is considered non-medical, while [0.1, 1] is considered medical), I got the following results:

kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True

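As you can see, this approach is far from perfect, so the last section describes possible improvements:

Possible improvements

As mentioned at the beginning, mixing my approach with the other answers would probably filter out ideas like sport shoe belonging to medicine, and the active learning approach would act more as a deciding vote when the two heuristics mentioned above are tied.

We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use multiple of them (either increasing or decreasing); let's say those are 0.1, 0.2, 0.3, 0.4, 0.5.

Let's say sport shoe gets, for each threshold, its respective True/False like this:

True True False False False,

Making a majority vote, we would mark it non-medical by 3 votes to 2. Furthermore, a too-strict threshold would be mitigated as well if the thresholds below it out-vote it (for example when the True/False pattern looks like this: True True True False False).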
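The threshold ensemble can be sketched as a vote over several cut-offs. The similarity values below are made up for illustration, and using >= for the comparison is an assumption (the Classifier above uses a strict >).

```python
from typing import Sequence


def ensemble_predict(similarity: float, thresholds: Sequence[float]) -> bool:
    """Majority vote over several similarity thresholds."""
    votes = [similarity >= t for t in thresholds]
    return sum(votes) > len(votes) / 2


thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
# A made-up similarity of 0.25 passes only the two lowest thresholds
print(ensemble_predict(0.25, thresholds))
# 0.35 passes three of five, so the strictest thresholds are outvoted
print(ensemble_predict(0.35, thresholds))
```

With thresholds on a fixed grid like this, the vote is equivalent to comparing against the median threshold, so the gain would mainly come from thresholds obtained in separate expert sessions.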

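The final possible improvement I came up with: in the code above I'm using the Doc vector, which is the mean of the word vectors making up the concept. Say one word is missing (its vector consists of zeros); in that case the concept would be pushed further away from the medicine centroid. You may not want that (as some niche medical terms [abbreviations like gpv or others] might be missing their representation); in such a case you could average only those vectors which are different from zero.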
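Averaging only the non-zero vectors could look like the sketch below. It uses plain lists and made-up numbers; `mean_of_nonzero` is a hypothetical helper, not something spaCy provides.

```python
from typing import List, Optional, Sequence


def mean_of_nonzero(word_vectors: Sequence[Sequence[float]]) -> Optional[List[float]]:
    """Average only the word vectors that are not all zeros."""
    present = [v for v in word_vectors if any(v)]
    if not present:
        return None  # no word of the concept has a representation
    return [sum(component) / len(present) for component in zip(*present)]


# Made-up vectors: the second word is out of vocabulary (all zeros)
concept_words = [[0.4, 0.8], [0.0, 0.0]]
print(mean_of_nonzero(concept_words))
```

Returning None for concepts without any known word keeps them distinguishable from concepts that are merely dissimilar to the centroid.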

I know this post is quite lengthy, so if you have any questions, post them below.