使用Spacy进行文本分类训练

7

在阅读了文档并完成了教程之后,我决定制作一个小型演示。结果发现我的模型无法训练。以下是代码:

import spacy
import random
import json

TRAINING_DATA = [
    ["My little kitty is so special", {"KAT": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT": False}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT": False}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
    ["Noa is a great cat name.", {"KAT": True}],
    ["We got a new kitten!", {"KAT": True}]
]

nlp = spacy.blank("en")
category = nlp.create_pipe("textcat")
nlp.add_pipe(category)
category.add_label("KAT")

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [{"textcat": [entities]} for text, entities in batch]
        nlp.update(texts, annotations, losses=losses)
    if itn % 20 == 0:
        print(losses)

当我运行这个程序时,输出表明很少有东西被学习了。
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}
{'textcat': 0.0}

这感觉很不对。应该有一个错误或有意义的标签。预测结果也证实了这一点。

for text, d in TRAINING_DATA:
    print(text, nlp(text).cats)

# Dude, Totally, Yeah, Video Games {'KAT': 0.45303162932395935}
# The iPhone 8 reviews are here {'KAT': 0.45303162932395935}
# Noa is a great cat name. {'KAT': 0.45303162932395935}
# Should I pay $1,000 for the iPhone X? {'KAT': 0.45303162932395935}
# We got a new kitten! {'KAT': 0.45303162932395935}
# My little kitty is so special {'KAT': 0.45303162932395935}

感觉我的代码缺失了一些东西,但我无法找出是什么。


他们在这里使用了2000个例子。你确定机器学习只需要6个例子吗?你的三个猫的例子都用了不同的单词来描述猫。我建议从只有一个单词描述猫的10个不同例子开始。 - Manuel
当然可以,但是textcat类别报告的损失为零,这不应该是这样的。 - cantdutchthis
3
你的训练循环和数据看起来正确,我想我找到了问题所在:请将{"textcat": [entities]}更改为{"cats": entities}(如果你正在传递注释字典,请参见此处的预期键)。当你更新文本分类器时,它会查找键"cats",但是只有"textcat"。因此,你基本上是在没有任何内容的情况下更新文本分类器,最终只得到从nlp.begin_training随机初始化的权重。 - Ines Montani
2个回答

8
如果您更新并使用spaCy 3 - 上面的代码将不再起作用。解决方案是进行一些更改进行迁移。我已经相应地修改了cantdutchthis的示例。
更改摘要:
- 使用配置更改体系结构。旧的默认值是“词袋”,新的默认值是使用注意力的“文本集合”。在调整模型时请记住这一点。 - 标签现在需要进行one-hot编码 - add_pipe接口略有改变 - nlp.update现在需要一个Example对象,而不是textannotation元组。
import spacy
# Add imports for example, as well as textcat config...
from spacy.training import Example
from spacy.pipeline.textcat import single_label_bow_config, single_label_default_config
from thinc.api import Config
import random

# labels should be one-hot encoded
TRAINING_DATA = [
    ["My little kitty is so special", {"KAT0": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT1": True}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT1": True}],
    ["The iPhone 8 reviews are here", {"KAT1": True}],
    ["Noa is a great cat name.", {"KAT0": True}],
    ["We got a new kitten!", {"KAT0": True}]
]


# bow
# config = Config().from_str(single_label_bow_config)

# textensemble with attention
config = Config().from_str(single_label_default_config)

nlp = spacy.blank("en")
# now uses `add_pipe` instead
category = nlp.add_pipe("textcat", last=True, config=config)
category.add_label("KAT0")
category.add_label("KAT1")


# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=4):
        texts = [nlp.make_doc(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]

        # uses an example object rather than text/annotation tuple
        examples = [Example.from_dict(doc, annotation) for doc, annotation in zip(
            texts, annotations
        )]
        nlp.update(examples, losses=losses)
    if itn % 20 == 0:
        print(losses)

看起来配置变量已经被初始化了,但是没有在任何地方使用?文本猫模型如何选择配置? - kjsr7

7

根据Ines的评论,这就是答案。

import spacy
import random
import json

TRAINING_DATA = [
    ["My little kitty is so special", {"KAT": True}],
    ["Dude, Totally, Yeah, Video Games", {"KAT": False}],
    ["Should I pay $1,000 for the iPhone X?", {"KAT": False}],
    ["The iPhone 8 reviews are here", {"KAT": False}],
    ["Noa is a great cat name.", {"KAT": True}],
    ["We got a new kitten!", {"KAT": True}]
]

nlp = spacy.blank("en")
category = nlp.create_pipe("textcat")
category.add_label("KAT")
nlp.add_pipe(category)

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(100):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=1):
        texts = [nlp(text) for text, entities in batch]
        annotations = [{"cats": entities} for text, entities in batch]
        nlp.update(texts, annotations, losses=losses)
    if itn % 20 == 0:
        print(losses)

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接