NLTK: how to detect whether a sentence is a question?

13

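I want to write a Python script, using NLTK or whichever library fits best, that correctly identifies whether a given sentence is interrogative (a question) or not. I tried regular expressions, but there are deeper cases that regex cannot handle, so I would like to use natural language processing. Can anyone help?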


1
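When you say "interrogative" and that you used regex, are you looking for something deeper than just checking for a question mark? You may find this link useful: https://dev59.com/LWMm5IYBdhLWcg3wFL_s - alexbhandari
I have already read that post. The problem is that I am a beginner and the answers there are quite complex. I am looking for a simple solution, if one exists. - Freakant
The complexity depends on your criteria for what counts as a question, which you should clarify in the post. If you only want to check for a question mark, that is easy. If you want to identify questions by looking for interrogative words (what, why, how, etc.) rather than punctuation, that is not too hard either. However, if you want to recognize any kind of question in general (for example, "is this good"), that is trickier and may need a solution as complex as those in the post linked above. - alexbhandari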
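
The two simple checks described in the comment above (a trailing question mark, or a leading interrogative word) can be sketched in plain Python. This is only an illustration under stated assumptions: the word lists and the function name looks_like_question are hypothetical and not part of NLTK.

```python
import re

# Hypothetical word lists for illustration; extend them for your own data.
WH_WORDS = {"what", "why", "how", "who", "whom", "whose", "which", "when", "where"}
AUX_VERBS = {"is", "are", "am", "was", "were", "do", "does", "did",
             "can", "could", "will", "would", "should", "shall", "may", "might"}


def looks_like_question(sentence):
    """Rule-based guess: True if the sentence looks like a question."""
    text = sentence.strip().lower()
    if not text:
        return False
    # Rule 1: explicit question mark at the end
    if text.endswith("?"):
        return True
    # Rule 2: the first word is an interrogative word or an auxiliary verb
    words = re.findall(r"[a-z']+", text)
    return bool(words) and (words[0] in WH_WORDS or words[0] in AUX_VERBS)


for s in ["Is this good", "What time is it", "This is good."]:
    print(s, "->", looks_like_question(s))
```

Rules like these misfire on imperatives such as "Do it now", which start with an auxiliary verb, and that is the kind of case where a trained classifier tends to help.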
4 Answers

16
This article may solve your problem.

Here is the code:

import nltk
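nltk.download('nps_chat')
# word_tokenize also needs the Punkt tokenizer data
# (recent NLTK releases look for 'punkt_tab' instead)
nltk.download('punkt')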
posts = nltk.corpus.nps_chat.xml_posts()[:10000]


def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

This should print something like 0.67, which is a fairly decent accuracy. To run a string of text through this classifier, try:

print(classifier.classify(dialogue_act_features(line)))

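You can then classify strings as questions, statements, and so on, and extract whatever you need.

This approach uses Naive Bayes, which I think is the simplest option, but there are of course many other ways to do it. Hope this helps!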
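
One way to "extract whatever you need" is to filter a list of sentences by predicted label. The sketch below is an assumption-laden illustration: extract_questions and toy_classify are hypothetical names, and toy_classify is only a stand-in so the example runs without a trained model.

```python
# Question labels used by the NPS Chat corpus (they also appear later in this thread).
QUESTION_LABELS = {"whQuestion", "ynQuestion"}


def extract_questions(sentences, classify):
    """Return the sentences whose predicted label is a question label.

    `classify` is any callable that maps a sentence to a label string.
    """
    return [s for s in sentences if classify(s) in QUESTION_LABELS]


# Stand-in classifier (hypothetical) so the sketch runs without a trained model.
def toy_classify(sentence):
    return "ynQuestion" if sentence.rstrip().endswith("?") else "Statement"


print(extract_questions(["Is it raining?", "It is raining."], toy_classify))
```

With the trained model from the answer, the callable would be something like lambda s: classifier.classify(dialogue_act_features(s)).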


Can I add custom training data? For example, if I check "do I need Anaconda to use Jupyter", it is reported as a statement. - Sunil Garg
What method do you use to get each "detected" question? - danywigglebutt
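
On the custom training data question: NLTK's NaiveBayesClassifier.train takes a list of (features, label) pairs, so hand-labelled examples can in principle be appended to the corpus-derived list before training. A minimal sketch, assuming str.split as a stand-in for nltk.word_tokenize; simple_features, custom_examples and extend_training_set are hypothetical names.

```python
def simple_features(text):
    """Bag-of-words features in the same shape as dialogue_act_features.

    Uses str.split as a stand-in for nltk.word_tokenize (assumption).
    """
    return {"contains({})".format(word.lower()): True for word in text.split()}


# Hypothetical hand-labelled examples that reuse the corpus label names.
custom_examples = [
    ("do i need anaconda to use jupyter", "ynQuestion"),
    ("i installed anaconda yesterday", "Statement"),
]

custom_featuresets = [(simple_features(text), label) for text, label in custom_examples]


def extend_training_set(train_set, extra):
    """Return a new list with the extra (features, label) pairs appended."""
    return list(train_set) + list(extra)


print(custom_featuresets[0])
```

The combined list would then be passed to nltk.NaiveBayesClassifier.train in place of train_set. A handful of extra examples may not outweigh thousands of corpus posts, so the effect on any single sentence is not guaranteed.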

4

Building on @PolkaDot's answer, I wrote a function that combines NLTK with some custom code to get higher accuracy.

import nltk

posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]

# 10% of the total data
size = int(len(featuresets) * 0.1)

# first 10% for test_set to check the accuracy, and rest 90% after the first 10% for training
train_set, test_set = featuresets[size:], featuresets[:size]

# get the classifier from the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)
# to check the accuracy - 0.67
# print(nltk.classify.accuracy(classifier, test_set))

question_types = ["whQuestion","ynQuestion"]
def is_ques_using_nltk(ques):
    question_type = classifier.classify(dialogue_act_features(ques)) 
    return question_type in question_types

Then:

question_pattern = ["do i", "do you", "what", "who", "is it", "why","would you", "how","is there",
                    "are there", "is it so", "is this true" ,"to know", "is that true", "are we", "am i", 
                   "question is", "tell me more", "can i", "can we", "tell me", "can you explain",
                   "question","answer", "questions", "answers", "ask"]

helping_verbs = ["is","am","can", "are", "do", "does"]
# Fall back to hand-written rules when the classifier does not flag a question
def is_question(question):
    question = question.lower().strip()
    if is_ques_using_nltk(question):
        return True

    # check whether any pattern occurs in the text (plain substring match)
    is_ques = any(pattern in question for pattern in question_pattern)

    # the text may contain several sentences, so split it on periods
    for sentence in question.split("."):
        sentence = sentence.strip()
        if sentence:
            # a sentence counts as a question if it ends with "?"
            # or starts with a helping verb
            first_word = nltk.word_tokenize(sentence)[0]
            if sentence.endswith("?") or first_word in helping_verbs:
                is_ques = True
                break
    return is_ques

You only need to call is_question to check whether the sentence you pass in is a question.
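
One caveat with the pattern check: `pattern in question` is a substring test, so "ask" also matches inside "task" and "how" inside "show". A possible refinement, sketched here as an assumption rather than as part of the answer, is to match the patterns on word boundaries with the standard re module. The names compile_patterns and contains_pattern are hypothetical.

```python
import re

# A few patterns taken from the question_pattern list above.
patterns = ["ask", "what", "can i", "how"]


def compile_patterns(phrases):
    """Build one regex that matches any of the phrases as whole words only."""
    alternatives = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"\b(?:" + alternatives + r")\b")


PATTERN_RE = compile_patterns(patterns)


def contains_pattern(text):
    """True if any pattern occurs in the text as a whole word or phrase."""
    return PATTERN_RE.search(text.lower()) is not None


print("ask" in "finish the task")           # substring test
print(contains_pattern("finish the task"))  # word-boundary test
```

The first print shows the substring test firing on "task", while the word-boundary version does not.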


3

By simply applying a gradient boosting algorithm from the sklearn library, you can improve on PolkaDot's solution and reach about 86% accuracy. Here is how:

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()


posts_text = [post.text for post in posts]

# Split train and test 80/20; one shared index keeps the two sets from overlapping
split = int(len(posts_text) * 0.8)
train_text = posts_text[:split]
test_text = posts_text[split:]

# Get TF-IDF features
vectorizer = TfidfVectorizer(ngram_range=(1,3), 
                             min_df=0.001, 
                             max_df=0.7, 
                             analyzer='word')

X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

y = [post.get('class') for post in posts]

y_train = y[:split]
y_test = y[split:]

# Fit the gradient boosting classifier to the training set
gb = GradientBoostingClassifier(n_estimators = 400, random_state=0)
# Can be improved with cross-validation

gb.fit(X_train, y_train)

predictions_rf = gb.predict(X_test)

# The 86% quoted above was measured with a test slice that started at the 20% mark
# and therefore overlapped the training data; a clean split may score differently
print(classification_report(y_test, predictions_rf))

You can then use gb.predict(vectorizer.transform(['your new sentence here'])) to predict on new data.
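
When a dataset is sliced into train and test parts by position, both slices should use the same boundary index so that no example ends up in both. A minimal sketch of such a helper; split_by_fraction is a hypothetical name, not a scikit-learn function.

```python
def split_by_fraction(items, train_fraction=0.8):
    """Split a sequence into (train, test) at one shared boundary index."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    boundary = int(len(items) * train_fraction)
    return items[:boundary], items[boundary:]


data = list(range(10))
train, test = split_by_fraction(data)
print(len(train), len(test))     # 8 2
print(set(train) & set(test))    # set()
```

The labels have to be split with the same fraction so that they stay aligned with the texts. scikit-learn also provides train_test_split in sklearn.model_selection, which shuffles by default.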


"Do I need to install Anaconda to use Jupyter?" is a question, not a statement. - Sunil Garg
1
Its accuracy is 86%, not 100%. - Jerry Fanelli
Is there a way to add training data like this so that it marks such questions as questions? - Sunil Garg

1

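Building on the previous answers: if your only task is to decide whether a given sentence is a question, I would rather train a binary classifier. First preprocess the labels into binary labels, then train the classifier.

This raises the accuracy of the trained classifier to 0.864.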

import nltk

nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

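def generate_binary_feature(label):
    # True for the labels treated as "question" in this answer.
    # Note: 'yAnswer' (a "yes" answer) is kept from the original code even though
    # it is not a question act; drop it if you want question labels only.
    return label in ['whQuestion', 'yAnswer', 'ynQuestion']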

featuresets = [(dialogue_act_features(post.text), generate_binary_feature(post.get('class'))) for post in posts]

# 10% of the total data
size = int(len(featuresets) * 0.1)

# first 10% for test_set to check the accuracy, and rest 90% after the first 10% for training
train_set, test_set = featuresets[size:], featuresets[:size]

# get the classifier from the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)
# to check the accuracy
print(nltk.classify.accuracy(classifier, test_set))
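
For a binary question detector, accuracy alone can look good when questions are a minority of the posts, so it may be worth reporting precision and recall as well. A minimal sketch in plain Python; binary_report is a hypothetical helper and the toy data is made up for illustration.

```python
def binary_report(gold, predicted):
    """Return (accuracy, precision, recall) for boolean label sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # questions found
    fp = sum(1 for g, p in pairs if not g and p)    # false alarms
    fn = sum(1 for g, p in pairs if g and not p)    # questions missed
    correct = sum(1 for g, p in pairs if g == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy data: 1 question among 10 posts, and a model that never predicts "question".
gold = [True] + [False] * 9
predicted = [False] * 10
print(binary_report(gold, predicted))   # (0.9, 0.0, 0.0)
```

With the classifier above, gold would be [label for _, label in test_set] and predicted would be [classifier.classify(feats) for feats, _ in test_set].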
