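I want to create a Python script, using NLTK or whichever library fits best, that correctly identifies whether a given sentence is interrogative (a question) or not. I tried regular expressions, but there are deeper cases that regex can't handle, so I'd like to use natural language processing. Can anyone help?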
Here is the code:
import nltk

nltk.download('nps_chat')
# tokenizer models needed by nltk.word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    # bag-of-words features: one "contains(word)" flag per token
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
# hold out the first 10% for testing and train on the rest
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
This should print something like 0.67, which is decent accuracy. To run a string of text through this classifier, try:
print(classifier.classify(dialogue_act_features(line)))
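You can then classify strings as questions, statements, and so on, and pull out whatever you need.
This approach uses Naive Bayes, which I think is the simplest option, but there are of course many other ways to do this. Hope this helps!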
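To make the idea behind the classifier concrete, here is a minimal, dependency-free sketch of Naive Bayes over "contains(word)" features. It is not NLTK's implementation: the whitespace tokenizer, the add-one smoothing, the function names, and the six toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    # Whitespace split stands in for nltk.word_tokenize to keep the sketch dependency-free.
    return {"contains({})".format(word) for word in text.lower().split()}


def train_naive_bayes(samples):
    # Count how often each label occurs and how often each feature occurs per label.
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        label_counts[label] += 1
        for feature in bag_of_words(text):
            feature_counts[label][feature] += 1
            vocabulary.add(feature)
    return label_counts, feature_counts, vocabulary


def classify(model, text):
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        # Start from the log prior, then add one log likelihood per known feature.
        score = math.log(count / total)
        for feature in bag_of_words(text):
            if feature in vocabulary:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen (label, feature) pairs above zero.
                score += math.log((feature_counts[label][feature] + 1) / (count + 2))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data, invented for this sketch.
toy_samples = [
    ("what time is it", "question"),
    ("where are you from", "question"),
    ("how are you", "question"),
    ("i am fine", "statement"),
    ("it is late", "statement"),
    ("you are from here", "statement"),
]

model = train_naive_bayes(toy_samples)
print(classify(model, "what are you doing"))
print(classify(model, "i am late"))
```

With real data the same counting happens over thousands of chat posts and fifteen dialogue-act labels rather than two.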
Building on @PolkaDot's answer, I wrote a function that uses NLTK plus some custom code to get higher accuracy.
import nltk

posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
# 10% of the total data
size = int(len(featuresets) * 0.1)
# first 10% as test_set to check the accuracy, and the remaining 90% for training
train_set, test_set = featuresets[size:], featuresets[:size]
# train the classifier on the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)
# to check the accuracy - 0.67
# print(nltk.classify.accuracy(classifier, test_set))

question_types = ["whQuestion", "ynQuestion"]

def is_ques_using_nltk(ques):
    question_type = classifier.classify(dialogue_act_features(ques))
    return question_type in question_types
Then:
question_pattern = ["do i", "do you", "what", "who", "is it", "why", "would you", "how", "is there",
                    "are there", "is it so", "is this true", "to know", "is that true", "are we", "am i",
                    "question is", "tell me more", "can i", "can we", "tell me", "can you explain",
                    "question", "answer", "questions", "answers", "ask"]

helping_verbs = ["is", "am", "can", "are", "do", "does"]

# if the classifier says no, check with the custom pipeline and still mark it as a question on a match
def is_question(question):
    question = question.lower().strip()
    if not is_ques_using_nltk(question):
        is_ques = False
        # check if any of the patterns exist in the sentence
        for pattern in question_pattern:
            is_ques = pattern in question
            if is_ques:
                break

        # there could be multiple sentences, so split the text
        sentence_arr = question.split(".")
        for sentence in sentence_arr:
            if len(sentence.strip()):
                # if the sentence ends with ? or starts with any helping verb
                # word_tokenize will strip by default
                first_word = nltk.word_tokenize(sentence)[0]
                if sentence.endswith("?") or first_word in helping_verbs:
                    is_ques = True
                    break
        return is_ques
    else:
        return True
You only need to call the is_question function to check whether the sentence you pass in is a question.
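One thing to watch with the phrase list: a plain `in` test matches substrings, so "how" also fires inside "show". A rough sketch of a word-boundary variant follows; the shortened pattern list and both function names are made up for illustration and are not part of the answer's code.

```python
import re

# Shortened copy of the phrase list, for illustration only.
QUESTION_PATTERNS = ["do you", "what", "who", "why", "how", "is there", "can you explain"]


def has_pattern_substring(text):
    # Plain substring test, the same kind of check as `pattern in question`.
    text = text.lower()
    return any(pattern in text for pattern in QUESTION_PATTERNS)


# One compiled regex that only matches the phrases at word boundaries.
_PATTERN_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(pattern) for pattern in QUESTION_PATTERNS) + r")\b"
)


def has_pattern_word(text):
    return _PATTERN_RE.search(text.lower()) is not None


print(has_pattern_substring("The show starts at nine."))  # matches "how" inside "show"
print(has_pattern_word("The show starts at nine."))
print(has_pattern_word("How does this work"))
```

Whether the stricter match helps overall depends on your data, since the looser one may also be catching real questions.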
By simply applying a gradient boosting algorithm with the sklearn library, you can improve on PolkaDot's solution and reach about 86% accuracy. Here is how:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()

posts_text = [post.text for post in posts]

# split train and test 80/20, cutting at a single index so the two parts do not overlap
split = int(len(posts_text) * 0.8)
train_text = posts_text[:split]
test_text = posts_text[split:]

# get TF-IDF features
vectorizer = TfidfVectorizer(ngram_range=(1, 3),
                             min_df=0.001,
                             max_df=0.7,
                             analyzer='word')

X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

y = [post.get('class') for post in posts]

y_train = y[:split]
y_test = y[split:]

# fit a gradient boosting classifier to the training set
gb = GradientBoostingClassifier(n_estimators=400, random_state=0)
# can be improved with cross-validation
gb.fit(X_train, y_train)

predictions_rf = gb.predict(X_test)

# about 86% was reported with the original slicing, where the test part started at the 20% mark
# and so overlapped the training data; a disjoint split may score differently
print(classification_report(y_test, predictions_rf))
You can then use gb.predict(vectorizer.transform(['new sentence here'])) to make predictions on new data.
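When slicing a list into train and test parts by hand, it is easy to end up with two slices that share items, which inflates the measured accuracy. A small sketch of one way to avoid that is to compute the cut index once and reuse it; `split_train_test` is a hypothetical helper, not something from NLTK or sklearn.

```python
def split_train_test(items, train_fraction=0.8):
    # Compute the cut once so the two slices cannot overlap.
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Toy data standing in for the post texts and their labels.
texts = ["post {}".format(i) for i in range(10)]
labels = ["label {}".format(i) for i in range(10)]

train_text, test_text = split_train_test(texts)
y_train, y_test = split_train_test(labels)

print(len(train_text), len(test_text))
print(set(train_text) & set(test_text))  # empty set: no shared items
```

sklearn also ships `train_test_split` in `sklearn.model_selection`, which shuffles before splitting by default; the manual version above keeps the original order.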
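Adding to the previous answers: if your only task is to decide whether a given sentence is a question,
I would rather train a binary classifier. Preprocess the labels into binary labels first, then train the classifier.
This raises the accuracy of the trained classifier to 0.864: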
import nltk

nltk.download('nps_chat')
# tokenizer models needed by nltk.word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

def generate_binary_feature(label):
    # True for the labels treated as "question"; note that 'yAnswer' is a yes-answer
    # class in nps_chat rather than a question type, kept here as in the original answer
    return label in ['whQuestion', 'yAnswer', 'ynQuestion']

featuresets = [(dialogue_act_features(post.text), generate_binary_feature(post.get('class'))) for post in posts]
# 10% of the total data
size = int(len(featuresets) * 0.1)
# first 10% as test_set to check the accuracy, and the remaining 90% for training
train_set, test_set = featuresets[size:], featuresets[:size]
# train the classifier on the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)
# to check the accuracy
print(nltk.classify.accuracy(classifier, test_set))
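With binary labels, an accuracy figure is easier to judge next to a baseline that always predicts the more common class, since questions are probably the minority in chat data (an assumption, not something measured here). A small sketch, where `majority_baseline_accuracy` is a hypothetical helper and the label list is toy data:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    # Accuracy of a classifier that always predicts the most common label.
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Toy labels: 2 questions and 8 non-questions.
toy_labels = [True] * 2 + [False] * 8
print(majority_baseline_accuracy(toy_labels))
```

On the real data you could pass it `[label for _, label in test_set]` and compare the result with the classifier's accuracy.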