There is a Python library for feature selection in text, TextFeatureSelection. The library scores each token (word, bigram, trigram, etc.) by how well it discriminates between classes. Anyone familiar with feature selection in machine learning will recognize this as a filter method; it gives engineers working on NLP and deep learning models a tool to improve classification accuracy. It provides four methods for selecting vocabulary as features for a machine learning classifier: Chi-square, Mutual Information, Proportional Difference, and Information Gain.
from TextFeatureSelection import TextFeatureSelection
input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
target=['Positive','Positive','Negative','Neutral','Neutral']
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
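For intuition about what these filter scores measure, here is a hand-rolled sketch of one of them, Proportional Difference, in a common formulation over document frequencies: |pos − neg| / (pos + neg). This is an illustration on a toy corpus, not the library's internal code; getScore() computes this and the other three metrics for you.

```python
from collections import Counter

# Toy corpus (hypothetical), binary labels: 1 = positive, 0 = negative.
docs = ['good movie', 'good plot', 'bad movie']
labels = [1, 1, 0]

# Per-class document frequency: count each token once per document.
pos_df, neg_df = Counter(), Counter()
for doc, label in zip(docs, labels):
    (pos_df if label == 1 else neg_df).update(set(doc.split()))

def proportional_difference(token):
    """|pos - neg| / (pos + neg): 1.0 means the token occurs in only one
    class (highly discriminative), 0.0 means it is evenly split."""
    p, n = pos_df[token], neg_df[token]
    return abs(p - n) / (p + n)

print(proportional_difference('good'))   # occurs only in positive docs
print(proportional_difference('movie'))  # evenly split, carries no signal
```

A high-scoring token like 'good' is worth keeping as a feature; a token like 'movie' that appears equally in both classes can be dropped.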
Edit:

It now also has a genetic algorithm for feature selection.
from TextFeatureSelection import TextFeatureSelectionGA
getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)
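The GA returns a list of retained tokens. One way to put it to use is as a fixed vocabulary for a vectorizer, so only the selected tokens become features. A minimal sketch, assuming scikit-learn and using a hypothetical token list in place of real GA output:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the GA's best_vocabulary output.
best_vocabulary = ['happy', 'awesome', 'difficult']

# Passing the list as `vocabulary` restricts features to exactly those
# tokens, in that column order.
vectorizer = CountVectorizer(vocabulary=best_vocabulary)
X = vectorizer.fit_transform(['i am very happy',
                              'this is a very difficult terrain to trek'])
print(X.toarray())  # one column per selected token
```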
Edit 2:

There is now another method, TextFeatureSelectionEnsemble, which combines feature selection with ensemble learning. It performs feature selection for the base models via document-frequency thresholds; at the ensemble layer, it uses a genetic algorithm to identify the best combination of base models and keeps only those.
from TextFeatureSelection import TextFeatureSelectionEnsemble
import pandas as pd
from sklearn.preprocessing import LabelEncoder

imdb_data=pd.read_csv('../input/IMDB Dataset.csv')
le = LabelEncoder()
imdb_data['labels'] = le.fit_transform(imdb_data['sentiment'].values)
doc_list=imdb_data['review'].tolist()
label_list=imdb_data['labels'].tolist()
gaObj=TextFeatureSelectionEnsemble(doc_list,label_list,n_crossvalidation=2,pickle_path='/home/user/folder/',average='micro',base_model_list=['LogisticRegression','RandomForestClassifier','ExtraTreesClassifier','KNeighborsClassifier'])
best_columns=gaObj.doTFSE()
Check the project details here: https://pypi.org/project/TextFeatureSelection/