我想使用信息增益度量(scikit-learn中的相互信息)来识别DataFrame的前10个最佳特征,并将它们按照信息增益度量得分的升序顺序显示在表格中。
在这个例子中,
什么是使用互信息分数实现这一目标的正确语法?
在这个例子中,
features
是一个DataFrame,其中包含所有能够告诉我们餐厅是否会关闭的有趣训练数据。# Initialization of data and labels
x = features.copy () # "x" contains all training data
y = x ["closed"] # "y" contains the labels of the records in "x"
# Elimination of the class column (closed) of features
x = x.drop ('closed', axis = 1)
# this is x.columns, sorry for the mix french and english
features_columns = ['moyenne_etoiles', 'ville', 'zone', 'nb_restaurants_zone',
'zone_categories_intersection', 'ville_categories_intersection',
'nb_restaurant_meme_annee', 'ecart_type_etoiles', 'tendance_etoiles',
'nb_avis', 'nb_avis_favorables', 'nb_avis_defavorables',
'ratio_avis_favorables', 'ratio_avis_defavorables',
'nb_avis_favorables_mention', 'nb_avis_defavorables_mention',
'nb_avis_favorables_elites', 'nb_avis_defavorables_elites',
'nb_conseils', 'nb_conseils_compliment', 'nb_conseils_elites',
'nb_checkin', 'moyenne_checkin', 'annual_std', 'chaine',
'nb_heures_ouverture_semaine', 'ouvert_samedi', 'ouvert_dimanche',
'ouvert_lundi', 'ouvert_vendredi', 'emporter', 'livraison',
'bon_pour_groupes', 'bon_pour_enfants', 'reservation', 'prix',
'terrasse']
# normalization
std_scale = preprocessing.StandardScaler().fit(features[features_columns])
normalized_data = std_scale.transform(features[features_columns])
labels = np.array(features['closed'])
# split the data
train_features, test_features, train_labels, test_labels = train_test_split(normalized_data, labels, test_size = 0.2, random_state = 42)
labels_true = ?
labels_pred = ?
# I dont really know how to use this function to achieve what i want
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification
# Get the mutual information coefficients and convert them to a data frame
coeff_df =pd.DataFrame(features,
columns=['Coefficient'], index=x.columns)
coeff_df.head()
什么是使用互信息分数实现这一目标的正确语法?