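I am learning to use the chi-squared test for feature selection and came across code like this.
However, my understanding was that a higher chi-squared score means the feature is more independent (and therefore less useful to the model), so we should be interested in the features with the lowest scores. Yet scikit-learn's SelectKBest returns the features with the highest chi-squared scores. Is my understanding of how to use the chi-squared test incorrect? Or does the chi-squared score in sklearn produce something other than the chi-squared statistic?
See the code below for what I mean (most of it is taken from the link above, apart from the end).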
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd
import numpy as np
# Load iris data
iris = load_iris()
# Create features and target
X = iris.data
y = iris.target
# Convert to categorical data by converting data to integers
X = X.astype(int)
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)
# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(
    list(zip(iris.feature_names, chi2_selector.scores_, chi2_selector.pvalues_)),
    columns=['ftr', 'score', 'pval'],
)
chi2_scores
# The features kept by SelectKBest (per get_support())
# are the two with the _highest_ chi-squared scores
kbest = np.asarray(iris.feature_names)[chi2_selector.get_support()]
kbest
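
One way to check how the score relates to the p-value is to call `chi2` directly and compare the two side by side. The following is only a minimal sketch: it assumes the p-values come from a chi-squared distribution with `n_classes - 1` degrees of freedom (my assumption, not something stated in the code above), and the names `chi2_dist`, `order`, and `recomputed` are purely illustrative. For a fixed number of degrees of freedom, a larger chi-squared statistic gives a smaller p-value, so sorting by score in descending order should sort the p-values in ascending order.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

# Same data preparation as in the question
iris = load_iris()
X = iris.data.astype(int)
y = iris.target

# chi2 returns one score and one p-value per feature
scores, pvalues = chi2(X, y)

# Indices of the features, from highest to lowest score
order = np.argsort(scores)[::-1]
for i in order:
    print(f"{iris.feature_names[i]}: score={scores[i]:.3f}, pval={pvalues[i]:.3g}")

# Assumption: degrees of freedom = number of classes - 1
df = len(np.unique(y)) - 1

# Survival function of the chi-squared distribution: P(statistic >= score)
recomputed = chi2_dist.sf(scores, df)
print(np.allclose(recomputed, pvalues))
```

If the recomputed p-values match `pvalues`, the score behaves like an ordinary chi-squared statistic, where a high value is evidence against independence between the feature and the target rather than for it.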