Sklearn Chi2 for Feature Selection

Question

python machine-learning scikit-learn feature-selection chi-squared

I'm learning how to use the chi-squared test for feature selection and came across code like this.

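However, my understanding was that a higher chi-squared score means the feature is more independent (and therefore less useful to the model), so we would be interested in the features with the lowest scores. Yet scikit-learn's SelectKBest selector returns the features with the highest chi-squared scores. Is my understanding of the chi-squared test incorrect? Or does the chi-squared score in sklearn produce something other than a chi-squared statistic?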

See the code below for what I mean (most of it comes from the link above, except for the end).

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd
import numpy as np

# Load iris data
iris = load_iris()

# Create features and target
X = iris.data
y = iris.target

# Convert to categorical data by converting data to integers
X = X.astype(int)

# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)

# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(list(zip(iris.feature_names, chi2_selector.scores_, chi2_selector.pvalues_)), columns=['ftr', 'score', 'pval'])
chi2_scores

# The features returned by SelectKBest are the two
# with the _highest_ scores
kbest = np.asarray(iris.feature_names)[chi2_selector.get_support()]
kbest

- RSHAP

1 Answer

- jose_bacoy · Accepted Answer

Your understanding is reversed.

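The null hypothesis of the chi2 test is that "two categorical variables are independent". A higher chi2 statistic is therefore stronger evidence against independence: it means "the two categorical variables are dependent", which makes the feature MORE useful for classification.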
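To make the direction of the statistic concrete, here is a minimal pure-Python sketch of the Pearson chi-squared statistic, the sum over cells of (observed - expected)^2 / expected, where the expected count is what independence would predict. The function name `chi2_statistic` and both contingency tables are made up for illustration and are not taken from the question or from sklearn.

```python
def chi2_statistic(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are feature categories, columns are classes.
independent = [[20, 20], [30, 30]]  # same class mix in every row
associated = [[45, 5], [5, 45]]     # each row dominated by one class

stat_independent = chi2_statistic(independent)
stat_associated = chi2_statistic(associated)
print(stat_independent, stat_associated)
```

The table whose rows share the same class mix scores zero, while the table where the feature category nearly determines the class scores high. That is the sense in which a larger statistic points to a more informative feature.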

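SelectKBest gives you the best two features (k=2), ranked by the highest chi2 values. So you need to take the features it returns, rather than the "other features" from the chi2 selector.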

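You are correct to get the chi2 statistics from chi2_selector.scores_ and the best features from chi2_selector.get_support(). It will give you 'petal length (cm)' and 'petal width (cm)' as the top two features based on the chi2 test of independence. Hope this clarifies the algorithm.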
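As a quick check of the point above, the following sketch reuses the setup from the question and compares the indices of the two largest values in `scores_` with the indices that SelectKBest keeps. The variable names `top2_by_score`, `kept`, and `selected_names` are my own; it assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X = iris.data.astype(int)  # same integer conversion as in the question
y = iris.target

selector = SelectKBest(chi2, k=2).fit(X, y)

# Indices of the two largest chi-squared scores, computed independently
top2_by_score = {int(i) for i in np.argsort(selector.scores_)[-2:]}
# Indices of the features that SelectKBest actually kept
kept = {int(i) for i in selector.get_support(indices=True)}

selected_names = [iris.feature_names[i] for i in sorted(kept)]
print(top2_by_score == kept)
print(selected_names)
```

If the two index sets match, the selector is keeping the highest-scoring features, which is the behaviour described in this answer.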