Index 38 is out of bounds for axis 1 with size 38 - Sklearn

Question

python pandas scikit-learn

3

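I get this error when using the CategoricalNB algorithm from scikit-learn's Naive Bayes module.

The error shows up the second time I run the code. The first run finishes without any error, but when I change something (even just a comment) and rerun the notebook, I get: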

IndexError: index 38 is out of bounds for axis 1 with size 38

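I don't know what is going wrong or how to fix it. When I restart the kernel and run the code again, it works, but every attempt after that first one fails with the error above.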

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

dataframe = pd.read_csv("hr_dataset.csv")
# dataframe = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

dataframe.head(2)

from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# inputs = scaled_df
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)

categoricalNB_ = CategoricalNB()


categoricalNB_.fit(X_train, y_train)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

pred = categoricalNB_.predict(X_test)  # raises the IndexError on every attempt after the first

categoricalNB_.score(X_test, y_test)
# accuracy_score(y_test,pred)

- Escort Personal Adz

1

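Array indices start at 0, so an array with n elements has valid (non-negative) indices from 0 to n-1, inclusive. - norok2

Why does it work on the first attempt but not afterwards? - Escort Personal Adz

Check that your arrays have the correct shapes. - norok2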

Yes, they have the correct shapes: X_train and y_train have the same number of rows, and so do X_test and y_test. - Escort Personal Adz

@norok2 Here is my dataset; you can find it here: https://drive.google.com/open?id=19gWVwuXS3my9C77D9unG53tuivPzZdqJ - Escort Personal Adz

3 Answers

0

Try setting min_categories to some value; that helped me.

model = CategoricalNB(min_categories=10)

For your data, the right value may be something other than 10.
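
One way to pick that value, sketched here under assumptions: the features are already integer-encoded as 0, 1, 2, ..., the installed scikit-learn is 0.24 or newer (the release that added `min_categories`), and the arrays below are made-up toy data rather than the HR dataset. The idea is to derive one category count per feature from the full data before splitting, so that a value appearing only in the test split still has a probability column.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical, already integer-encoded data (categories 0, 1, 2, ...).
X_all = np.array([[0, 1], [1, 0], [2, 1], [1, 2], [3, 0]])
y_all = np.array([0, 1, 0, 1, 0])

# Suppose the only row with category 3 in column 0 ends up in the test split.
X_train, y_train = X_all[:4], y_all[:4]
X_test = X_all[4:]

# One entry per feature: number of categories over the whole dataset.
min_categories = X_all.max(axis=0) + 1

model = CategoricalNB(min_categories=min_categories)
model.fit(X_train, y_train)

# Category 3 was never seen during fit, but it now has a (smoothed) column.
pred = model.predict(X_test)
print(model.n_categories_, pred)
```

Passing an array gives each feature its own bound; a single integer such as 10 applies the same lower bound to every feature.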

- Sinba

0

Another quick, if somewhat hacky, way to work around this is to do the following before the train/test split:

import numpy as np

c = 10  # bucket width: values are rounded to the nearest multiple of c
data[:, i:j] = c * np.round(data[:, i:j] / c)

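This rounds every feature column from column i up to, but not including, column j to the nearest multiple of c, which is 10 here. The columns then contain fewer distinct values, so the test data is less likely to contain a value that never appears in the training set.

Of course, this assumes the columns hold integer values, but it can be adapted to real-valued columns as well. It may affect the model's performance.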
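
To make the effect concrete, here is a small sketch on a made-up integer array; `data`, `i`, and `j` are hypothetical stand-ins for the real feature matrix and column range.

```python
import numpy as np

# Hypothetical feature matrix with two integer columns.
data = np.array([[3, 38], [12, 7], [27, 21]])
i, j = 0, 2   # round columns 0 and 1
c = 10        # bucket width

# Round each selected value to the nearest multiple of c, in place.
data[:, i:j] = c * np.round(data[:, i:j] / c)

print(data.tolist())  # [[0, 40], [10, 10], [30, 20]]
```

The values stay on their original scale (38 becomes 40, not 4), so the fitted model still allocates one column per integer up to the largest rounded value.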

- scribe

- Daniel Reiser · Accepted Answer

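I think your problem is related to features that have different values in the training set and the test set.

I looked at your dataset and found that only one record has a total working years value of 38. If that record ends up only in the test set, the model fitted on the training set has no probability entry for the value 38, which leads to the out-of-bounds error.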
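
The mechanism can be sketched on toy data (not the HR dataset): the fitted model keeps one probability column per category value seen during `fit`, so training values 0 to 37 yield 38 columns with valid indices 0 to 37, and a test value of 38 points one past the end. The arrays below are assumptions made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical single feature: training rows cover the values 0..37,
# while the only row with value 38 lands in the test split.
X_train = np.arange(38).reshape(-1, 1)
y_train = np.arange(38) % 2          # two arbitrary classes
X_test = np.array([[38]])

model = CategoricalNB()
model.fit(X_train, y_train)

# One column per category seen during fit: valid indices are 0..37.
n_columns = model.feature_log_prob_[0].shape[1]

# True where a test value would index past the fitted table.
unseen = X_test[:, 0] >= n_columns

print(n_columns, unseen.tolist())
```

Because `train_test_split` shuffles randomly when no `random_state` is given, whether the rare record lands in the test split can differ from one run to the next.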

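You can work around this with the class_prior parameter (see the documentation for details), or by making sure that every category of every feature has at least a certain number of records.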
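
As a rough sketch of the second suggestion, the snippet below lists category values that occur fewer than a chosen number of times per column. The frame, its column names, and the `min_records` threshold are all hypothetical; with the real data one would run the same loop over the loaded dataframe before splitting.

```python
import pandas as pd

# Hypothetical frame; column names and values are made up for illustration.
df = pd.DataFrame({
    "TotalWorkingYears": [5, 5, 10, 10, 10, 38],
    "JobLevel": [1, 1, 2, 2, 1, 2],
})

min_records = 2  # assumed threshold; tune it for the real data

# Map each column to the values that appear fewer than min_records times.
rare = {}
for column in df.columns:
    counts = df[column].value_counts()
    rare_values = sorted(counts[counts < min_records].index.tolist())
    if rare_values:
        rare[column] = rare_values

print(rare)  # {'TotalWorkingYears': [38]}
```

Values flagged this way could then be merged into a neighbouring bucket or dropped, depending on what makes sense for the feature.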