Python sklearn - Determine the encoding order of LabelEncoder

Question

6

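I would like to control the labels that sklearn's LabelEncoder assigns (i.e. 0, 1, 2, 3, ...) so that they follow a specific order of the possible values of a categorical variable (say ['b', 'a', 'c', 'd']). As the following example shows, LabelEncoder assigns the labels in lexicographic order: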

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['b', 'a', 'c', 'd'])
print(le.classes_)               # ['a' 'b' 'c' 'd']
print(le.transform(['a', 'b']))  # [0 1]

How can I force the encoder to follow the order in which the values are first encountered in the .fit method (i.e. encode 'b' as 0, 'a' as 1, 'c' as 2, and 'd' as 3)?

- Amitai

I think you need OrdinalEncoder, which is described in detail at https://github.com/scikit-learn-contrib/categorical-encoding and http://contrib.scikit-learn.org/categorical-encoding/ordinal.html. - dgumo

4 Answers

2

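Note that there is now a better way to do this, using http://contrib.scikit-learn.org/categorical-encoding/ordinal.html. In particular, look at the mapping parameter:

a mapping of class to label to use for the encoding, optional. The dict contains the keys 'col' and 'mapping'. The value of 'col' should be the feature name. The value of 'mapping' should be a dictionary of 'original_label' to 'encoded_label'. Example mapping: [{'col': 'col1', 'mapping': {None: 0, 'a': 1, 'b': 2}}]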

- Vincent

This is the old way; it is no longer supported and the link is dead. - shantanu pathak

2

Note: this is not a standard approach but a hack. I used the classes_ attribute to customize my mapping.

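import numpy as np
from sklearn import preprocessing

# df_1 is assumed to be an existing DataFrame with a string column 'Temp'
le_temp = preprocessing.LabelEncoder()
le_temp = le_temp.fit(df_1['Temp'])
print(df_1['Temp'])
# Overwrite the fitted classes so that the position in this array defines the code
le_temp.classes_ = np.array(['Cool', 'Mild', 'Hot'])
print("New classes sequence::", le_temp.classes_)
df_1['Temp'] = le_temp.transform(df_1['Temp'])
print(df_1['Temp'])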

My output looks like:

1      Hot
2      Hot
3      Hot
4     Mild
5     Cool
6     Cool

Name: Temp, dtype: object
New classes sequence:: ['Cool' 'Mild' 'Hot']

1     2
2     2
3     2
4     1
5     0
6     0

Name: Temp, dtype: int32

- shantanu pathak
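
For comparison, here is a minimal sketch of a different technique that reaches the same codes without touching LabelEncoder internals: pandas.Categorical with an explicit category list. This is not the method used in the answer above. The sample data and the names temps and order are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data mirroring the 'Temp' column shown above
temps = pd.Series(['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool'])

# The position of each category in this list becomes its integer code
order = ['Cool', 'Mild', 'Hot']
cat = pd.Categorical(temps, categories=order, ordered=True)

# .codes holds the integer code of each value; values missing from `order` get -1
codes = pd.Series(cat.codes, index=temps.index)
print(codes.tolist())
```

With this approach the order is stated explicitly in one place, and unseen values show up as -1 instead of raising an error, which may or may not be what you want.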

1

Vivek Kumar's solution worked for me, but I had to do it this way:

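import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d

class LabelEncoder(LabelEncoder):

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        # ndarray.sort() sorts in place and returns None, so use np.sort() instead
        self.classes_ = np.sort(pd.Series(y).unique())
        return self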

- R. Márquez

1

The whole point of the question is not to sort the classes. That is why I chose not to do so in my answer. - Vivek Kumar

- Vivek Kumar · Accepted Answer

You cannot do that with the stock LabelEncoder. LabelEncoder.fit() uses numpy.unique, which always returns the data sorted, as shown in the source:

def fit(...):
    y = column_or_1d(y, warn=True)
    self.classes_ = np.unique(y)
    return self

So if you want to do this, you need to override the fit() function, like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d

class MyLabelEncoder(LabelEncoder):

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_ = pd.Series(y).unique()
        return self

Then you can do this:

le = MyLabelEncoder()
le.fit(['b', 'a', 'c', 'd' ])
le.classes_
#Output:  array(['b', 'a', 'c', 'd'], dtype=object)

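Here I am using pandas.Series.unique() to get the unique classes, which returns them in order of first appearance. If you cannot use pandas for any reason, refer to this question, which does the same with numpy: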

numpy unique without sort
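
The numpy-only route can be sketched roughly as follows. It relies on the return_index option of numpy.unique, which reports where each unique value first occurs. The helper name unique_in_order and the sample values are made up for illustration, and the final mapping is a plain dict rather than a LabelEncoder.

```python
import numpy as np

def unique_in_order(values):
    """Return the unique values of `values` in order of first appearance."""
    arr = np.asarray(values)
    # np.unique sorts the values, but return_index gives the position
    # of the first occurrence of each one in the original array
    _, first_idx = np.unique(arr, return_index=True)
    # Sorting those positions restores the original encounter order
    return arr[np.sort(first_idx)]

classes = unique_in_order(['b', 'a', 'c', 'd', 'a', 'b'])

# Build a label -> code mapping from the position of each class
codes = {label: code for code, label in enumerate(classes.tolist())}
encoded = [codes[v] for v in ['a', 'b', 'd']]
print(classes, encoded)
```

Assigning the result of unique_in_order(y) to self.classes_ in the overridden fit() would replace the pandas call, under the same assumptions as the subclass above.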