Using Scikit-Learn's OneHotEncoder with a Pandas DataFrame

33

I'm trying to use Scikit-Learn's OneHotEncoder to replace a column of strings in a Pandas DataFrame with its one-hot encoded equivalent. My code below doesn't work:

from sklearn.preprocessing import OneHotEncoder
# data is a Pandas DataFrame

jobs_encoder = OneHotEncoder()
jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

It produces the following error (the strings in the list are omitted):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-91-3a1f568322f5> in <module>()
      3 jobs_encoder = OneHotEncoder()
      4 jobs_encoder.fit(data['Profession'].unique().reshape(1, -1))
----> 5 data['Profession'] = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
    730                                        copy=True)
    731         else:
--> 732             return self._transform_new(X)
    733 
    734     def inverse_transform(self, X):

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform_new(self, X)
    678         """New implementation assuming categorical input"""
    679         # validation of X happens in _check_X called by _transform
--> 680         X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
    681 
    682         n_samples, n_features = X_int.shape

/usr/local/anaconda3/envs/ml/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
    120                     msg = ("Found unknown categories {0} in column {1}"
    121                            " during transform".format(diff, i))
--> 122                     raise ValueError(msg)
    123                 else:
    124                     # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories ['...', ..., '...'] in column 0 during transform

Here is some sample data:

data['Profession'] =

0         unkn
1         safe
2         rece
3         unkn
4         lead
          ... 
111988    indu
111989    seni
111990    mess
111991    seni
111992    proj
Name: Profession, Length: 111993, dtype: object

What exactly am I doing wrong?


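Please include the full error traceback, as well as a sample of your data['Profession']. - desertnaut
1
The one-hot encoder returns a 2-D array of size data_length x num_categories. You cannot assign that to the single column df['Profession']. - Quang Hoang
1
A follow-up on dd's answer: OneHotEncoder can handle multiple columns, whereas LabelBinarizer and LabelEncoder cannot. https://dev59.com/qlUL5IYBdhLWcg3wWG5I#54119850 - Novice
7 Answers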

44

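OneHotEncoder encodes categorical features as a one-hot numeric array. Its transform method returns a sparse matrix if sparse=True, and a 2-D array otherwise.

You can't cast a 2-D array (or a sparse matrix) into a single Pandas Series. You must create one Pandas Series (that is, one column in the Pandas DataFrame) per category.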
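To make the shape mismatch concrete, here is a minimal sketch. The toy DataFrame is an assumption standing in for the question's data; it only shows that the encoder output has one column per category rather than one column in total.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the real column has far more rows and categories.
data = pd.DataFrame({"Profession": ["unkn", "safe", "rece", "unkn"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(data[["Profession"]])

# By default the result is a SciPy sparse matrix, not a 1-D column.
is_sparse = sparse.issparse(encoded)

# One row per sample, one column per distinct category.
print(encoded.shape)

# A dense 2-D NumPy array, if that is easier to work with.
dense = encoded.toarray()
print(dense.shape)
```

Since the result has several columns, assigning it to the single column data['Profession'] cannot work as intended.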

I recommend using pandas.get_dummies instead:

data = pd.get_dummies(data,prefix=['Profession'], columns = ['Profession'], drop_first=True)

Edit:

Using Sklearn's OneHotEncoder:

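# transform() returns a sparse matrix by default, so convert it to a dense array
transformed = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1)).toarray()
# Create a Pandas DataFrame of the one-hot encoded column, aligned with the original index
ohe_df = pd.DataFrame(transformed, columns=jobs_encoder.get_feature_names_out(), index=data.index)
# Concatenate with the original data and drop the original column
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)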

Other options: if you are doing hyperparameter tuning with GridSearch, it is recommended to use ColumnTransformer and FeatureUnion with Pipeline, or to use make_column_transformer directly.


3
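I want to be able to store the fitted instance for later use on new data, which is why I want to use OneHotEncoder. That isn't possible with get_dummies, right? - dd.
1
Right. If you want to apply it to new data, you can't use get_dummies. - Abel Paz
The original code has a problem that carries over into this (correct) sklearn OneHotEncoder solution. The original code is jobs_encoder.fit(data['Profession'].unique().reshape(1, -1)), but it should be jobs_encoder.fit(data['Profession'].unique().reshape(-1, 1)). I found this while trying out the solution. - RVS
1
Use .get_feature_names_out() to get the feature names now. - QHarr
This answer mostly solved my problem. For the concat to work, you also need to align the new df's index with the original df's: ohe_df = pd.DataFrame(transformed, columns=jobs_encoder.get_feature_names_out(), index=data.index) - maccaroo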

22

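It turns out that Scikit-Learn's LabelBinarizer, combined with Amnie's solution, worked better for converting my data into a one-hot encoded format. My final code is as follows: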

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

jobs_encoder = LabelBinarizer()
jobs_encoder.fit(data['Profession'])
transformed = jobs_encoder.transform(data['Profession'])
ohe_df = pd.DataFrame(transformed)
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)

1
This loses the feature names, and it only works for a single column, because it is designed to be applied to the target variable. - Woodly0
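If LabelBinarizer is used anyway, the feature names can be recovered from its classes_ attribute. The following is only a sketch on assumed toy data; the "Profession_" prefix and the "Age" column are illustrative choices.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy data; "Age" is an invented extra column.
data = pd.DataFrame({
    "Profession": ["unkn", "safe", "rece", "unkn"],
    "Age": [31, 45, 27, 52],
})

jobs_encoder = LabelBinarizer()
transformed = jobs_encoder.fit_transform(data["Profession"])

# classes_ holds the sorted labels, one per output column (for 3+ classes).
columns = [f"Profession_{label}" for label in jobs_encoder.classes_]
ohe_df = pd.DataFrame(transformed, columns=columns, index=data.index)

result = pd.concat([data.drop(columns=["Profession"]), ohe_df], axis=1)
print(result)
```

Note that with exactly two classes LabelBinarizer returns a single column, so the naming above would need adjusting in that case.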

9
Here is an approach suggested by Kaggle Learn. I don't think there is currently a simpler way to go from the original pandas DataFrame to a one-hot encoded DataFrame.
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
numeric_X_train = X_train.drop(low_cardinality_cols, axis=1)
numeric_X_valid = X_valid.drop(low_cardinality_cols, axis=1)

# Add one-hot encoded columns to numerical features
new_X_train = pd.concat([numeric_X_train, OH_cols_train], axis=1)
new_X_valid = pd.concat([numeric_X_valid, OH_cols_valid], axis=1)
print(new_X_train)

2
Thanks to scikit-learn's new set_output API, it is now possible to get a DataFrame as output after applying OneHotEncoder.
from sklearn.preprocessing import OneHotEncoder
oh= OneHotEncoder(sparse_output=False).set_output(transform="pandas")
one_hot_encoded=oh.fit_transform(df[["Profession"]])
df = pd.concat([df,one_hot_encoded],axis=1).drop(columns=["Profession"])

See: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html

0

This should do it. If you are not interested in visualization, you can remove the plotly part. If you don't need markdown, you can also replace printmd with print.

def fn_cat_onehot(df):

    """Generate one-hot encoded features for all categorical columns in df"""

    printmd(f"df shape: {df.shape}")

    # NaN handling
    nan_count = df.isna().sum().sum()
    if nan_count > 0:
        printmd(f"NaN = **{nan_count}** will be categorized under feature_nan columns")

    # generation
    from sklearn.preprocessing import OneHotEncoder

    model_oh = OneHotEncoder(handle_unknown="ignore", sparse=False)
    for c in df.select_dtypes("category").columns:
        printmd(f"Encoding **{c}**")  # which column
        matrix = model_oh.fit_transform(
            df[[c]]
        )  # get a matrix of new features and values
        names = model_oh.get_feature_names_out()  # get names for these features
        df_oh = pd.DataFrame(
            data=matrix, columns=names, index=df.index
        )  # create df of these new features
        display(df_oh.plot.hist())
        df = pd.concat([df, df_oh], axis=1)  # concat with existing df
        df.drop(
            c, axis=1, inplace=True
        )  # drop categorical column so that it is all numerical for modelling

    printmd(f"#### New df shape: **{df.shape}**")
    return df

'Series' object has no attribute 'select_dtypes'. I get this error when I call fn_cat_onehot(df_train['property_type']). - Nishita
@Nishita, you need to pass it as a DataFrame rather than a Series. As a quick fix, try pd.DataFrame(series_name). - Indresh Kumar
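The difference between the two selections can be sketched as follows. The df_train contents are assumed for illustration; only the column name comes from the comment above.

```python
import pandas as pd

# Hypothetical stand-in for df_train with one categorical column.
df_train = pd.DataFrame(
    {"property_type": pd.Categorical(["flat", "house", "flat"])}
)

# Single brackets return a Series, which has no select_dtypes method.
as_series = df_train["property_type"]
print(hasattr(as_series, "select_dtypes"))

# Double brackets return a one-column DataFrame, which does.
as_frame = df_train[["property_type"]]
print(list(as_frame.select_dtypes("category").columns))

# Converting an existing Series works as well.
converted = as_series.to_frame()
```

Any of the DataFrame forms can then be passed to a function that calls select_dtypes on its argument.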

0

I wrapped @IndreshKumar's solution into an sklearn transformer:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class CategoricalOneHot(BaseEstimator, TransformerMixin):
    def __init__(self, list_key_words=None):
        self.oh_dict = {}
        self.list_key_words = list_key_words

    def fit(self, X, y=None):
        self.list_cat_col = []
        for key_word in self.list_key_words:
            self.list_cat_col += [col for col in X.columns if key_word in col]
        for col in self.list_cat_col:
            oh = OneHotEncoder(handle_unknown="ignore", sparse=False)
            oh.fit(X[[col]])
            names = oh.get_feature_names_out()
            self.oh_dict[col] = (oh, names)
        return self

    def transform(self, X):
        _X = X.copy()
        for col in self.list_cat_col:
            oh = self.oh_dict[col][0]
            df_oh = pd.DataFrame(
                data=oh.transform(_X[[col]]),
                columns=self.oh_dict[col][1],
                index=_X.index)
            _X = pd.concat([_X, df_oh], axis=1)
            _X.drop(col, axis=1, inplace=True)
        return _X

if __name__ == "__main__":
    tex = pd.DataFrame({'city': ['a', 'a', 'e', 'b'], 'state': ['f', 'c', 'd', 'd']})
    coh = CategoricalOneHot(list_key_words=['city', 'state'])
    print(coh.fit_transform(tex))

Example: given a DataFrame with two categorical columns:

  city state
0    a     f
1    a     c
2    e     d
3    b     d

The output is as follows:

   city_a  city_b  city_e  state_c  state_d  state_f
0     1.0     0.0     0.0      0.0      0.0      1.0
1     1.0     0.0     0.0      1.0      0.0      0.0
2     0.0     0.0     1.0      0.0      1.0      0.0
3     0.0     1.0     0.0      0.0      1.0      0.0

0
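I know this is old, but for anyone else who might need it, I found a simple solution at https://saturncloud.io/blog/pandas-vs-scikitlearn-onehot-encoding-dataframes/.
It's as simple as this (part of it is copied and pasted from the link, with a few tweaks):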
# create OneHotEncoder object
encoder = OneHotEncoder()

# fit and transform color column
one_hot_array = encoder.fit_transform(df[['color']]).toarray()

# create new dataframe from numpy array
one_hot_df = pd.DataFrame(one_hot_array, columns = encoder.get_feature_names_out(), index = df.index)

#concat with df
data = pd.concat([df, one_hot_df], axis=1).drop(['color'], axis=1)

Done!
