Scikit-Learn - One-hot encoding certain columns of a pandas dataframe

Question

python pandas scikit-learn one-hot-encoding

8

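I have a dataframe X with integer, float and string columns. I'd like to one-hot encode every column that is of "Object" type, and only those columns. So I'm trying to do this: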

encoding_needed = X.select_dtypes(include='object').columns
ohe = preprocessing.OneHotEncoder()
X[encoding_needed] = ohe.fit_transform(X[encoding_needed].astype(str)) #need astype bc I imputed with 0, so some rows have a mix of zeroes and strings.

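However, I end up with "IndexError: tuple index out of range". According to the documentation, the encoder expects X: array-like, shape [n_samples, n_features], so I should be able to pass a dataframe. How can I one-hot encode the columns specifically listed in encoding_needed?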

Edit:

The data is confidential, so I can't share it, and I can't create a dummy dataset with 123 columns.

I can provide the following:

X.shape: (40755, 123)
encoding_needed.shape: (81,) and is a subset of columns.

Full stack trace:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-90-6b3e9fdb6f91> in <module>()
      1 encoding_needed = X.select_dtypes(include='object').columns
      2 ohe = preprocessing.OneHotEncoder()
----> 3 X[encoding_needed] = ohe.fit_transform(X[encoding_needed].astype(str))

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3365             self._setitem_frame(key, value)
   3366         elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3367             self._setitem_array(key, value)
   3368         else:
   3369             # set column

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in _setitem_array(self, key, value)
   3393                 indexer = self.loc._convert_to_indexer(key, axis=1)
   3394                 self._check_setitem_copy()
-> 3395                 self.loc._setitem_with_indexer((slice(None), indexer), value)
   3396 
   3397     def _setitem_frame(self, key, value):

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
    592                     # GH 7551
    593                     value = np.array(value, dtype=object)
--> 594                     if len(labels) != value.shape[1]:
    595                         raise ValueError('Must have equal len keys and value '
    596                                          'when setting with an ndarray')

IndexError: tuple index out of range

- lte__

1

Please provide a sample of your data and the full error traceback, not just the last line. - G. Anderson

1

@G.Anderson, I updated the question. - lte__

3 Answers

1

Without seeing your data it's hard for me to find your error. Could you try pandas' get_dummies method?

pd.get_dummies(X[encoding_needed])

- cmxu

X[encoding_needed] = pd.get_dummies(X[encoding_needed]) results in ValueError: Columns must be same length as key. - lte__
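
Both errors reported so far look like shape mismatches in the assignment rather than problems with the encoding itself. The following is a minimal sketch on a small hypothetical frame (the column names and values are made up, since the real data is confidential). It assumes the default behaviour of OneHotEncoder, which returns a SciPy sparse matrix: pandas passes that through `np.array(value, dtype=object)`, as seen in the traceback, and the result has no second dimension. Likewise, get_dummies produces one column per category, so it has more columns than there are keys to assign to.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the confidential data. dtype=object is spelled
# out because newer pandas versions may infer a dedicated string dtype.
X = pd.DataFrame({
    "num": [0, 1, 2, 3],
    "string1": pd.Series(list("abca"), dtype=object),
    "string2": pd.Series(list("efgh"), dtype=object),
})
encoding_needed = X.select_dtypes(include="object").columns

# 1) OneHotEncoder returns a sparse matrix by default.
encoded = OneHotEncoder().fit_transform(X[encoding_needed])
print(sparse.issparse(encoded))

# pandas converts the assigned value like this (see the traceback above);
# a sparse matrix becomes a zero-dimensional object array, so shape[1] fails.
as_object = np.array(encoded, dtype=object)
print(as_object.shape)

# 2) get_dummies yields one column per category, not one per input column.
dummies = pd.get_dummies(X[encoding_needed])
print(len(encoding_needed), dummies.shape[1])
```

In both cases the encoded result is wider than the columns it is being assigned to, which suggests building a new frame (dropping the original columns and concatenating the encoded ones) instead of assigning in place.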

1

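If you run into "get_feature_names not found" when using OneHotEncoder, you can try the following:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

columns_encode = ['string1', 'string2']
encoder = OneHotEncoder()
df_X_enumeric = X.copy()

for col in columns_encode:
    # Encode one column at a time; [[col]] keeps the input two-dimensional.
    onehot = encoder.fit_transform(df_X_enumeric[[col]])
    # categories_ holds the fitted category values; prefix them with the
    # column name so that names from different columns cannot collide.
    feature_names = [f'{col}_{cat}' for cat in encoder.categories_[0]]
    # Reuse the original index so that concat aligns rows correctly.
    onehot_df = pd.DataFrame(onehot.toarray(), columns=feature_names,
                             index=df_X_enumeric.index)
    df_X_enumeric = pd.concat([df_X_enumeric, onehot_df], axis=1)

# Drop the original string columns once their encoded versions are in place.
df_X_enumeric.drop(columns_encode, axis=1, inplace=True)

The comparison of OneHotEncoder vs. get_dummies is also helpful.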

- msbeigi

- Erfan · Accepted Answer

import pandas as pd

# example data
X = pd.DataFrame({'int':[0,1,2,3],
                   'float':[4.0, 5.0, 6.0, 7.0],
                   'string1':list('abcd'),
                   'string2':list('efgh')})

   int  float string1 string2
0    0    4.0       a       e
1    1    5.0       b       f
2    2    6.0       c       g
3    3    7.0       d       h

Using `pandas`

Use pandas.get_dummies, which automatically selects your object columns and drops them while adding the one-hot encoded columns:

pd.get_dummies(X)

   int  float  string1_a  string1_b  string1_c  string1_d  string2_e  \
0    0    4.0          1          0          0          0          1   
1    1    5.0          0          1          0          0          0   
2    2    6.0          0          0          1          0          0   
3    3    7.0          0          0          0          1          0   

   string2_f  string2_g  string2_h  
0          0          0          0  
1          1          0          0  
2          0          1          0  
3          0          0          1
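
For the situation described in the question, where only some columns should be encoded and some of them mix an imputed 0 with strings, a possible variation is sketched below. It uses the `columns` and `dtype` parameters of `pd.get_dummies`. The toy frame and the imputed value are assumptions made for illustration, not the asker's data.

```python
import pandas as pd

# Hypothetical data: "string1" contains an imputed 0 next to real strings.
X = pd.DataFrame({
    "int": [0, 1, 2, 3],
    "float": [4.0, 5.0, 6.0, 7.0],
    "string1": pd.Series(["a", "b", 0, "a"], dtype=object),
    "string2": pd.Series(list("efgh"), dtype=object),
})
encoding_needed = list(X.select_dtypes(include="object").columns)

# Cast only the columns to be encoded to str, so that 0 becomes the
# category "0" instead of mixing integers and strings.
X_str = X.astype({col: str for col in encoding_needed})

# columns= limits the encoding to the chosen columns; dtype=int gives 0/1
# values. The result is a new frame, not an in-place assignment.
X_encoded = pd.get_dummies(X_str, columns=encoding_needed, dtype=int)
print(X_encoded.columns.tolist())
```

Because get_dummies returns a new frame, the result is bound to a new name rather than written back into `X[encoding_needed]`.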

Using `sklearn`

Here we need to specify explicitly that we only want the object columns:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

X_object = X.select_dtypes('object')
ohe.fit(X_object)

codes = ohe.transform(X_object).toarray()
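# scikit-learn >= 1.0; on older versions this method is called get_feature_names
feature_names = ohe.get_feature_names_out(['string1', 'string2'])

# Put the untouched numeric columns and the encoded columns side by side.
X = pd.concat([X.select_dtypes(exclude='object'),
               pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)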

   int  float  string1_a  string1_b  string1_c  string1_d  string2_e  \
0    0    4.0          1          0          0          0          1   
1    1    5.0          0          1          0          0          0   
2    2    6.0          0          0          1          0          0   
3    3    7.0          0          0          0          1          0   

   string2_f  string2_g  string2_h  
0          0          0          0  
1          1          0          0  
2          0          1          0  
3          0          0          1
