Encoding a categorical feature with multiple categories per sample using sklearn

Question

pandas machine-learning scikit-learn feature-extraction categorical-data

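I am working with a movie dataset that includes genre as a feature. A sample can belong to several genres at once, so each sample carries a list of genre labels.

The data looks like this: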

   movieId                                             genres
0        1  [Adventure, Animation, Children, Comedy, Fantasy]
1        2                     [Adventure, Children, Fantasy]
2        3                                  [Comedy, Romance]
3        4                           [Comedy, Drama, Romance]
4        5                                           [Comedy]
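
A minimal sketch for reproducing this sample, assuming the genres column holds plain Python lists of strings. The variable name `df` is an assumption, chosen to match the name used in the answer's code.

```python
import pandas as pd

# Rebuild the five sample rows; each genres cell is a Python list of strings.
df = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "genres": [
        ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
        ["Adventure", "Children", "Fantasy"],
        ["Comedy", "Romance"],
        ["Comedy", "Drama", "Romance"],
        ["Comedy"],
    ],
})

print(df)
```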

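I want to vectorize this feature. I have tried LabelEncoder and OneHotEncoder, but they do not seem to handle these lists directly.

I could vectorize it manually, but I have other similar features with too many categories. For those, I would prefer to use the FeatureHasher class directly in some way.

Is there a way to get these encoder classes to work on such a feature? Or is there a better way to represent such a feature that would make encoding easier? I would gladly welcome any suggestions.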

- H. Saxena

1 Answer

- Peter Leimbigler · Accepted Answer

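This SO question has some excellent answers. On your example data, the last answer by Teoretic (using sklearn.preprocessing.MultiLabelBinarizer) is 14x faster than the solution by Paulo Alves (and both are faster than the accepted answer!):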

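import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# One 0/1 indicator column per genre; mlb.classes_ holds the sorted genre names
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
# Put movieId back alongside the indicator columns
result = pd.concat([df['movieId'], encoded], axis=1)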

# Increase max columns to print the entire resulting DataFrame
pd.options.display.max_columns = 50
result
   movieId  Adventure  Animation  Children  Comedy  Drama  Fantasy  Romance
0        1          1          1         1       1      0        1        0
1        2          1          0         1       0      0        1        0
2        3          0          0         0       1      0        0        1
3        4          0          0         0       1      1        0        1
4        5          0          0         0       1      0        0        0
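
For the other features with too many categories, where the question mentions FeatureHasher, a minimal sketch follows. It is not part of the timing comparison above. The sample DataFrame is rebuilt so the block runs on its own, and `n_features=16` is an arbitrary size chosen for illustration. With `input_type="string"`, each sample is treated as an iterable of strings, so the genre lists can be passed as they are.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Sample data rebuilt from the question (the name df is an assumption).
df = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "genres": [
        ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
        ["Adventure", "Children", "Fantasy"],
        ["Comedy", "Romance"],
        ["Comedy", "Drama", "Romance"],
        ["Comedy"],
    ],
})

# n_features=16 is an illustrative choice, not a recommendation;
# a larger value lowers the chance of hash collisions.
hasher = FeatureHasher(n_features=16, input_type="string")

# transform() is stateless (no fit needed) and returns a SciPy sparse matrix
# with one row per sample and n_features columns.
hashed = hasher.transform(df["genres"].tolist())

print(hashed.shape)
print(hashed.toarray())
```

Unlike MultiLabelBinarizer, the hashed columns do not map back to genre names, and two labels can collide in the same column. The trade-off is a fixed output width no matter how many distinct categories appear.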