Convert array of indices to one-hot encoded array in NumPy

Question

python numpy machine-learning numpy-ndarray one-hot-encoding

354

Given a 1D array of indices:

a = array([1, 0, 3])

I want to one-hot encode this as a 2D array:

b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])

- James Atwood

22 Answers

273

>>> values = [1, 0, 3]
>>> n_values = np.max(values) + 1
>>> np.eye(n_values)[values]
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])

- K3---rnc

16

This is the only solution here that converts an N-D input matrix into an (N+1)-D one-hot matrix. Example: input_matrix=np.asarray([[0,1,1] , [1,1,2]]) ; np.eye(3)[input_matrix] # outputs a 3D tensor - Isaías

11

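+1 because this should be preferred over the accepted solution. For a more general solution, though, "values" should be a NumPy array rather than a Python list; then it works in all dimensions, not only in 1D. - Alex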

15

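Note that taking np.max(values) + 1 as the number of buckets may not be desirable if your data set is, say, randomly sampled and just by chance does not contain the maximum value. The number of buckets should rather be a parameter, with an assertion/check that each value lies between 0 (inclusive) and the bucket count (exclusive). - NightElfik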
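
The concern in the comment above can be illustrated with a minimal sketch. The helper name `one_hot_fixed` and its `num_classes` parameter are hypothetical, chosen only for this example; the point is that an explicit class count keeps the output width stable even when a sample happens to miss the largest class.

```python
import numpy as np

def one_hot_fixed(values, num_classes):
    """One-hot encode `values` with an explicit class count (hypothetical helper)."""
    values = np.asarray(values)
    # Reject anything outside [0, num_classes); negative indices would
    # otherwise wrap around silently when used to index np.eye.
    if values.min() < 0 or values.max() >= num_classes:
        raise ValueError("every value must satisfy 0 <= value < num_classes")
    return np.eye(num_classes, dtype=int)[values]

sample = [1, 0, 2]  # class 3 exists but is absent from this sample

inferred = np.eye(np.max(sample) + 1)[sample]    # width inferred from the data
explicit = one_hot_fixed(sample, num_classes=4)  # width fixed by the caller

print(inferred.shape)  # (3, 3)
print(explicit.shape)  # (3, 4)
```

With the inferred width, two samples from the same data set could end up with different numbers of columns, which is usually not what a downstream model expects.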

3

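To me this solution is the best, and it generalizes easily to any tensor: def one_hot(x, depth=10): return np.eye(depth)[x]. Note that using the tensor x as an index returns a tensor of eye rows arranged in the shape of x. - cecconeurale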

9

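An easy way to understand this N-D solution and why it works, without reading the numpy docs: at each position in the original matrix (values) we have an integer k, and we "put" the one-hot vector eye(n)[k] at that position. This adds a dimension because we are "putting" a vector where the original matrix had a scalar. - avivr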
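
A short sketch of that explanation, reusing the 2x3 input from the earlier comment (the variable names `labels` and `encoded` are only illustrative): each scalar is replaced by a row of the identity matrix, so the result gains one trailing axis.

```python
import numpy as np

labels = np.array([[0, 1, 1],
                   [1, 1, 2]])           # shape (2, 3)

# Integer-array indexing picks one row of eye(3) per entry of `labels`.
encoded = np.eye(3, dtype=int)[labels]   # shape (2, 3, 3)

print(encoded.shape)   # (2, 3, 3)
print(encoded[1, 2])   # one-hot row for the label 2 -> [0 0 1]

# Taking argmax over the new last axis recovers the original labels.
print(np.array_equal(encoded.argmax(axis=-1), labels))  # True
```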

2

For those wondering, a benchmark shows that this code is slightly slower than the accepted answer (https://dev59.com/l10a5IYBdhLWcg3w48Ln#29831596). - Alexandre Huat

58

If you are using keras, there is a built-in utility for that:

from keras.utils import to_categorical  # older Keras releases: from keras.utils.np_utils import to_categorical

categorical_labels = to_categorical(int_labels, num_classes=3)

It does pretty much the same as @YXD's answer (see the source code).

- Jodo

54

Here is what I find useful:

def one_hot(a, num_classes):
  return np.squeeze(np.eye(num_classes)[a.reshape(-1)])

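Here num_classes stands for the number of classes you have. So if a is a vector of shape (10000,), this function transforms it to (10000, C). Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) gives [[1, 0], [0, 1]].

I believe this is exactly what you wanted.

PS: the source is Sequence models - deeplearning.ai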

- D.Samchuk

2

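Also, why use np.squeeze() at all? np.eye(num_classes)[a.reshape(-1)] already gives a one-hot encoded array with one row per element of a. All you are doing is creating a diagonal matrix with np.eye, in which each class index is 1 and the rest are 0, and then using the indices provided by a.reshape(-1) to pick the matching rows of np.eye(). I don't see the need for np.squeeze, since it only removes dimensions of size one, which we will never have because the output always has shape (a_flattened_size, num_classes). - Anu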
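
To check what np.squeeze actually does here, the function from the answer can be run on a few shapes. This is only a quick experiment, not a statement about the author's intent: for inputs with more than one element squeeze is a no-op, but for a single-element input it does drop the length-1 row axis.

```python
import numpy as np

def one_hot(a, num_classes):
    # Same body as in the answer above.
    return np.squeeze(np.eye(num_classes)[a.reshape(-1)])

several = np.array([0, 1, 2])
single = np.array([2])

print(one_hot(several, 3).shape)              # (3, 3): squeeze changes nothing
print(np.eye(3)[single.reshape(-1)].shape)    # (1, 3): without squeeze
print(one_hot(single, 3).shape)               # (3,):   squeeze removed the row axis
```

Whether the (3,) result is desirable depends on the caller; code that always expects a 2D array may prefer to omit the squeeze.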

46

You can also use the eye function of numpy:

numpy.eye(number of classes)[vector containing the labels]

- Karma

13

For more clarity, using np.identity(num_classes)[indices] might be better. Nice answer! - Oliver

1

That is the only absolutely Pythonic answer in all its brevity. - Maksym Ganenko

3

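This repeats K3---rnc's answer two years later, and nobody seems to have noticed. - questionto42

Also consider reshaping the vector containing the labels: numpy.eye(num_class)[labels.reshape(-1)]. For example, if the labels have shape (x, 1), the result then has shape (x, num_class) rather than (x, 1, num_class). - Péter Szilvási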
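
The effect of the reshape is easy to see on a small column vector. The names `labels`, `direct` and `flat` below are illustrative only.

```python
import numpy as np

num_class = 4
labels = np.array([[1], [0], [3]])             # column vector, shape (3, 1)

direct = np.eye(num_class)[labels]             # keeps the extra axis
flat = np.eye(num_class)[labels.reshape(-1)]   # flattens the labels first

print(direct.shape)  # (3, 1, 4)
print(flat.shape)    # (3, 4)
```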

31

You can use sklearn.preprocessing.LabelBinarizer:

Example:

import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))

Output:

[[0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]]

Among other things, you can initialize sklearn.preprocessing.LabelBinarizer() so that the output of transform is sparse.

- Franck Dernoncourt

8

For one-hot encoding:

   one_hot_encode=pandas.get_dummies(array)

For example

Enjoy coding

- Shubham Mishra

1

Thanks for the comment, but a brief description of what the code is doing would be very helpful! - Clarus

Please refer to the example. - Shubham Mishra

@Clarus Check out the example below. You can access the one-hot encoding of each value in your np array with one_hot_encode[value].

>>> import numpy as np
>>> import pandas
>>> a = np.array([1,0,3])
>>> one_hot_encode = pandas.get_dummies(a)
>>> print(one_hot_encode)
   0  1  3
0  0  1  0
1  1  0  0
2  0  0  1
>>> print(one_hot_encode[1])
0    1
1    0
2    0
Name: 1, dtype: uint8
>>> print(one_hot_encode[0])
0    0
1    1
2    0
Name: 0, dtype: uint8
>>> print(one_hot_encode[3])
0    0
1    0
2    1
Name: 3, dtype: uint8

- Deepak

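Not the ideal tool. - PigSpider

Welcome to Stack Overflow. It is generally best to make answers self-contained, i.e. to copy the example into your answer rather than only linking to it. - Hugh Perkins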

6

Here is a function that converts a 1-D vector to a 2-D one-hot array.

#!/usr/bin/env python
import numpy as np

def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.

    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print(one_hot_v)

        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """

    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0

    if num_classes is None:
        num_classes = np.max(vector)+1
    else:
        assert num_classes > 0
        assert num_classes > np.max(vector)

    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)

Here is some example usage:

>>> a = np.array([1, 0, 3])

>>> convertToOneHot(a)
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])

>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])

- stackoverflowuser2010

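Note that this works only on vectors (and there is no "assert" to check the vector shape ;) ). - johndodo

1

+1 for the generalized approach and the parameter checks. However, as common practice, I suggest not using asserts to check inputs. Use asserts only to verify internal intermediate conditions. Instead, convert every assert ___ into if not ___: raise Exception(<Reason>). - fnunnari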
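
As a rough sketch of what the last comment suggests, the input checks can be written with explicit exceptions instead of asserts. The function name `convert_to_one_hot`, the exception types, and the messages are choices made for this example, not part of the original answer; the shape check that the previous comment found missing is included as well.

```python
import numpy as np

def convert_to_one_hot(vector, num_classes=None):
    """1-D integer vector -> 2-D one-hot array, validating input with exceptions."""
    vector = np.asarray(vector)
    if vector.ndim != 1:
        raise ValueError(f"expected a 1-D vector, got {vector.ndim} dimensions")
    if vector.size == 0:
        raise ValueError("vector must not be empty")
    if not np.issubdtype(vector.dtype, np.integer):
        raise TypeError("vector must contain integers")
    if vector.min() < 0:
        raise ValueError("class indices must be non-negative")

    if num_classes is None:
        num_classes = int(vector.max()) + 1
    elif num_classes <= vector.max():
        raise ValueError("num_classes must be greater than the largest index")

    result = np.zeros((vector.size, num_classes), dtype=int)
    result[np.arange(vector.size), vector] = 1  # one 1 per row
    return result

print(convert_to_one_hot([1, 0, 3]))
```

Unlike asserts, these checks remain active when Python runs with the -O flag, which strips assert statements.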

5

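You can use the following code to convert it into a one-hot vector:

Let x be the normal class vector, having a single column with classes from 0 to some number: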

import numpy as np
np.eye(x.max()+1)[x]

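If 0 is not a class (the labels run from 1 to n), remove the +1 and shift the labels down by one as well: np.eye(x.max())[x - 1].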

- Inaam Ilahi

3

This repeats K3---rnc's answer three years later. - questionto42

2

To elaborate on the excellent answer from K3---rnc, here is a more generic version:

def onehottify(x, n=None, dtype=float):
    """1-hot encode x with the max value n (computed from data if n is None)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]

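Also, here is a quick-and-dirty benchmark of this method against the one from the currently accepted answer by YXD (slightly modified so that both offer the same API, except that the latter works only with 1D ndarrays):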

def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b

The latter method is about 35% faster (MacBook Pro 13 2015), but the former is more general:

>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
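
The %timeit magic is only available in IPython. For a plain Python script, a comparable measurement can be sketched with the standard timeit module. The function names `onehot_eye` and `onehot_zeros` are stand-ins for the two approaches above, and the repetition count is arbitrary; absolute numbers will vary by machine and NumPy version.

```python
import timeit
import numpy as np

def onehot_eye(x, n):
    # Row lookup in an identity matrix.
    return np.eye(n)[x]

def onehot_zeros(x, n):
    # Zero array plus one assignment per row (1-D input only).
    b = np.zeros((len(x), n))
    b[np.arange(len(x)), x] = 1
    return b

rng = np.random.default_rng(42)
a = rng.integers(0, 10, size=10_000)

# Both approaches should agree before their speed is compared.
assert np.array_equal(onehot_eye(a, 10), onehot_zeros(a, 10))

repeats = 100
for fn in (onehot_eye, onehot_zeros):
    seconds = timeit.timeit(lambda: fn(a, 10), number=repeats)
    print(f"{fn.__name__}: {seconds / repeats * 1e6:.1f} us per call")
```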

- Emil Melnikov


- YXD · Accepted Answer

Create a zeroed array b with enough columns, i.e. a.max() + 1.
Then, for each row i, set the a[i]th column to 1.

>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max() + 1))
>>> b[np.arange(a.size), a] = 1

>>> b
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])
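
The same "zeros, then set one entry per position" idea can be extended beyond 1-D input. The sketch below is one possible way to do that with np.put_along_axis; the function name `one_hot_zeros` and its parameters are hypothetical and not part of the accepted answer.

```python
import numpy as np

def one_hot_zeros(a, num_classes=None, dtype=float):
    """Zero-filled array with a trailing class axis; one entry per position set to 1."""
    a = np.asarray(a)
    if num_classes is None:
        num_classes = int(a.max()) + 1
    b = np.zeros(a.shape + (num_classes,), dtype=dtype)
    # For every position in `a`, write a 1 at index a[...] along the last axis.
    np.put_along_axis(b, a[..., np.newaxis], 1, axis=-1)
    return b

print(one_hot_zeros([1, 0, 3]))                        # same values as b above
print(one_hot_zeros([[0, 1, 1], [1, 1, 2]]).shape)     # (2, 3, 3)
```

For 1-D input this produces the same array as the b[np.arange(a.size), a] = 1 assignment; for higher-dimensional input it matches what np.eye(n)[a] returns.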