NumPy: converting a categorical string array to an integer array

Question

30

I am trying to convert an array of string-typed categorical variables into an array of integer-typed categorical variables.

For example:

import numpy as np
a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
print a.dtype
>>> |S1

b = np.unique(a)
print b
>>>  ['a' 'b' 'c']

c = a.desired_function(b)
print c, c.dtype
>>> [1,2,3,1,2,3] int32

I realize this can be done with a loop, but I imagine there is an easier way. Thanks.

- wroscoe

9 Answers

36

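...years later...

For completeness (since none of the answers mention it) and for personal reasons (I always import pandas in my modules, but not necessarily sklearn), this is also very simple with pandas.get_dummies().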

import numpy as np
import pandas

In [1]: a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

In [2]: b = pandas.get_dummies(a)

In [3]: b
Out[3]: 
      a  b  c
   0  1  0  0
   1  0  1  0
   2  0  0  1
   3  1  0  0
   4  0  1  0
   5  0  0  1

In [4]: b.values.argmax(1)
Out[4]: array([0, 1, 2, 0, 1, 2])

- benjaminmgross

1

Thanks. Finally found the answer I was looking for. - SeeTheC

18

One way is to use the categorical function from scikits.statsmodels. For example:

In [60]: from scikits.statsmodels.tools import categorical

In [61]: a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])

In [62]: b = categorical(a, drop=True)

In [63]: b.argmax(1)
Out[63]: array([0, 1, 2, 0, 1, 2])

The return value of categorical is actually a design matrix, hence the call to argmax above to bring it close to the desired format.

In [64]: b
Out[64]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

- ars
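The design-matrix idea can also be sketched with NumPy alone, for readers who have neither statsmodels nor pandas installed. This is only an illustration under the assumption that broadcasting a comparison against the sorted unique labels is acceptable for the array sizes involved; the names `labels`, `design`, and `codes` are made up for this example.

```python
import numpy as np

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
labels = np.unique(a)  # sorted distinct labels

# Compare every element against every label: result has shape (len(a), len(labels)).
design = (a[:, None] == labels[None, :]).astype(float)

# Each row contains exactly one 1, so argmax along axis 1 recovers the column index.
codes = design.argmax(axis=1)
print(design)
print(codes)
```

The intermediate matrix grows as len(a) times the number of categories, so this is mainly useful when the one-hot matrix itself is wanted.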

5

Another way is to use sklearn.preprocessing.LabelEncoder.

It converts hashable labels (such as strings) into numeric values between 0 and n_classes-1.

This is done as follows:

# Repeating setup from the question to make example copy/paste-able
import numpy as np
a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
b = np.unique(a)

# Answer to the question
from sklearn import preprocessing
pre = preprocessing.LabelEncoder()
pre.fit(b)
c = pre.transform(a)

print(c)    # Prints [0 1 2 0 1 2]

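If you insist on the values in the resulting array starting from 1, you can simply do "c + 1" afterwards.

Adding sklearn as a project dependency just for this may not be worth it, but it is a good option if you already import sklearn.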

- Tim Skov Jacobsen

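How do we know that 'a' is '0', and so on? Is there code that returns this? - bib

@bib: I believe a new running number/index is assigned each time a new string is encountered while traversing the array from left to right. So 'a' is 0 because it is the first string seen. - Tim Skov Jacobsen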
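To make the label-to-code mapping explicit rather than infer it, one option is to build a lookup table yourself. The sketch below uses only NumPy and assumes codes are assigned by position among the sorted unique labels; `label_to_code` is a hypothetical name introduced here. As far as I know, a fitted LabelEncoder keeps its labels in the `classes_` attribute, and a label's code is its position there.

```python
import numpy as np

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

# np.unique returns the distinct labels in sorted order.
labels = np.unique(a).tolist()

# Hypothetical lookup table: label -> integer code (position in sorted order).
label_to_code = {label: code for code, label in enumerate(labels)}

# Encode by looking each element up in the table.
codes = np.array([label_to_code[x] for x in a.tolist()])
print(label_to_code)
print(codes)
```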

5

Another option is to use a categorical Pandas Series:

>>> import pandas as pd
>>> pd.Series(['a', 'b', 'c', 'a', 'b', 'c'], dtype="category").cat.codes.values

array([0, 1, 2, 0, 1, 2], dtype=int8)

- Gregor Sturm

2

Another way is to use Pandas' factorize to map the items to numbers:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
In [4]: a_enc = pd.factorize(a)
In [5]: a_enc[0]
Out[5]: array([0, 1, 2, 0, 1, 2])
In [6]: a_enc[1]
Out[6]: array(['a', 'b', 'c'], dtype=object)

- tomp
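By default pd.factorize numbers labels in order of first appearance, whereas np.unique sorts them; the two agree in the example above only because the input is already sorted. If first-appearance numbering is wanted without pandas, it can be approximated with np.unique as sketched below. `factorize_first_seen` is a hypothetical helper written for this illustration, not a NumPy or pandas function.

```python
import numpy as np

def factorize_first_seen(values):
    """Hypothetical helper: integer codes numbered by order of first appearance."""
    values = np.asarray(values)
    uniq, first_idx, inverse = np.unique(
        values, return_index=True, return_inverse=True
    )
    # Positions of the sorted unique labels, reordered by where each first occurs.
    order = np.argsort(first_idx)
    # rank[i] is the first-seen rank of uniq[i].
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[inverse], uniq[order]

codes, uniques = factorize_first_seen(['b', 'a', 'b', 'c'])
print(codes)
print(uniques)
```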

1

...a few years later...

For completeness, here is a pure Python solution:

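def count_unique(a):
    # Assign each distinct item the next integer, in order of first appearance.
    def counter(item, c=[0], items={}):
        if item not in items:
            items[item] = c[0]
            c[0] += 1
        return items[item]
    # list() is needed on Python 3, where map returns a lazy iterator.
    return list(map(counter, a))

a = [0, 2, 6, 0, 2]
print(count_unique(a))
>> [0, 1, 2, 0, 1]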

- kezzos

1

Well, this is a hack... but does it help?

In [72]: c=(a.view(np.ubyte)-96).astype('int32')

In [73]: print(c,c.dtype)
(array([1, 2, 3, 1, 2, 3]), dtype('int32'))

- unutbu

19

You really want to add a caveat that this only works for strings of length 1. - smci

0

You can also try this:

a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])
a[a == 'a'] = 1
a[a == 'b'] = 2
a[a == 'c'] = 3
a = a.astype(np.float32)

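This is preferable if you know the contents and want to assign a specific index to each value.

If there are only two categories, the code below works like a charm:

a = np.array( ['a', 'b', 'a', 'b'])
a = np.float32(a == 'a')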

- user8659363
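When specific codes are wanted for many categories, writing one assignment per value gets tedious. A possible alternative is a lookup array indexed by the inverse from np.unique, as sketched here. The dictionary `custom_codes` and its values are invented for illustration, and the sketch assumes every label in the array has an entry in it.

```python
import numpy as np

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

# Hypothetical user-chosen code for each known label.
custom_codes = {'a': 10, 'b': 20, 'c': 30}

labels, inverse = np.unique(a, return_inverse=True)

# One entry per unique label, in the same (sorted) order as `labels`.
lookup = np.array([custom_codes[label] for label in labels.tolist()],
                  dtype=np.float32)

# Index the lookup array with the inverse to encode the whole array at once.
encoded = lookup[inverse]
print(encoded)
```

Unlike the in-place assignments, this leaves the original string array untouched.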


- Josef · Accepted Answer

np.unique has some optional return values.

return_inverse gives the integer encoding, which I use very often.

>>> b, c = np.unique(a, return_inverse=True)
>>> b
array(['a', 'b', 'c'], 
      dtype='|S1')
>>> c
array([0, 1, 2, 0, 1, 2])
>>> c+1
array([1, 2, 3, 1, 2, 3])

It can be used to recreate the original array from the unique values:

>>> b[c]
array(['a', 'b', 'c', 'a', 'b', 'c'], 
      dtype='|S1')
>>> (b[c] == a).all()
True
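The return_inverse approach can be wrapped in a small function when the same encoding is needed in several places. The sketch below is one way to do that; `encode_labels` and its `start` parameter are hypothetical names chosen for this example, with `start=1` reproducing the 1-based codes from the question.

```python
import numpy as np

def encode_labels(values, start=0):
    """Hypothetical helper: return (sorted unique labels, integer codes)."""
    labels, codes = np.unique(np.asarray(values), return_inverse=True)
    return labels, codes + start

labels, codes = encode_labels(['low', 'high', 'mid', 'low'], start=1)
print(labels)  # ['high' 'low' 'mid']
print(codes)   # [2 1 3 2]

# Undo the offset to index back into the labels and recover the input.
decoded = labels[codes - 1]
print(decoded)
```

Codes follow the sorted order of the labels, so 'high' comes before 'low' here regardless of where each first appears in the input.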