How to one-hot-encode sentences at the character level?

Question

How to one-hot-encode sentences at the character level?

3

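I would like to convert a sentence into an array of one-hot vectors, where each vector is the one-hot representation of a letter of the alphabet. It would look like the following: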

"hello" # h=7, e=4 l=11 o=14

would become

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Unfortunately, OneHotEncoder from sklearn does not accept strings as input.

- user6903745

What have you tried so far? Show us some code! - Klaus D.

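Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation. The "on topic" and "how to ask" guidelines apply here as well. StackOverflow is not a design, coding, research, or tutorial service. - Prune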

That said, look up the documentation for the chr and ord functions. - Prune
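
To illustrate the chr/ord hint, here is a minimal sketch that assumes lowercase ASCII input only. ord turns a character into its code point and chr turns a code point back into a character, so one is enough to encode and the other to decode. The helper names encode_char and decode_row are made up for this illustration and do not come from the thread.

```python
def encode_char(ch):
    """Return a 26-element one-hot list for a lowercase ASCII letter."""
    idx = ord(ch) - ord('a')  # 'a' -> 0, 'b' -> 1, ..., 'z' -> 25
    if not 0 <= idx < 26:
        raise ValueError(f"not a lowercase ASCII letter: {ch!r}")
    return [1 if i == idx else 0 for i in range(26)]


def decode_row(row):
    """Recover the letter from a one-hot row via chr()."""
    return chr(row.index(1) + ord('a'))


encoded = [encode_char(c) for c in "hello"]
decoded = "".join(decode_row(r) for r in encoded)
print(decoded)  # hello
```

Raising on anything outside a-z is only one possible choice; skipping such characters or giving them a column of their own would work just as well.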

So far I have tried the following (applied to every sentence in the corpus), but I would like to know whether a simpler solution exists.

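import string
from sklearn.preprocessing import label_binarize
alphabet = string.ascii_lowercase  # assumed definition; not shown in the original snippet
sentence_chars = [c for c in sentence.lower() if c in alphabet]
ohv = label_binarize(sentence_chars, classes=list(alphabet))
ohv = ohv.astype(bool)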

- user6903745

1

The OneHotEncoder in sklearn has now been merged with CategoricalEncoder, so you can now use sklearn.preprocessing.OneHotEncoder(categories="auto"). (This is the default representation for sequential models such as LSTMs.) https://github.com/scikit-learn/scikit-learn/blob/e27242a62d18425886e540c213da044f209d43a8/sklearn/preprocessing/_encoders.py#L106 - devssh
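
A rough sketch of what that comment describes, assuming a scikit-learn release in which OneHotEncoder accepts string input and the categories argument. Here the categories are fixed to the 26 lowercase letters rather than "auto"; that is a choice made for this example so that the output always has 26 columns.

```python
import string
from sklearn.preprocessing import OneHotEncoder

# One feature (the character), with the 26 lowercase letters as its fixed categories.
encoder = OneHotEncoder(categories=[list(string.ascii_lowercase)])

# The encoder expects 2-D input: one row per character, one column.
chars = [[c] for c in "hello"]
one_hot = encoder.fit_transform(chars).toarray()  # dense array of shape (5, 26)

print(one_hot.shape)                    # (5, 26)
print(one_hot.argmax(axis=1).tolist())  # [7, 4, 11, 11, 14]
```

With categories="auto" the columns would instead be derived from the characters seen during fitting, so "hello" alone would yield only four columns (e, h, l, o).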

5 Answers

9

This is a common task with recurrent neural networks, and tensorflow has a function specifically for this purpose, if you would like to use it.

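import numpy as np
import tensorflow as tf

alphabets = {'a' : 0, 'b': 1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9, 'k':10, 'l':11, 'm':12, 'n':13, 'o':14}

idxs = [alphabets[ch] for ch in 'hello']
print(idxs)
# [7, 4, 11, 11, 14]

# @divakar's approach (np.fromstring is deprecated for binary data, so np.frombuffer on bytes is used here)
idxs = np.frombuffer(b"hello", dtype=np.uint8) - 97

# or, to make the intent clearer, use:
idxs = np.frombuffer(b'hello', dtype=np.uint8) - ord('a')

one_hot = tf.one_hot(idxs, 26, dtype=tf.uint8)
# TensorFlow 1.x API: eval() needs an active session; in 2.x (eager mode) call one_hot.numpy() instead
sess = tf.InteractiveSession()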

In [15]: one_hot.eval()
Out[15]: 
array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

- kmario23

Why do we subtract 97? I found that without it the one_hot function does not work properly. - gregoruar

@gregoruar Because 97 is the ASCII code at which the lowercase letters start (i.e. the code for a). See this page for more details: https://theasciicode.com.ar/ascii-printable-characters/minus-sign-hyphen-ascii-code-45.html - kmario23
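
The arithmetic behind that explanation can be checked in plain Python. This is only an illustrative sketch; the variable names raw_codes and offsets are made up for it.

```python
word = "hello"

raw_codes = [ord(c) for c in word]           # ASCII codes of the characters
offsets = [ord(c) - ord('a') for c in word]  # positions within a..z

print(ord('a'))   # 97
print(raw_codes)  # [104, 101, 108, 108, 111]
print(offsets)    # [7, 4, 11, 11, 14]

# Every raw code is >= 26, so none of them names a column of a 26-wide vector.
print(all(code >= 26 for code in raw_codes))  # True
```

Only after the subtraction do the indices fall into the range 0 to 25 that a depth of 26 can represent.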

3

With pandas, you can use pd.get_dummies by passing it a categorical Series:

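import pandas as pd
import string
low = string.ascii_lowercase

s = 'hello'
# Series.astype('category', categories=...) is no longer supported by pandas; use a CategoricalDtype instead
cat_type = pd.CategoricalDtype(categories=list(low))
pd.get_dummies(pd.Series(list(s)).astype(cat_type), dtype=int)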
Out: 
   a  b  c  d  e  f  g  h  i  j ...  q  r  s  t  u  v  w  x  y  z
0  0  0  0  0  0  0  0  1  0  0 ...  0  0  0  0  0  0  0  0  0  0
1  0  0  0  0  1  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
2  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
3  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
4  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0

[5 rows x 26 columns]

- ayhan

3

Here is a vectorized approach using NumPy broadcasting that produces an array of shape (N, 26).

import numpy as np

ints = np.frombuffer(b"hello", dtype=np.uint8) - 97  # np.fromstring is deprecated for binary data
out = (ints[:,None] == np.arange(26)).astype(int)

If performance matters, I would suggest initializing an array of zeros and then assigning into it.

out = np.zeros((len(ints),26),dtype=int)
out[np.arange(len(ints)), ints] = 1

Sample run -

In [153]: ints = np.frombuffer(b"hello", dtype=np.uint8) - 97

In [154]: ints
Out[154]: array([ 7,  4, 11, 11, 14], dtype=uint8)

In [155]: out = (ints[:,None] == np.arange(26)).astype(int)

In [156]: print(out)
[[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]

- Divakar

2

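You asked about "sentences", but your example contains only a single word, so I am not sure how you want to handle spaces. For single words, though, your example can be implemented with the following code: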

def onehot(ltr):
    # one column per lowercase ASCII letter: ord('a') == 97, ord('z') == 122
    return [1 if i == ord(ltr) else 0 for i in range(97, 123)]

def onehotvec(s):
    return [onehot(c) for c in s.lower()]

onehotvec("hello")
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
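
For full sentences, one possible convention (an assumption made here, since the question does not say how spaces should be handled) is to give the space its own 27th column and to drop every other non-letter. A minimal sketch; the names ALPHABET, INDEX and onehot_sentence are hypothetical:

```python
import string

# Assumption: the alphabet is a-z plus a space, giving 27 columns.
ALPHABET = string.ascii_lowercase + " "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def onehot_sentence(sentence):
    """One-hot encode each character; characters outside ALPHABET are dropped."""
    rows = []
    for ch in sentence.lower():
        if ch not in INDEX:
            continue  # e.g. punctuation or digits
        row = [0] * len(ALPHABET)
        row[INDEX[ch]] = 1
        rows.append(row)
    return rows


vectors = onehot_sentence("Hi, you")
print(len(vectors), len(vectors[0]))      # 6 27
print([row.index(1) for row in vectors])  # [7, 8, 26, 24, 14, 20]
```

Dropping the space instead, as the filter in the question's own snippet does, keeps the vectors 26 wide but loses the word boundaries.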

- MassPikeMike


- blacksite · Accepted Answer

Simply compare the letters of the passed string against a given alphabet:

import string

def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    # one row per letter of strng, one column per character of alphabet
    vector = [[0 if char != letter else 1 for char in alphabet]
              for letter in strng]
    return vector

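Note that with a custom alphabet (e.g. "defbcazk"), the columns are ordered according to where each character appears in that alphabet, not alphabetically. The output of string_vectorizer('hello'):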

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
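
To make the remark about custom alphabets concrete, here is a small sketch using the same comparison with the alphabet "defbcazk". The function is repeated so that the snippet runs on its own, and the input word "bad" is chosen purely for illustration.

```python
def string_vectorizer(strng, alphabet):
    # one row per letter of strng, one column per character of alphabet
    return [[1 if char == letter else 0 for char in alphabet] for letter in strng]


custom = "defbcazk"
for letter, row in zip("bad", string_vectorizer("bad", custom)):
    print(letter, row)
# b [0, 0, 0, 1, 0, 0, 0, 0]
# a [0, 0, 0, 0, 0, 1, 0, 0]
# d [1, 0, 0, 0, 0, 0, 0, 0]

# 'h' is not in the custom alphabet, so its row contains no 1 at all
print(string_vectorizer("h", custom))  # [[0, 0, 0, 0, 0, 0, 0, 0]]
```

The column for b is the fourth one because b is the fourth character of "defbcazk". A letter missing from the alphabet silently becomes an all-zero row, which may or may not be what you want.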