How to convert a predicted sequence back to text in Keras?

I have a sequence-to-sequence learning model which works well and predicts some output. The problem is I have no idea how to convert the output back to a text sequence.

Here is my code:

from keras.preprocessing.text import Tokenizer,base_filter
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense

txt1="""What makes this problem difficult is that the sequences can vary in length,
be comprised of a very large vocabulary of input symbols and may require the model 
to learn the long term context or dependencies between symbols in the input sequence."""

#txt1 is used for fitting 
tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
tk.fit_on_texts(txt1)

#convert text to sequence
t= tk.texts_to_sequences(txt1)

#padding to feed the sequence to keras model
t=pad_sequences(t, maxlen=10)

model = Sequential()
model.add(Dense(10,input_dim=10))
model.add(Dense(10,activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

#predicting new sequence
pred=model.predict(t)

#Convert predicted sequence to text
pred=??

Still no answers? - Ben Usman
@BenUsman, did you ever find a solution to this? I am facing the same problem. - TVH7
@TVH7, see the posted answers. - Ben Usman
@Eka Perhaps you should accept an answer to close the thread. - Esben Eickhardt
5 Answers

You can use the inverse function tokenizer.sequences_to_texts directly:
    text = tokenizer.sequences_to_texts(<list_of_integer_equivalent_encodings>)

I have tested this and it works as expected.

PS: Take special care to pass it the list of integer-encoded sequences, not the one-hot encodings.
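
For reference, here is a minimal, self-contained sketch of the full round trip (the sample text and variable names are illustrative, not from the question; it assumes a Keras version whose Tokenizer provides sequences_to_texts):

from keras.preprocessing.text import Tokenizer

# Illustrative corpus; any list of strings works
texts = ['the cat sat on the mat']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Encode to integer sequences, then decode straight back to text
seqs = tokenizer.texts_to_sequences(texts)    # e.g. [[1, 2, 3, 4, 1, 5]]
decoded = tokenizer.sequences_to_texts(seqs)  # ['the cat sat on the mat']
print(decoded)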


This seems to be the most straightforward answer; if you want to see it in action, try the following: print(tokenizer.sequences_to_texts([[1]])) - GoTrained
Before running sequences_to_texts on it, make sure you remove the padding encodings (i.e., the value used for padding) and the encodings of booleans from the <list-of-integer-equivalent-encodings>. - Divya Dass

Here is a solution I found:

reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
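
Building on that one-liner, a short hedged sketch of how the map might be used to decode model output (predicted_sequence is a hypothetical list of integer indices; index 0 is reserved by Keras for padding and never appears in word_index, so it is skipped):

# Hypothetical model output as integer indices (0 = padding)
predicted_sequence = [0, 0, 1, 2, 3]

# Look up each index in the reverse map, skipping padding
decoded = ' '.join(reverse_word_map[i] for i in predicted_sequence if i != 0)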

I ran into the same problem; this is how I eventually solved it (inspired by @Ben Usman's reversed dictionary).

# Importing library
from keras.preprocessing.text import Tokenizer

# My texts
texts = ['These are two crazy sentences', 'that I want to convert back and forth']

# Creating a tokenizer
tokenizer = Tokenizer(lower=True)

# Building word indices
tokenizer.fit_on_texts(texts)

# Tokenizing sentences
sentences = tokenizer.texts_to_sequences(texts)

>sentences
>[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11, 12, 13]]

# Creating a reverse dictionary
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

# Function takes a tokenized sentence and returns the words
def sequence_to_text(list_of_indices):
    # Looking up words in dictionary
    words = [reverse_word_map.get(letter) for letter in list_of_indices]
    return(words)

# Creating texts 
my_texts = list(map(sequence_to_text, sentences))

>my_texts
>[['these', 'are', 'two', 'crazy', 'sentences'], ['that', 'i', 'want', 'to', 'convert', 'back', 'and', 'forth']]

Here is an alternative one-liner for reversing word_index: reverse_word_index = dict([(value, key) for (key, value) in word_index.items()]) - Ozkan Serttas

You can create a dictionary that maps the indices back to characters:
index_word = {v: k for k, v in tk.word_index.items()} # map back
seqs = tk.texts_to_sequences(txt1)
words = []
for seq in seqs:
    if len(seq):
        words.append(index_word.get(seq[0]))
    else:
        words.append(' ')
print(''.join(words)) # output

>>> 'what makes this problem difficult is that the sequences can vary in length  
>>> be comprised of a very large vocabulary of input symbols and may require the model  
>>> to learn the long term context or dependencies between symbols in the input sequence '

However, in your question you are trying to use a sequence of characters to predict an output over 10 classes, which is not a sequence-to-sequence model. In that case you cannot simply map the prediction (or pred.argmax(axis=1)) back to a character sequence.
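
To illustrate the distinction, a hedged sketch: if the model instead emitted a softmax over the vocabulary at every timestep (an assumed output of shape (samples, timesteps, vocab_size), not the question's actual model), the predictions could be decoded with the same reverse map:

import numpy as np

# Assumed prediction shape: (samples, timesteps, vocab_size)
pred_ids = pred.argmax(axis=-1)  # most likely token index per timestep

# Index 0 is padding and has no entry in index_word, so it is dropped
decoded = [''.join(index_word.get(i, '') for i in seq if i != 0)
           for seq in pred_ids]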


import numpy as np

# Most likely class for each test example
p_test = model.predict(data_test).argmax(axis=1)

# Show some misclassified examples
misclassified_idx = np.where(p_test != Ytest)[0]
print(len(misclassified_idx))

# Pick one misclassified example at random and inspect it
i = np.random.choice(misclassified_idx)
print(i)
print(df_test[i])
print('True label %s Predicted label %s' % (Ytest[i], p_test[i]))

df_test is the original text; data_test is the sequence of integers.

Please make sure to describe the code you post. - finnmglas
