Keras pad_sequences throws "invalid literal for int() with base 10" error

Question

Traceback (most recent call last):
    File ".\keras_test.py", line 62, in <module>
        X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
    File "C:\Program Files\Python36\lib\site-packages\keras\preprocessing\sequence.py", line 69, in pad_sequences
        trunc = np.asarray(trunc, dtype=dtype)
    File "C:\Program Files\Python36\lib\site-packages\numpy\core\numeric.py", line 531, in asarray
        return array(a, dtype, copy=False, order=order)
ValueError: invalid literal for int() with base 10: "plus 've added commercials experience tacky"

Hello. I get the error above when I try to use Keras's pad_sequences function. X_train is a sequence of strings, and "plus 've added commercials experience tacky" is the first of those strings.

- doofesohr

1 Answer

- Daniel Möller · Accepted Answer

The pad_sequences function has a default dtype of 'int32':

keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', 
                                           padding='pre', truncating='pre', value=0.)

The data you are passing is strings, so the conversion to int32 fails.
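
To see where the message comes from, here is a minimal sketch in plain Python. It only assumes that the conversion to an integer dtype ends up doing the equivalent of int() on the string, which is what the wording of the ValueError in the traceback suggests. The variable names are mine, not from the question.

```python
# The first review from the question, still a raw string.
review = "plus 've added commercials experience tacky"

try:
    # Text that is not a number cannot be parsed as a base-10 integer.
    int(review)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The printed message starts with the same "invalid literal for int() with base 10" text as the traceback, which points at the input data rather than at pad_sequences itself.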

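Besides, you cannot use strings in a Keras model.

You must "tokenize" these strings. Even if you expected the function to pad strings, you would still have to decide which character to pad with:

Spaces? But a space may be a meaningful character.
Null characters? That is the best idea, but how do you make a string longer with null characters?
What if you are working with words instead of characters, where each token/ID has a different string length?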

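That is why you must create a dictionary of integer ID values, one for each character or word that exists in your data, and convert all of your strings into lists of IDs.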
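
As a rough sketch of that dictionary step, assuming whitespace-separated words are the tokens: build_vocab and encode are hypothetical helper names, the second text is made up for illustration, and the IDs are simply assigned in order of first appearance.

```python
def build_vocab(texts, first_id=1):
    """Map each distinct whitespace-separated word to an integer ID.

    IDs start at first_id, so 0 stays free to be used as the padding value.
    """
    vocab = {}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab) + first_id
    return vocab


def encode(text, vocab):
    """Convert one string into its list of word IDs."""
    return [vocab[word] for word in text.split()]


# First string is from the question; the second is an invented example.
texts = ["plus 've added commercials experience tacky", "added experience"]
vocab = build_vocab(texts)
encoded = [encode(text, vocab) for text in texts]

print(vocab)
print(encoded)  # [[1, 2, 3, 4, 5, 6], [3, 5]]
```

Lists of integers like these are what pad_sequences expects. Keras also ships a Tokenizer class in keras.preprocessing.text (fit_on_texts, texts_to_sequences) that covers the same job, so a hand-written dictionary is only one option.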

After that, you will probably benefit from starting your model with an Embedding layer.

For example, if you are using word IDs:

Word 0: null word
Word 1: end of sentence
Word 2: space character (maybe not important to some languages)    
Word 3: a
Word 4: added
Word 5: am    
Word 6: and
....
Word 520: plus
Word 2014: 've
etc.

Then your sentence would be a list containing: [520, 2014, 4, ....]
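
Once every sentence is a list of IDs, padding is well defined, because 0 (the null word) can fill the gaps. Below is a small hand-rolled sketch of what the default settings in the signature above (padding='pre', truncating='pre', value 0) amount to. pad_pre is a hypothetical stand-in written for illustration, not the Keras implementation, and it assumes maxlen is at least 1. The second and third ID lists are invented.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each ID list to exactly maxlen items."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # 'pre' truncation: drop items from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


id_lists = [
    [520, 2014, 4],         # "plus 've added", IDs taken from the list above
    [4],                    # invented short sentence
    [3, 4, 5, 6, 520, 4],   # invented sentence longer than maxlen
]
padded = pad_pre(id_lists, maxlen=5)

for row in padded:
    print(row)
# [0, 0, 520, 2014, 4]
# [0, 0, 0, 0, 4]
# [4, 5, 6, 520, 4]
```

With the real library, the equivalent call would be sequence.pad_sequences(id_lists, maxlen=5), which returns a NumPy int32 array instead of nested lists.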