ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - error when tokenizing with BERT / DistilBERT

Question

Tags: tokenize, bert-language-model, huggingface-transformers, huggingface-tokenizers, distilbert

41

import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

def split_data(path):
  # Load the CSV and hold out 10% of the rows as the test set
  df = pd.read_csv(path)
  return train_test_split(df, test_size=0.1, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()

# Carve a validation set out of the training data
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.1, random_state=100)

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

When I tokenize the texts split out of the dataframe with the BERT tokenizer, I get the error shown in the title.

- Raoof Naushad

1

The cause is that the tokenizer is trying to tokenize something that is not a string, most likely because None or some other non-string object was passed to it. - Aayush Neupane
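To check whether this is what is happening, one option is to scan the list for entries that are not str before calling the tokenizer. The following is a minimal sketch; the helper name find_non_strings and the sample data are made up for illustration and are not part of transformers.

```python
def find_non_strings(texts):
    """Return (index, value) pairs for every entry that is not a str."""
    return [(i, t) for i, t in enumerate(texts) if not isinstance(t, str)]


# Hypothetical sample: missing values read from a CSV usually show up
# as None or float('nan') once the column is turned into a list.
sample_texts = ["great movie", None, float("nan"), 42, "terrible plot"]

bad_entries = find_non_strings(sample_texts)
for index, value in bad_entries:
    print(f"entry {index} is {type(value).__name__}: {value!r}")
```

If the returned list is empty, every entry is a string and the error probably has a different cause.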

5 Answers

15

In my case I had to set is_split_into_words=True.

https://huggingface.co/transformers/main_classes/tokenizer.html

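The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).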
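As a rough illustration of the distinction the documentation draws, the sketch below tells a batch of raw strings apart from a batch of pre-tokenized word lists. The helper is_pretokenized_batch is a hypothetical name, not a transformers function, and it only inspects the Python types of the input.

```python
def is_pretokenized_batch(batch):
    """True if batch is a non-empty list whose items are all lists of str."""
    return (
        isinstance(batch, list)
        and len(batch) > 0
        and all(
            isinstance(seq, list) and all(isinstance(word, str) for word in seq)
            for seq in batch
        )
    )


raw_batch = ["hello world", "good morning"]             # one string per example
word_batch = [["hello", "world"], ["good", "morning"]]  # already split into words

print(is_pretokenized_batch(raw_batch))
print(is_pretokenized_batch(word_batch))
```

The result could then be passed on, for example tokenizer(batch, is_split_into_words=is_pretokenized_batch(batch), truncation=True, padding=True).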

- Ahmad

1

Can confirm, this solved the problem in my case as well. - Timbus Calin

6

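Similar to MarkusOdenthal, I had a non-string type in my list. I fixed it by converting the column to string and then to a list, before splitting it into train and test parts. So you would do:

train_texts = train['text'].astype(str).to_list()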

- Msalman

How would I do this if I need to encode the items in a list? - user

0

In the tokenizer, the first text argument must be a str, for example: train_encodings = tokenizer(str(train_texts), truncation=True, padding=True)

- mazyar fanaeipour

0

import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

def split_data(path):
  # Load the CSV and hold out 20% of the rows as the test set
  df = pd.read_csv(path)
  return train_test_split(df, test_size=0.2, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()

# Carve a validation set out of the training data
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=100)

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

Try changing the size of the split. It will work. This suggests the split data was not sufficient for the tokenizer to tokenize.

- Raoof Naushad

Does "train_texts" just need to be a list of strings? - Evan Zamir

- MarkusOdenthal · Accepted Answer

I had the same error. The problem was that my list contained None, e.g.:

import pandas as pd
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-german-cased')

# create test dataframe (the last text is deliberately None)
texts = ['Vero Moda Damen Übergangsmantel Kurzmantel Chic Business Coatigan SALE',
         'Neu Herren Damen Sportschuhe Sneaker Turnschuhe Freizeit 1975 Schuhe Gr. 36-46',
         'KOMBI-ANGEBOT Zuckerpaste STRONG / SOFT / ZUBEHÖR -Sugaring Wachs Haarentfernung',
         None]

labels = [1, 2, 3, 1]

d = {'texts': texts, 'labels': labels}
test_df = pd.DataFrame(d)

So before converting the dataframe column to a list, I removed all rows containing None.

test_df = test_df.dropna()
texts = test_df["texts"].tolist()
texts_encodings = tokenizer(texts, truncation=True, padding=True)

This worked for me.
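When the texts and labels have already been pulled out into plain Python lists, as in the question, the same cleanup can be done without a dataframe. The sketch below drops missing texts together with their labels so the two lists stay aligned; is_missing and drop_missing_pairs are hypothetical helper names, and treating float NaN as missing is an assumption about how the data was loaded.

```python
import math


def is_missing(value):
    """True for None and for float NaN (how pandas usually marks missing cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing_pairs(texts, labels):
    """Drop every (text, label) pair whose text is missing; cast the rest to str."""
    kept = [(t, l) for t, l in zip(texts, labels) if not is_missing(t)]
    clean_texts = [str(t) for t, _ in kept]
    clean_labels = [l for _, l in kept]
    return clean_texts, clean_labels


# Hypothetical sample data
texts = ["first text", None, float("nan"), 7]
labels = [1, 2, 3, 4]

clean_texts, clean_labels = drop_missing_pairs(texts, labels)
print(clean_texts, clean_labels)
```

The cleaned list can then be passed to the tokenizer in the same way as texts in the answer above.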