"ValueError: too many dimensions 'str'" error when classifying text with BERT

Question

python tensor text-classification bert-language-model mlp

15

I am trying to classify text sentiment with a BERT model, but I get the error ValueError: too many dimensions 'str'.

This is the DataFrame holding the training label values; these are the train_labels.

0   notr
1   notr
2   notr
3   negative
4   notr
... ...
854 positive
855 notr
856 notr
857 notr
858 positive

Here is the code that produces the error:

train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

The error is raised at train_y = torch.tensor(train_labels.tolist()): ValueError: too many dimensions 'str'

Could you help me solve this?

- KazımTibetSar

LabelEncoder from scikit-learn can also be used. See: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html - farshad madani
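For reference, a minimal sketch of what the LabelEncoder suggestion might look like. The sample list is hypothetical and only stands in for train_labels.tolist(); it is not the asker's actual data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample standing in for train_labels.tolist()
labels = ["notr", "notr", "negative", "positive", "notr"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # NumPy array of integer class ids

# classes_ holds the distinct labels in sorted order; position == class id
print(encoder.classes_.tolist())  # ['negative', 'notr', 'positive']
print(encoded.tolist())           # [1, 1, 0, 2, 1]

# Map the ids back to the original strings when needed
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())           # ['notr', 'notr', 'negative', 'positive', 'notr']
```

The integer array can then be handed to torch.tensor in place of the string list.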

6 Answers

11

I ran into the same problem, and this worked for me. I guess you need to run it early in your code, right after reading the CSV: df['labels'] = df['labels'].replace(['negative','notr','positive'],[0,1,2])

You can then split the dataset into training and test sets using those labels.

- mojimoji

You can also write it as: df['labels'] = df['labels'].replace({'negative':0, 'notr':1, 'positive':2}). - Loich
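One caveat with replace is that any label not listed stays a string, so the same error could come back later. A possible variant, sketched here with Series.map plus an explicit check, fails early instead. The small frame is hypothetical and stands in for the one read from the CSV.

```python
import pandas as pd

# Hypothetical frame standing in for the one read from the CSV
df = pd.DataFrame({"labels": ["notr", "negative", "positive", "notr"]})

label2id = {"negative": 0, "notr": 1, "positive": 2}

# map() yields NaN for any value that is missing from the dictionary
mapped = df["labels"].map(label2id)
unmapped = df.loc[mapped.isna(), "labels"].unique().tolist()
if unmapped:
    raise ValueError(f"labels without an integer id: {unmapped}")

df["labels"] = mapped.astype("int64")
print(df["labels"].tolist())  # [1, 0, 2, 1]
```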

3

Assuming you are using Hugging Face, you need to use the datasets library.

from datasets import ClassLabel

c2l = ClassLabel(num_classes=2, names=['spam', 'ham'])

labels = ["spam", "ham", "ham"]

[c2l.str2int(label) for label in labels ]
# [0, 1, 1]

For more details, see: https://discuss.huggingface.co/t/converting-string-label-to-int/2816

- NpnSaddy

0

Thanks, it does convert them to integers, but there is a problem with the classification;

0
0   positive
1   negative
2   positive
3   notr
4   positive
... ...
4002    notr
4003    positive
4004    positive
4005    notr
4006    negative

The frame contains that data; after converting to integers,

0   0
1   1
2   2
3   3
4   4
... ...
4002    4002
4003    4003
4004    4004
4005    4005
4006    4006

it turned into this. What I need is for all positive, neutral, and negative labels to be represented as 0, -1, and -2.

- KazımTibetSar
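The output 0, 1, 2, ... 4006 suggests that each row received its own integer (its position) rather than each class receiving one. A minimal sketch of a per-class lookup follows. The sample list is hypothetical, and the ids 0, 1, 2 are an assumption: if the labels feed a loss such as torch.nn.CrossEntropyLoss, it expects class indices from 0 to C-1, so negative ids like -1 and -2 would likely not work there.

```python
# Explicit class-to-id mapping; this particular ordering is an assumption
label2id = {"negative": 0, "notr": 1, "positive": 2}

# Hypothetical sample standing in for the label column
labels = ["positive", "negative", "positive", "notr", "positive"]

# Look up each row's class instead of using the row's position
int_labels = [label2id[label] for label in labels]
print(int_labels)  # [2, 0, 2, 1, 2]

# There are only as many distinct ids as there are classes
assert set(int_labels) <= set(label2id.values())
```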

0

Replace the label categories with numeric values to avoid "too many dimensions 'str'".

data['labels'] = data['labels'].replace(['inattention to results', 'fear of conflict', 'lack of commitment',
       'avoidance of accountability', 'absence of trust'],[0,1,2,3,4])

- Javaid Iqbal

0

You cannot convert a list of strings to a Torch tensor.

You need to convert the strings to integers or floats before doing the conversion.

import torch

# my_list has strings in it
my_list = ['0','1','2','3','4']

# Items are strings
type(my_list[0])                    
# > str

# Fail to convert to Torch Tensor 
# torch.tensor(my_list)               
# > ValueError: too many dimensions 'str'

# Convert each item to integer
my_list = [int(item) for item in my_list]

# Now, items are integers
type(my_list[0])                    
# > int

# Success
torch.tensor(my_list)                  
# > tensor([0, 1, 2, 3, 4])

- Geoffroy de Viaris
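Note that int() only helps when the strings are numeric. With labels such as 'notr' it raises its own ValueError, so a label-to-id mapping is needed instead. A small sketch of that distinction (to_int_labels is a hypothetical helper, not part of any library):

```python
def to_int_labels(items):
    """Convert numeric strings to ints; collect anything that is not numeric."""
    converted, rejected = [], []
    for item in items:
        try:
            converted.append(int(item))
        except ValueError:
            # int("notr") raises ValueError because it is not a numeric literal
            rejected.append(item)
    return converted, rejected

print(to_int_labels(["0", "1", "2"]))       # ([0, 1, 2], [])
print(to_int_labels(["notr", "positive"]))  # ([], ['notr', 'positive'])
```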

- coderina · Accepted Answer

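Cause

The problem is that you are passing a list of strings (str) to torch.tensor(), which only accepts lists of numeric values (integers, floats, etc.).

Solution

So I suggest converting the string labels to integer values before passing them to torch.tensor().

Implementation

The following code may help: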

# a temporary list to store the string labels
temp_list = train_labels.tolist()

# dictionary that maps integer to its string value 
label_dict = {}

# list to store integer labels 
int_labels = []

for i in range(len(temp_list)):
    label_dict[i] = temp_list[i]
    int_labels.append(i)

Now pass int_labels to torch.tensor and use it as the labels.

train_y = torch.tensor(int_labels)

Whenever you want to look up the string label for a given integer, just use the label_dict dictionary.
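As reported elsewhere in the thread, the loop above gives every row its own integer (0 up to the number of rows minus one), which is usually not what a classifier needs. A possible per-class variant, sketched under the assumption that one id per distinct label is wanted, is shown here. The sample list is hypothetical, and label2id and id2label are illustrative names.

```python
# Hypothetical sample standing in for train_labels.tolist()
temp_list = ["notr", "notr", "negative", "positive", "notr"]

# One id per distinct class; sorting keeps the assignment reproducible
classes = sorted(set(temp_list))
label2id = {label: idx for idx, label in enumerate(classes)}
id2label = {idx: label for label, idx in label2id.items()}

# Integer labels, one per row, drawn from the per-class ids
int_labels = [label2id[label] for label in temp_list]

print(label2id)     # {'negative': 0, 'notr': 1, 'positive': 2}
print(int_labels)   # [1, 1, 0, 2, 1]
print(id2label[2])  # positive
```

Here int_labels can be passed to torch.tensor as before, and id2label plays the role of label_dict for looking up the string behind an id.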