Python ValueError: non-broadcastable output operand with shape (124,1) doesn't match the broadcast shape (124,13)


I want to normalize my training and test sets using MinMaxScaler from sklearn.preprocessing. However, the scaler does not seem to accept my test set.

import pandas as pd
import numpy as np

# Read in data.
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', 
                      header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

# Split into train/test data.
from sklearn.model_selection import train_test_split
X = df_wine.iloc[:, 1:].values
y = df_wine.iloc[:, 0].values
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state = 0)

# Normalize features using min-max scaling.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

Running the code produces a warning: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample., followed by an error: ValueError: operands could not be broadcast together with shapes (124,) (13,) (124,)

Reshaping the data still produces an error.

X_test_norm = mms.transform(X_test.reshape(-1, 1))

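The reshape leads to a different error: ValueError: non-broadcastable output operand with shape (124,1) doesn't match the broadcast shape (124,13)

Any suggestions on how to fix this error would be helpful.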


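When you hit a shape error, the first thing to do is display the shapes of all the arrays involved in the problem. Here that means X_train, X_test, and possibly others. - hpaulj
1 Answer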
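As a rough illustration of that advice, the sketch below prints the shape of every array in one go. The helper name `report_shapes` is hypothetical, and the zero-filled arrays are placeholders that only mimic the shapes the question's unpacking order would give for 178 rows split 124/54. In the real script you would pass the actual variables instead.

```python
import numpy as np

def report_shapes(**arrays):
    """Print and return the shape of each named array (hypothetical helper)."""
    shapes = {name: np.shape(a) for name, a in arrays.items()}
    for name, shape in shapes.items():
        print(f"{name}: {shape}")
    return shapes

# Placeholder arrays (assumption): stand-ins for the variables in the question.
X_train = np.zeros((124, 13))
y_train = np.zeros((54, 13))
X_test = np.zeros(124)
y_test = np.zeros(54)

shapes = report_shapes(X_train=X_train, y_train=y_train,
                       X_test=X_test, y_test=y_test)
```

A 1-D `X_test` and a 2-D `y_train` stand out immediately in such a listing, which points at the split rather than at the scaler.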


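The train/test pieces must be unpacked in the order in which train_test_split() returns them, and that order follows the input arrays: for the inputs X, y it returns X_train, X_test, y_train, y_test. When the targets are instead written as X_train, y_train, X_test, y_test, the second and third results are swapped: the name y_train receives the test features (len(y_train)=54) and the name X_test receives the 1-D training labels (len(X_test)=124). The scaler is then fitted on 13 columns but asked to transform a 1-D array, which raises the ValueError.

So you need: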

# Split into train/test data.
#                   _________________________________
#                   |       |                        \
#                   |       |                         \
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)                                        
# |          |                                      /
# |__________|_____________________________________/
# (or)
# y_train, y_test, X_train, X_test = train_test_split(y, X, test_size=0.3, random_state=0)

# Normalize features using min-max scaling.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

This produces:

X_train_norm[0]
array([ 0.72043011,  0.20378151,  0.53763441,  0.30927835,  0.33695652,
        0.54316547,  0.73700306,  0.25      ,  0.40189873,  0.24068768,
        0.48717949,  1.        ,  0.5854251 ])

X_test_norm[0]
array([ 0.72849462,  0.16386555,  0.47849462,  0.29896907,  0.52173913,
        0.53956835,  0.74311927,  0.13461538,  0.37974684,  0.4364852 ,
        0.32478632,  0.70695971,  0.60566802])
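One way to catch this kind of mix-up early is to assert the expected shapes right after the split. The following is a minimal sketch, not part of the original answer: it uses random synthetic data with the same dimensions as the wine data (178 samples, 13 features) so that it runs without a download, and the specific checks are only a suggestion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption) with the wine data's dimensions.
rng = np.random.default_rng(0)
X = rng.random((178, 13))
y = rng.integers(1, 4, size=178)

# For inputs (X, y) the results come back as X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fail early if the pieces were unpacked in the wrong order.
assert X_train.ndim == 2 and X_test.ndim == 2
assert X_train.shape[1] == X_test.shape[1]
assert len(X_train) == len(y_train) and len(X_test) == len(y_test)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With the unpacking order from the question, the first assertion would already fail, because the array bound to X_test would be 1-D.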

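So he is fitting on a 13-feature set and then testing on a 1-feature set. That explains the unusual error message. Shape errors are common in sklearn, but ones involving "non-broadcastable" are not. - hpaulj
If his dense layer doesn't match his feature count, that will also produce the non-broadcastable error. - brohjoe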
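Both error messages from the question can be reproduced with plain NumPy. The sketch below assumes the scaler multiplies its input in place by a per-feature array of length 13; that is a simplification for illustration, not a copy of scikit-learn's code. The helper `try_inplace_scale` is hypothetical.

```python
import numpy as np

# One scale factor per feature, as a scaler fitted on 13 columns would hold.
scale = np.ones(13)

def try_inplace_scale(data):
    """Run data *= scale on a copy; return the error text, or None on success."""
    data = np.array(data, dtype=float)  # work on a copy
    try:
        data *= scale
    except ValueError as exc:
        return str(exc)
    return None

msg_1d = try_inplace_scale(np.zeros(124))        # 1-D input, like the mislabeled X_test
msg_col = try_inplace_scale(np.zeros((124, 1)))  # after reshape(-1, 1)
msg_ok = try_inplace_scale(np.zeros((54, 13)))   # correct test-set shape

print(msg_1d)
print(msg_col)
print(msg_ok)
```

In the (124, 1) case the inputs broadcast to (124, 13), but the result cannot be written back into a (124, 1) array, which is why the message mentions a non-broadcastable output operand.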
