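I have a csv file containing features of different types (categorical strings, 0-1 binary, floats). I want to run SVM regression with 10-fold cross-validation. Following this post, I tried the code below to read in the data, but I get an error saying a string could not be converted to float: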
import pandas as pd
from sklearn.svm import SVC

df = pd.read_csv("output.csv")
datanumpy = df.values
x = datanumpy[:, 0:143]  # columns 0 through 142 (the features)
y = datanumpy[:, 144]  # column 144 (the labels)
clf = SVC(kernel='linear')
clf.fit(x, y)  # raises the ValueError shown below
Any ideas on how to handle these factors?
The error message is:
ValueError Traceback (most recent call last)
<ipython-input-22-731136d5a713> in <module>()
75
76 # # fitting x samples and y classes
---> 77 clf.fit(x, y)
78
79
/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/svm/base.py in fit(self, X, y, sample_weight)
149 self._sparse = sparse and not callable(self.kernel)
150
--> 151 X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
152 y = self._validate_targets(y)
153
/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
519 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
520 ensure_2d, allow_nd, ensure_min_samples,
--> 521 ensure_min_features, warn_on_dtype, estimator)
522 if multi_output:
523 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: could not convert string to float: '/Users/dorien/AC/Projects/memory/S1 - Stimuli/Exp1-2-Stimuli/MIDI/Stimulus9.mid'
Do I need to indicate which columns are factors?
datanumpy[:,0:143] only returns columns 0 through 142, while datanumpy[:,144] selects column 144, so column 143 is skipped. - dennlinger
Use pd.get_dummies() or LabelEncoder() to handle the non-numeric data. - Sociopath
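The two comments above can be combined into a rough sketch. This is not the asker's actual data: the DataFrame is a small random stand-in for output.csv, and the column names (midi_path, is_major, tempo, rating) are hypothetical. It assumes the target is continuous, so it uses SVR rather than SVC, and it adds a StandardScaler step, which is a common choice for SVMs and not something the question asked for.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Toy stand-in for output.csv; all column names here are hypothetical.
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "midi_path": rng.choice(["a.mid", "b.mid", "c.mid"], size=n),  # categorical string
    "is_major": rng.integers(0, 2, size=n),                        # 0-1 binary
    "tempo": rng.normal(120.0, 10.0, size=n),                      # float
    "rating": rng.normal(0.0, 1.0, size=n),                        # regression target
})

# Select the target by name so that no column is skipped by an off-by-one slice.
y = df["rating"].to_numpy()
features = df.drop(columns="rating")

# One-hot encode only the string column; numeric columns pass through unchanged.
X = pd.get_dummies(features, columns=["midi_path"], dtype=float)

# 10-fold cross-validated SVM regression; cross_val_score returns one R^2 value per fold.
model = make_pipeline(StandardScaler(), SVR(kernel="linear"))
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean())
```

If a string column is really an identifier, such as the file path in the error message, dropping it may make more sense than encoding it, since one-hot encoding would create one column per distinct value.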