How to use a pandas DataFrame with sklearn?

Question

The goal of my project is to predict the accuracy level of some text descriptions.

I create the vectors with FASTTEXT.

TSV output:

0  1:0.0033524514 2:-0.021896651 3:0.05087798 4:0.0072637126 ...
1  1:0.003118149 2:-0.015105667 3:0.040879637 4:0.000539902 ...

The resources are labeled as good (1) or bad (0).

To check the accuracy, I used scikit-learn and an SVM.

Following this tutorial, I wrote the following script:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt

r_filenameTSV = 'TSV/A19784.tsv'

tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])

df = pd.DataFrame(tsv_read)

df = pd.DataFrame(df.vector.str.split(' ',1).tolist(),
                                   columns = ['label','vector'])


print ("Features:" , df.vector)

print ("Labels:" , df.label)

X_train, X_test, y_train, y_test = train_test_split(df.vector, df.label, test_size=0.2,random_state=0)

#Create a svm Classifier
clf = svm.SVC(kernel='linear') 

#Train the model using the training sets
clf.fit (str((X_train, y_train)))

#Predict the response for test dataset
y_pred = clf.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

The first time I ran the script, I got the following error on line 28:

ValueError: could not convert string to float:

So I changed

clf.fit (X_train, y_train)

to


clf.fit (str((X_train, y_train)))

Then, on the same line, I got this error:

TypeError: fit() missing 1 required positional argument: 'y'

Any suggestions on how to solve this?

Thank you for your time, best regards.

- Pelide

Have you checked what str((X_train, y_train)) returns (it is not valid input)? Please share some samples of your training arrays.

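You need to separate the training data and the labels with a comma; right now fit treats str((X_train, y_train)) as x_train. If you make sure that both x_train and y_train are numeric before calling fit, it should work.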

df = pd.DataFrame(df.vector.str.split(' ',1).tolist(), columns = ['label','vector']) tells me that your data are still strings rather than numbers, which an SVM does not support. You need to convert the data to integers or floats.

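If you inspect tsv_read, you will see that it is already a DataFrame, so the line df = pd.DataFrame(tsv_read) is unnecessary. Did you create the tsv file yourself? If so, how? Are the dictionary-like values strings (quoted)? Saving and loading DataFrames through text files such as csv/tsv is a hassle.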
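The TypeError discussed in the comments above can be reproduced with a small sketch. The toy values in X_train and y_train are assumptions made up for illustration; they are not the asker's data. Wrapping both objects in str(...) produces a single string, so fit receives only one positional argument and y is missing.

```python
from sklearn import svm

# Toy data, assumed for illustration only (not the asker's vectors)
X_train = [[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [1.0, 0.9]]
y_train = [0, 1, 0, 1]

clf = svm.SVC(kernel="linear")

# str(...) collapses both objects into ONE string, so fit() receives X only
combined = str((X_train, y_train))
try:
    clf.fit(combined)
    error = None
except TypeError as exc:
    error = exc
print(type(combined).__name__, "->", error)

# Passing X and y as two separate numeric arguments works
clf.fit(X_train, y_train)
pred = clf.predict([[0.05, 0.05], [0.95, 0.95]])
print(pred)
```

Converting to str therefore only hides the original ValueError; the underlying problem is still that the features are not numeric.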

1 Answer

- zwithouta · Accepted Answer

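As mentioned in the comments, your features and labels are probably strings. sklearn, however, requires them to be numeric (typically NumPy arrays). If that is the case, you have to convert the elements of your DataFrame from strings to numeric values.

Based on your code, I assume that each element of the feature column is a list of strings and each element of the label column is a single string. Here is an example of how to convert such a DataFrame into one that contains numeric values.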

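import numpy as np
import pandas as pd

# Toy frame: each feature cell is a list of strings, each label a string
df = pd.DataFrame({'features': [['5', '4.2'], ['3', '7.9'], ['2', '9']],
                   'label': ['1', '0', '0']})
print(type(df.features[0][0]))  # <class 'str'>
print(type(df.label[0]))        # <class 'str'>


def convert_to_float(collection):
    """Convert a collection of numeric strings into a float NumPy array."""
    floats = [float(el) for el in collection]
    return np.array(floats)


# Convert features row by row and labels column-wise, then recombine
df_numeric = pd.concat([df["features"].apply(convert_to_float),
                        pd.to_numeric(df["label"])],
                       axis=1)
print(type(df_numeric.features[0][0]))  # <class 'numpy.float64'>
print(type(df_numeric.label[0]))        # <class 'numpy.int64'>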

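However, the DataFrame format described above is not the format that sklearn models expect a pandas DataFrame to have. As far as I know, sklearn models expect each feature to be stored in its own column, as is the case here: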

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

feature_df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["feature_1", "feature_2"])
label_df = pd.DataFrame(np.array([[1], [0], [0]]), columns=["label"])
df = pd.concat([feature_df, label_df], axis=1)

X_train, X_test, y_train, y_test = train_test_split(df.drop(["label"], axis=1), df["label"], test_size=1 / 3)
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
clf.predict(X_test)
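A quick way to tell the two layouts apart is to look at the column dtypes. The sketch below uses a hypothetical helper, is_sklearn_ready, which is not part of pandas or sklearn; it only checks that every column has a numeric dtype, which is a necessary condition rather than a full validation.

```python
import numpy as np
import pandas as pd


def is_sklearn_ready(X):
    """Hypothetical helper: True if every column of X has a numeric dtype."""
    return all(pd.api.types.is_numeric_dtype(dtype) for dtype in X.dtypes)


# One numeric column per feature: the layout used in the example above
good = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["feature_1", "feature_2"])

# One object column holding lists of strings: the layout assumed for the question
bad = pd.DataFrame({"features": [["5", "4.2"], ["3", "7.9"], ["2", "9"]]})

good_ok = is_sklearn_ready(good)
bad_ok = is_sklearn_ready(bad)
print(good_ok, bad_ok)
```

A column of lists or arrays has dtype object, so the check fails for it even when the individual elements are already floats.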

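That is, after converting your DataFrame so that it contains only numeric values, you need to create a separate column for each element of the lists in the feature column. You can do that as follows: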

# Stack the per-row arrays into a 2-D array of shape (n_samples, n_features)
arr = np.stack(df_numeric.features.to_numpy())
df_sklearn_compatible = pd.concat([pd.DataFrame(arr, columns=["feature_1", "feature_2"]),
                                   df_numeric["label"]],
                                  axis=1)
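The same idea can be applied to the "label index:value index:value ..." lines shown in the question. The following is a minimal sketch, not a definitive implementation: it assumes the label and the index:value pairs are separated by whitespace, it uses two sample lines from the question truncated to four features, and the parse_line helper and the feature_N column names are made up for illustration.

```python
import pandas as pd

# Two lines from the question's sample, truncated to four features (assumption:
# fields are whitespace-separated)
sample = (
    "0  1:0.0033524514 2:-0.021896651 3:0.05087798 4:0.0072637126\n"
    "1  1:0.003118149 2:-0.015105667 3:0.040879637 4:0.000539902\n"
)


def parse_line(line):
    """Hypothetical helper: split one line into (int label, {column: float})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[f"feature_{index}"] = float(value)
    return int(label), features


labels, rows = [], []
for line in sample.splitlines():
    label, features = parse_line(line)
    labels.append(label)
    rows.append(features)

# One numeric column per feature, plus a separate numeric label Series
X = pd.DataFrame(rows)
y = pd.Series(labels, name="label")
print(X.dtypes)
print(y.tolist())
```

With X and y in this shape, they can be passed to train_test_split and clf.fit as two separate arguments, as in the example above.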