Scikit learn和Pandas的结合非常出色,因此我建议您使用它。这里有一个例子:
In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = np.reshape(np.random.randn(20),(10,2))
labels = np.random.randint(2, size=10)
In [2]:
X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
y = pd.Series(labels)
In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=0)
In [4]: X_test
Out[4]:
Column_1 Column_2
2 -1.39 -1.86
8 0.48 -0.81
4 -0.10 -1.83
In [5]: y_test
Out[5]:
2 1
8 1
4 1
dtype: int32
您可以直接在DataFrame/Series上调用任何scikit函数,它们都可以正常工作。
举个例子,假设您想要执行LogisticRegression,以下是如何以优美的方式检索系数:
In [6]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model = model.fit(X_train, y_train)
df_coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns = ['Coefficient'])
df_coefs
Out[6]:
Coefficient
Column_1 0.076987
Column_2 -0.352463