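I'm currently learning Scikit-learn (please go easy on me) and I'm confused about ColumnTransformer and the training and prediction workflow. My dataset contains features such as gender, marital status, graduation status, loan amount, and income. It has a mix of object (string) and integer values, but I believe most are objects.
As far as I understand, before training a model I need to convert the object columns to numeric values, and I use ColumnTransformer for that conversion. The model training step is what confuses me. Here is my current code: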
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
df = pd.read_csv("loan_data.csv", sep=",")
df.replace("", np.nan, inplace=True)
df.dropna(inplace=True)
df = df.drop(columns=["Loan_ID"])
X = df.drop(columns=["LoanAmount"])
y = df["LoanAmount"]
loan_categories = ["Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area", "Loan_Status"]
ohe = OneHotEncoder()
ct = make_column_transformer(
    (ohe, loan_categories),
    remainder="passthrough",
)
ct.fit_transform(X)
This is where train_test_split confuses me. Should I call train_test_split before passing X to fit_transform, or now, after defining ct?
The rest of my code looks like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
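For the ordering question, one commonly used arrangement is to split first and then wrap the column transformer and the model in a Pipeline, so the encoder is fitted only on the training rows. The following is a minimal sketch of that idea, not a confirmed answer. It makes several assumptions that are not in my code above: the small DataFrame is made-up stand-in data instead of loan_data.csv, Loan_Status is used as the target (a classifier expects class labels, while LoanAmount is a continuous amount), and handle_unknown="ignore" is added so categories that appear only in the test rows do not raise an error.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for loan_data.csv (assumption: values are illustrative only).
df = pd.DataFrame({
    "Gender": ["Male", "Female"] * 10,
    "Married": ["Yes", "No", "No", "Yes"] * 5,
    "LoanAmount": [100 + 5 * i for i in range(20)],
    "Loan_Status": ["Y", "N", "Y", "Y"] * 5,
})

# Assumption: Loan_Status is the label to predict.
X = df.drop(columns=["Loan_Status"])
y = df["Loan_Status"]

# Split the raw (untransformed) data first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One-hot encode the string columns, pass numeric columns through unchanged.
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Gender", "Married"]),
    remainder="passthrough",
)

# The pipeline calls fit_transform on X_train during fit,
# and only transform on X_test during predict.
pipe = make_pipeline(ct, DecisionTreeClassifier(random_state=42))
pipe.fit(X_train, y_train)

predictions = pipe.predict(X_test)
score = accuracy_score(y_test, predictions)
```

With this layout there is no separate ct.fit_transform(X) call on the full dataset, so nothing from the test rows influences the encoder. If LoanAmount really is the intended target, the same structure would apply with a regressor and a regression metric in place of DecisionTreeClassifier and accuracy_score.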