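I am trying to do classification in Python using Pandas and scikit-learn. My dataset contains a mix of text, numerical, and categorical variables.
Let's say my dataset looks like this: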
Project Cost  Project Category  Project Description    Project Outcome
12392.2       ABC               This is a description  Fully Funded
493992.4      DEF               Stack Overflow rocks   Expired
I need to predict the variable Project Outcome. Here is what I did (assuming df contains my dataset):
I converted the categories Project Category and Project Outcome to numeric values:
    df['Project Category'] = df['Project Category'].factorize()[0]
    df['Project Outcome'] = df['Project Outcome'].factorize()[0]
The dataset now looks like this:
Project Cost  Project Category  Project Description    Project Outcome
12392.2       0                 This is a description  0
493992.4      1                 Stack Overflow rocks   1
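For reference, here is a minimal sketch of what factorize() returns, using a toy frame that mirrors the two rows above (the frame itself is an assumption for illustration). The call returns a pair of integer codes and the unique values in order of first appearance, which is why the step above keeps only element [0].

```python
import pandas as pd

# Toy frame mirroring the two example rows (illustrative data only).
df = pd.DataFrame({
    "Project Category": ["ABC", "DEF"],
    "Project Outcome": ["Fully Funded", "Expired"],
})

# factorize() returns (integer codes, unique values in order of first appearance).
codes, uniques = df["Project Category"].factorize()

# Keeping `uniques` around lets you map codes back to the original labels later.
decoded = [uniques[c] for c in codes]
```

Note that the codes are arbitrary labels with no numeric meaning, so for an input feature such as Project Category a one-hot encoding is often a safer choice than raw integer codes.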
Then I processed the text column using TF-IDF:
    tfidf_vectorizer = TfidfVectorizer()
    df['Project Description'] = tfidf_vectorizer.fit_transform(df['Project Description'])
The dataset now looks like this:
Project Cost  Project Category  Project Description                               Project Outcome
12392.2       0                 (0, 249)\t0.17070240732941433\n (0, 304)\t0..     0
493992.4      1                 (0, 249)\t0.17070240732941433\n (0, 304)\t0..     1
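A small sketch of why the column looks like that, under the assumption that the two descriptions above are the whole corpus: fit_transform() returns one sparse matrix with a row per document and a column per term, not a single number per row, so it does not fit into one DataFrame column. One possible way to keep it usable is to stack it next to the other numeric columns with scipy.sparse.hstack instead of writing it back into df.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# The two example descriptions and costs from the table (illustrative data only).
descriptions = ["This is a description", "Stack Overflow rocks"]
costs = np.array([[12392.2], [493992.4]])

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions)

# The result is a sparse matrix of shape (n_documents, n_terms).
is_sparse = sparse.issparse(tfidf_matrix)
n_docs, n_terms = tfidf_matrix.shape

# Put the numeric column and the TF-IDF columns side by side in one feature matrix.
X = sparse.hstack([sparse.csr_matrix(costs), tfidf_matrix]).tocsr()
```

Here X has one row per project, with the cost in the first column and the TF-IDF weights in the remaining columns.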
So since all variables are now numerical values, I thought I would be good to go to start training my model:
    X = df.drop(columns=['Project Outcome'], axis=1)
    y = df['Project Outcome']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    model = MultinomialNB()
    model.fit(X_train, y_train)
Any help with this? Is there a good way to do classification using variables of different data types? Thank you.
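One commonly suggested way to handle mixed column types in scikit-learn is to give each column its own transformer through ColumnTransformer and chain that with the classifier in a Pipeline. The sketch below is only an illustration under assumptions: the four rows of data are made up, the choice of MinMaxScaler and OneHotEncoder is mine, and MultinomialNB is kept only because it appears in the code above (it needs non-negative features, hence the min-max scaling).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up rows in the same layout as the example table (illustrative data only).
df = pd.DataFrame({
    "Project Cost": [12392.2, 493992.4, 1500.0, 88000.0],
    "Project Category": ["ABC", "DEF", "ABC", "DEF"],
    "Project Description": [
        "This is a description",
        "Stack Overflow rocks",
        "Another short description",
        "Yet more text about a project",
    ],
    "Project Outcome": ["Fully Funded", "Expired", "Fully Funded", "Expired"],
})

X = df.drop(columns=["Project Outcome"])
y = df["Project Outcome"]

preprocess = ColumnTransformer([
    # Scale the cost into [0, 1] so it stays non-negative for MultinomialNB.
    ("cost", MinMaxScaler(), ["Project Cost"]),
    # One-hot encode the category instead of using arbitrary integer codes.
    ("category", OneHotEncoder(handle_unknown="ignore"), ["Project Category"]),
    # A plain string (not a list) passes the column to the vectorizer as 1-D text.
    ("text", TfidfVectorizer(), "Project Description"),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", MultinomialNB()),
])

# With real data you would fit on a training split and evaluate on a held-out split.
model.fit(X, y)
predictions = model.predict(X)
```

The DataFrame keeps its original columns this way, and the same preprocessing is applied automatically at prediction time. Other classifiers may suit mixed features better than naive Bayes; the pipeline structure stays the same if you swap the last step.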
df.isnull().sum().sum(). - Danylo Baibak