Simple linear regression with a CSV data file in Sklearn

3
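I've been trying for a few days without any luck. I want to do a simple linear regression fit and prediction using sklearn, but I can't get the data to work with the model. I know I'm not reshaping my data correctly; I just don't know how to do it.
Lately I've been getting this error: "Found input variables with inconsistent numbers of samples: [1, 9]". This seems to mean that y has 9 values and X has only 1. I would have thought it should be the other way around, but when I print X it gives me one row from the CSV file, while y gives me all the rows. Any help would be appreciated.

Here is my code.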

filename = "E:/TestPythonCode/animalData.csv"

#Data set Preprocess data
dataframe = pd.read_csv(filename, dtype = 'category')
print(dataframe.head())
#Get rid of the name of the animal
#And change the hunter/scavenger to 0/1
dataframe = dataframe.drop(["Name"], axis = 1)
cleanup = {"Class": {"Primary Hunter" : 0, "Primary Scavenger": 1     }}
dataframe.replace(cleanup, inplace = True)
print(dataframe.head())
#array = dataframe.values
#Data split
# Separating the data into dependent and independent variables
X = dataframe.iloc[-1:]
y = dataframe.iloc[:,-1]
print(X)
print(y)

logReg = LogisticRegression()

#logReg.fit(X,y)
logReg.fit(X[:None],y)
#logReg.fit(dataframe.iloc[-1:],dataframe.iloc[:,-1])

This is the CSV file:

Name,teethLength,weight,length,hieght,speed,Calorie Intake,Bite Force,Prey Speed,PreySize,EyeSight,Smell,Class
T-Rex,12,15432,40,20,33,40000,12800,20,19841,0,0,Primary Hunter
Crocodile,4,2400,23,1.6,8,2500,3700,30,881,0,0,Primary Hunter
Lion,2.7,416,9.8,3.9,50,7236,650,35,1300,0,0,Primary Hunter
Bear,3.6,600,7,3.35,40,20000,975,0,0,0,0,Primary Scavenger
Tiger,3,260,12,3,40,7236,1050,37,160,0,0,Primary Hunter
Hyena,0.27,160,5,2,37,5000,1100,20,40,0,0,Primary Scavenger
Jaguar,2,220,5.5,2.5,40,5000,1350,15,300,0,0,Primary Hunter
Cheetah,1.5,154,4.9,2.9,70,2200,475,56,185,0,0,Primary Hunter
KomodoDragon,0.4,150,8.5,1,13,1994,240,24,110,0,0,Primary Scavenger

1
X = dataframe.iloc[:, :-1] - maxymoo
Doing it that way only gives me one class, which is exactly the one I am using for the label: 0 for hunter, 1 for scavenger. - MNM
2 Answers

5

Use:

X = dataframe.iloc[:,0:-1]

y = dataframe.iloc[:,-1]
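
To see why the original slice triggers the "inconsistent numbers of samples" error, it may help to compare the shapes each `iloc` expression returns. The sketch below uses three rows and two feature columns copied from the question's CSV; the variable names (`csv_text`, `last_row`, `features`, `target`) are illustrative and not part of the original code.

```python
from io import StringIO

import pandas as pd

# A few rows copied from the CSV in the question (subset of columns for brevity)
csv_text = """Name,teethLength,weight,Class
T-Rex,12,15432,Primary Hunter
Bear,3.6,600,Primary Scavenger
Hyena,0.27,160,Primary Scavenger
"""

df = pd.read_csv(StringIO(csv_text)).drop(columns=["Name"])

last_row = df.iloc[-1:]     # one ROW: the last sample, with all columns
features = df.iloc[:, :-1]  # all rows, every column except the last
target = df.iloc[:, -1]     # all rows, last column only

print(last_row.shape)  # (1, 3)
print(features.shape)  # (3, 2)
print(target.shape)    # (3,)
```

`iloc[-1:]` slices rows, so X ends up with a single sample while y keeps every row; `iloc[:, :-1]` slices columns, so X and y have the same number of samples.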

1
You need to label-encode "Name".
txt ="""T-Rex,12,15432,40,20,33,40000,12800,20,19841,0,0,Primary Hunter
Crocodile,4,2400,23,1.6,8,2500,3700,30,881,0,0,Primary Hunter
Lion,2.7,416,9.8,3.9,50,7236,650,35,1300,0,0,Primary Hunter
Bear,3.6,600,7,3.35,40,20000,975,0,0,0,0,Primary Scavenger
Tiger,3,260,12,3,40,7236,1050,37,160,0,0,Primary Hunter
Hyena,0.27,160,5,2,37,5000,1100,20,40,0,0,Primary Scavenger
Jaguar,2,220,5.5,2.5,40,5000,1350,15,300,0,0,Primary Hunter
Cheetah,1.5,154,4.9,2.9,70,2200,475,56,185,0,0,Primary Hunter
KomodoDragon,0.4,150,8.5,1,13,1994,240,24,110,0,0,Primary Scavenger"""
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from io import StringIO
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix

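f = StringIO(txt)
# The text has no header row, so pass header=None and supply the column names;
# otherwise the first data row (T-Rex) would be consumed as the header
df = pd.read_csv(f, header=None, names=['Name','TeethLength','Weight','Length','Height','Speed','Calorie Intake','Bite Force','Prey Speed','PreySize','EyeSight','Smell','Class'])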

transform_dict = {"Class": {"Primary Hunter" : 0, "Primary Scavenger": 1     }}
df.replace(transform_dict, inplace = True)

encoder=LabelEncoder()

COLUMNS=[column for column in df.columns if column not in ['Class']]

# Copy so that adding a column does not write into a slice of df
X = df[COLUMNS].copy()
y = df.iloc[:,-1]
X['Name_enc']=encoder.fit_transform(X['Name'])
X=X.drop(['Name'],axis=1)

logReg = LogisticRegression()

scaler=StandardScaler()
X=scaler.fit_transform(X)

logReg.fit(X,y)

y_pred_prob=logReg.predict_proba(X)

predictions=logReg.predict(X)

sns.countplot(x=predictions, orient='h')
plt.show()

fpr, tpr, threshholds = roc_curve(y,y_pred_prob[:,1])

plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

cm=confusion_matrix(y,predictions)
sns.heatmap(cm,annot=True,fmt='g')
plt.show()
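
For comparison, here is a minimal sketch that stays closer to the asker's original preprocessing: it drops "Name" instead of encoding it, and slices the columns with `iloc[:, :-1]`. It is only one possible arrangement, not the method of either answer. The assumptions are that the CSV is read from an inline string rather than the file path in the question, that pandas is left to infer numeric dtypes (no `dtype='category'`), and that a `StandardScaler` plus `LogisticRegression` pipeline is acceptable; `csv_text` and `model` are illustrative names.

```python
from io import StringIO

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# CSV contents copied from the question; in practice pass the file path instead
csv_text = """Name,teethLength,weight,length,hieght,speed,Calorie Intake,Bite Force,Prey Speed,PreySize,EyeSight,Smell,Class
T-Rex,12,15432,40,20,33,40000,12800,20,19841,0,0,Primary Hunter
Crocodile,4,2400,23,1.6,8,2500,3700,30,881,0,0,Primary Hunter
Lion,2.7,416,9.8,3.9,50,7236,650,35,1300,0,0,Primary Hunter
Bear,3.6,600,7,3.35,40,20000,975,0,0,0,0,Primary Scavenger
Tiger,3,260,12,3,40,7236,1050,37,160,0,0,Primary Hunter
Hyena,0.27,160,5,2,37,5000,1100,20,40,0,0,Primary Scavenger
Jaguar,2,220,5.5,2.5,40,5000,1350,15,300,0,0,Primary Hunter
Cheetah,1.5,154,4.9,2.9,70,2200,475,56,185,0,0,Primary Hunter
KomodoDragon,0.4,150,8.5,1,13,1994,240,24,110,0,0,Primary Scavenger
"""

# Let pandas infer numeric dtypes (assumption: no dtype='category')
df = pd.read_csv(StringIO(csv_text))

# Drop the animal name and map the class label to 0/1
df = df.drop(columns=["Name"])
df["Class"] = df["Class"].map({"Primary Hunter": 0, "Primary Scavenger": 1})

# Features: all rows, every column except the last; target: the last column
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Scale the features, then fit the classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

predictions = model.predict(X)
print(X.shape, y.shape)
print(predictions)
```

Note that `LogisticRegression` is a classifier, which fits the 0/1 "Class" label here even though the question title mentions linear regression. Predicting on the same nine rows used for fitting only confirms that the shapes line up; it says nothing about how well the model generalizes.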

Content provided by Stack Overflow.