Split training and test datasets without scikit-learn


I have a house price prediction dataset that I must split into a training set and a test set.
I'd like to know whether I can use numpy or scipy to do this?
I can't use scikit-learn at the moment.

6 Answers


I know your question asked only for numpy or scipy to do the train_test_split, but pandas actually has a very simple way to do it:

import pandas as pd 

# Shuffle your dataset 
shuffle_df = df.sample(frac=1)

# Define a size for your train set 
train_size = int(0.7 * len(df))

# Split your dataset 
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]

For those who want a quick and easy solution.
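
If you also need separate feature and target objects, the two DataFrames can be divided afterwards. A minimal sketch, assuming the target column is named "price" (the column name is a placeholder, not from the original question):

X_train = train_set.drop("price", axis=1)  # all feature columns
y_train = train_set["price"]               # target column
X_test = test_set.drop("price", axis=1)
y_test = test_set["price"]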


Although this is an old question, this answer might help.
Here is an sklearn-style implementation of train_test_split that accepts parameters similar to sklearn's.
import numpy as np
from itertools import chain

def _indexing(x, indices):
    """
    :param x: array from which the indices are to be fetched
    :param indices: indices to fetch
    :return: sub-array of the given array at the given indices
    """
    # np array indexing
    if hasattr(x, 'shape'):
        return x[indices]

    # list indexing
    return [x[idx] for idx in indices]

def train_test_split(*arrays, test_size=0.25, shuffle=True, random_seed=1):
    """
    Splits arrays into train and test data.
    :param arrays: arrays to split into train and test
    :param test_size: size of the test set, in the range (0, 1)
    :param shuffle: whether to shuffle the arrays or not
    :param random_seed: random seed value
    :return: 2*len(arrays) arrays, divided into train and test
    """
    # checks
    assert 0 < test_size < 1
    assert len(arrays) > 0
    length = len(arrays[0])
    for i in arrays:
        assert len(i) == length

    n_test = int(np.ceil(length*test_size))
    n_train = length - n_test

    if shuffle:
        perm = np.random.RandomState(random_seed).permutation(length)
        test_indices = perm[:n_test]
        train_indices = perm[n_test:]
    else:
        train_indices = np.arange(n_train)
        test_indices = np.arange(n_train, length)

    return list(chain.from_iterable((_indexing(x, train_indices), _indexing(x, test_indices)) for x in arrays))

Of course, sklearn's implementation supports stratified splits, K-fold cross-validation, splitting pandas Series datasets, and so on. This function only handles splitting lists and numpy arrays, which I think should work for your case.
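
A minimal usage sketch of the function above (X and y are placeholder arrays, not part of the original answer):

X = np.random.rand(100, 3)  # placeholder feature matrix
y = np.random.rand(100)     # placeholder target vector

# one (train, test) pair is returned per input array, in order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)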


This code should work fine (assuming X_data is a pandas DataFrame):

import numpy as np
num_of_rows = int(len(X_data) * 0.8)  # cast to int so it can be used as a slice index
values = X_data.values
np.random.shuffle(values)             # shuffles the rows in place to make the split random
train_data = values[:num_of_rows]     # first 80% of rows for training
test_data = values[num_of_rows:]      # remaining rows for testing

Hope this helps!

Thanks. One more question. I have column labels in the top row. I think I need to remove them, right? - CODE_DIY
@CODE_DIY Yes, you should remove the column labels. I suggest saving the column labels and writing: df.columns = [(insert column labels here)]. - jaguar
The sorting at the end is unnecessary. Just keep it shuffled. I would also use the permutation method from numpy's random module and index into your DataFrame. https://dev59.com/BV0b5IYBdhLWcg3wIeIw#29576803 - rayryeng
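
For reference, here is a minimal sketch of the permutation-and-index approach suggested in the last comment, assuming df is a pandas DataFrame (the name is a placeholder):

import numpy as np

rng = np.random.default_rng(42)     # seeded generator for reproducibility
perm = rng.permutation(len(df))     # random ordering of the row positions
n_train = int(0.8 * len(df))
train_df = df.iloc[perm[:n_train]]  # first 80% of the shuffled positions
test_df = df.iloc[perm[n_train:]]   # remaining 20%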

This solution uses only pandas and numpy.
import numpy as np

def split_train_valid_test(data, valid_ratio, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    valid_set_size = int(len(data) * valid_ratio)
    valid_indices = shuffled_indices[:valid_set_size]
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[valid_set_size:test_set_size + valid_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[valid_indices], data.iloc[test_indices]

train_set, valid_set, test_set = split_train_valid_test(dataset, valid_ratio=0.2, test_ratio=0.2)
print(len(train_set), len(valid_set), len(test_set))
## out: 16512 4128 4128

I think you need to replace train_indices = shuffled_indices[test_set_size:] with train_indices = shuffled_indices[test_set_size+valid_set_size:]. This avoids putting elements that are already in the validation or test set into the training set. - Jorge
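
For reference, a sketch of that fix; only the train_indices line in the function above changes:

train_indices = shuffled_indices[test_set_size + valid_set_size:]  # skip both valid and test indices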

import numpy as np
import pandas as pd

X_data = pd.read_csv('house.csv')
Y_data = X_data["prices"]
X_data.drop(["offers", "brick", "bathrooms", "prices"], 
            axis=1, inplace=True) # important to drop prices as well

# create random train/test split
indices = np.arange(X_data.shape[0])  # use np.arange, since np.random.shuffle cannot shuffle a range object
num_training_indices = int(0.8 * X_data.shape[0])
np.random.shuffle(indices)
train_indices = indices[:num_training_indices]
test_indices = indices[num_training_indices:]

# split the actual data
X_data_train, X_data_test = X_data.iloc[train_indices], X_data.iloc[test_indices]
Y_data_train, Y_data_test = Y_data.iloc[train_indices], Y_data.iloc[test_indices]

This assumes you want a random split. We create a list of indices with the same length as the first axis of X_data (or Y_data), i.e. the number of data points you have. We then put them in random order and simply take the first 80% of those random indices as training data, with the rest for testing. [:num_training_indices] just selects the first num_training_indices entries from the list. After that, you just extract the rows of your data using the lists of random indices, and your data is split. Remember to drop the prices from X_data, and if you want the split to be reproducible, set a seed with np.random.seed(some_integer) at the beginning.
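
For instance, a reproducible version of the shuffle above might look like this (the seed value 42 is arbitrary):

np.random.seed(42)            # fix the RNG state so the split is repeatable
indices = np.arange(X_data.shape[0])
np.random.shuffle(indices)    # identical permutation on every run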

I want to split it into 80% training and 20% test. What would the code be? - CODE_DIY
If you want to split it 80/20, set the num_train_examples variable to 80% of the number of rows in your dataset. If you have 100 rows, set it to 80. - jaguar
@jaguar, can you explain all_data[:num_train_examples]? Are we slicing it? Is there any other material I can refer to? - CODE_DIY
@CODE_DIY Please check my answer; I think it may be more helpful. - jaguar
Here is my code: import pandas as pd X_data = pd.read_csv('house.csv') X_data.drop(['offers','brick','bathrooms'], axis=1, inplace=True) y_data = X_data['price']. Now I need to split it. - CODE_DIY


Here is a quick way to do an 80/20 split using the random library:

import random
# Define a sample size, here 80% of the observations
sample_size = int(len(x)*0.80)
# Set seed for reproducibility
random.seed(47202182)
# indices are randomly sampled from 0 to the length of the original sample
train_idx = random.sample(range(0, len(x)), sample_size)
# Indices not in the train set must be in the test set
test_idx = [i for i in range(0, len(x)) if i not in train_idx]
# apply indices to lists to assign data to corresponding variables
x_train = [x[i] for i in train_idx]
x_test = [x[i] for i in test_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]
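
One note on the design: the check i not in train_idx scans a list on every iteration, which makes building test_idx quadratic in the number of observations. Converting train_idx to a set first gives the same result much faster on large datasets (a small sketch reusing the variables above):

train_idx_set = set(train_idx)  # set membership tests are O(1) instead of O(n)
test_idx = [i for i in range(0, len(x)) if i not in train_idx_set]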
