Stratifying batches of training data in TensorFlow 2

4
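I have some mini-batch data fetched from a SQLite database, with integer and float features x and binary 0/1 labels y. I am looking for something like scikit-learn's X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1, random_state=1, stratify=y), where a keyword argument stratifies the data (i.e., the same number of class-0 and class-1 instances).
In TensorFlow 2, stratification does not seem to be directly possible. My solution works for me, but it takes a long time because of all the reshaping and transposing.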
import numpy as np
import tensorflow as tf

def stratify(x, y):
    # number of positive instances (the smaller class)
    pos = np.sum(y).item() # how many positive bonds there are
    x = np.transpose(x)

    # number of features 
    f = np.shape(x)[1] 

    # filter only class 1
    y = tf.transpose(y)
    x_pos = tf.boolean_mask(x, y)
    y_pos = tf.boolean_mask(y, y)

    # filter only class 0 (for uint8 labels, invert(y) - 254 maps 1 -> 0 and 0 -> 1)
    x_neg = tf.boolean_mask(x, tf.bitwise.invert(y)-254)
    x_neg = tf.reshape(x_neg, [f,-1])
    y_neg = tf.boolean_mask(y, tf.bitwise.invert(y)-254)

    # just randomly take as many class-0 as there are class-1
    x_neg = tf.transpose(tf.random.shuffle(tf.transpose(x_neg)))
    x_neg = x_neg[:,0:pos]
    y_neg = y_neg[0:pos]

    # concat class-1 and class-0 together, join x and y, shuffle, then split x and y apart again
    x = tf.concat([x_pos,tf.transpose(x_neg)],0)
    y = tf.concat([y_pos, tf.transpose(y_neg)],0)
    xy = tf.concat([tf.transpose(x), tf.cast(np.reshape(y,[1, -1]), tf.float64)],0)
    xy = tf.transpose((tf.random.shuffle(tf.transpose(xy)))) # because there is no axis arg in shuffle
    x = xy[0:f,:]
    x = tf.transpose(x)
    y = xy[f,:]

    return x, y

I would be glad to see feedback and improvements on my function, or novel, simpler ideas.

1 Answer

3

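It is best to split the data in its raw format, before converting it to tensors. If there is a strong requirement to do it in TensorFlow only, then I suggest using the tf.data.Dataset class. I have added demo code with comments explaining the steps.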

import tensorflow as tf
import numpy as np

TEST_SIZE = 0.1
DATA_SIZE = 1000

# Create data
X_data = np.random.rand(DATA_SIZE, 28, 28, 1)
y_data = np.random.randint(0, 2, [DATA_SIZE])
samples1 = np.sum(y_data)
print('Percentage of 1 = ', samples1 / len(y_data))

# Create TensorFlow dataset
dataset = tf.data.Dataset.from_tensor_slices((X_data, y_data))

# Gather data with 0 and 1 labels separately
class0_dataset = dataset.filter(lambda x, y: y == 0)
class1_dataset = dataset.filter(lambda x, y: y == 1)

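# Shuffle them once; reshuffle_each_iteration=False keeps the order fixed,
# so the take() and skip() splits below do not overlap
class0_dataset = class0_dataset.shuffle(DATA_SIZE, reshuffle_each_iteration=False)
class1_dataset = class1_dataset.shuffle(DATA_SIZE, reshuffle_each_iteration=False)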

# Split them
class0_test_samples_len = int((DATA_SIZE - samples1) * TEST_SIZE)
class0_test = class0_dataset.take(class0_test_samples_len)
class0_train = class0_dataset.skip(class0_test_samples_len)

class1_test_samples_len = int(samples1 * TEST_SIZE)
class1_test = class1_dataset.take(class1_test_samples_len)
class1_train = class1_dataset.skip(class1_test_samples_len)

print('Train Class 0 = ', len(list(class0_train)), ' Class 1 = ', len(list(class1_train)))
print('Test Class 0 = ', len(list(class0_test)), ' Class 1 = ', len(list(class1_test)))

# Gather datasets
train_dataset = class0_train.concatenate(class1_train).shuffle(DATA_SIZE)
test_dataset = class0_test.concatenate(class1_test).shuffle(DATA_SIZE)

print('Train dataset size = ', len(list(train_dataset)))
print('Test dataset size = ', len(list(test_dataset)))

Sample output:

Percentage of 1 =  0.474
Train Class 0 =  474  Class 1 =  427
Test Class 0 =  52  Class 1 =  47
Train dataset size =  901
Test dataset size =  99
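
For the first suggestion in this answer (splitting the raw arrays before they become tensors), a minimal NumPy sketch could look like the following. The function name stratified_split and the synthetic x and y are assumptions made for illustration, not part of the original question. If scikit-learn is available, train_test_split(x, y, test_size=0.1, stratify=y) should give a comparable split.

```python
import numpy as np


def stratified_split(x, y, test_size=0.1, seed=1):
    """Split x and y so that each class keeps roughly its share in both parts."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        # Row indices of this class, in random order
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(len(idx) * test_size)
        test_idx.append(idx[:n_test])
        train_idx.append(idx[n_test:])
    # Shuffle once more so the classes are interleaved
    train_idx = rng.permutation(np.concatenate(train_idx))
    test_idx = rng.permutation(np.concatenate(test_idx))
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]


# Hypothetical data standing in for the rows read from SQLite
x = np.random.default_rng(0).random((1000, 5))
y = (np.arange(1000) % 4 == 0).astype(np.int64)  # 25% of the labels are 1

x_train, x_test, y_train, y_test = stratified_split(x, y)
print('Share of class 1 in train =', y_train.mean())
print('Share of class 1 in test  =', y_test.mean())
```

The resulting arrays can then be passed to tf.data.Dataset.from_tensor_slices((x_train, y_train)) as usual. Because the loop runs over np.unique(y), the same sketch applies to more than two classes.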

Would this approach still work if I had more classes? Is there a way to generalize it? - Bishwa Karki
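
One possible way to generalize the tf.data approach to any number of classes is to loop over the distinct labels and apply the same filter, shuffle, take and skip steps to each. This is only a sketch: NUM_CLASSES, SEED and the helpers is_class and concat_all are names assumed for this illustration, and the fixed seed with reshuffle_each_iteration=False is a choice made here so that take() and skip() see the same order.

```python
import numpy as np
import tensorflow as tf

TEST_SIZE = 0.1
DATA_SIZE = 1000
NUM_CLASSES = 4  # assumption for this demo
SEED = 1         # assumption: fixed seed so the split is reproducible

# Create data
X_data = np.random.rand(DATA_SIZE, 8)
y_data = np.random.randint(0, NUM_CLASSES, [DATA_SIZE])

dataset = tf.data.Dataset.from_tensor_slices((X_data, y_data))


def is_class(label):
    # Return a predicate bound to one label (avoids late binding in the loop)
    return lambda x, y: tf.equal(y, label)


def concat_all(parts):
    # Chain a list of datasets into a single dataset
    result = parts[0]
    for part in parts[1:]:
        result = result.concatenate(part)
    return result


train_parts, test_parts = [], []
for label in np.unique(y_data):
    n_label = int(np.sum(y_data == label))
    n_test = int(n_label * TEST_SIZE)
    class_ds = dataset.filter(is_class(int(label)))
    # Keep the shuffled order fixed so take() and skip() stay disjoint
    class_ds = class_ds.shuffle(n_label, seed=SEED, reshuffle_each_iteration=False)
    test_parts.append(class_ds.take(n_test))
    train_parts.append(class_ds.skip(n_test))

# Gather datasets
train_dataset = concat_all(train_parts).shuffle(DATA_SIZE)
test_dataset = concat_all(test_parts).shuffle(DATA_SIZE)

print('Train dataset size = ', sum(1 for _ in train_dataset))
print('Test dataset size = ', sum(1 for _ in test_dataset))
```

Each class adds one filter pass over the full dataset, so for many classes or large data it is probably cheaper to split the raw arrays first and only then build the datasets.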
