How to preprocess very large data in Python

Question


I have several files, each about 100 MB in size. They have the following format:

0  1  2  5  8  67  9  122
1  4  5  2  5  8
0  2  1  5  6
.....

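(Note that the actual files contain no alignment padding; elements are separated by a single space. The extra spaces above are only there for readability.)

The first element of each row is its binary classification label, and the rest of the row lists the indices of the features whose value is 1. For example, the third row says that the second, first, fifth and sixth features of that row are 1, and all other features are 0.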
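To make the format concrete, a single line could be split into a label and zero-based column indices roughly as follows. This is only a minimal sketch: the helper name `parse_line` is hypothetical, and it assumes the feature indices in the file are 1-based, as the description above suggests.

```python
def parse_line(line):
    """Split one record into (label, zero-based feature indices)."""
    fields = line.split()
    label = int(fields[0])
    # Assumption: feature indices in the file are 1-based.
    cols = [int(tok) - 1 for tok in fields[1:]]
    return label, cols


label, cols = parse_line("0 2 1 5 6")
print(label, cols)  # prints: 0 [1, 0, 4, 5]
```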

I tried reading every line of each file and using sparse.coo_matrix to build a sparse matrix like this:

for train in train_files:  
    with open(train) as f:
        row = []
        col = []
        for index, line in enumerate(f):
            record = line.rstrip().split(' ')
            row = row+[index]*(len(record)-4)
            col = col+record[4:]
        row = np.array(row)
        col = np.array(col)
        data = np.array([1]*len(row))
        mtx = sparse.coo_matrix((data, (row, col)), shape=(n_row, max_feature))
        mmwrite(train+'trans',mtx)

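But this takes forever. I started reading the data in the evening and left the computer running while I slept; when I woke up, it still had not finished the first file!

Is there a better way to process this kind of data?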

- I-PING Ou

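What is your goal once you have the sparse matrix? - Chih-Hsu Jack Lin

If memory is not an issue, you could consider multiprocessing. See https://docs.python.org/2/library/multiprocessing.html. - Chih-Hsu Jack Lin

@Chih-HsuJackLin I want to feed the matrix as features to train some models, such as SVM, random forest, etc. - I-PING Ou

@Chih-HsuJackLin OK, I'll look into it! Thanks - I-PING Ou

Is your expected result a matrix with samples as rows and features as columns? - Chih-Hsu Jack Lin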


1 Answer


- Chih-Hsu Jack Lin · Answer 1

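I think this approach should be somewhat faster than yours, because it does not read the file line by line. You can try running this code on a small part of one file and compare it with your own code.
This code also needs to know the number of features beforehand. If we don't know the number of features, the commented-out line in fileToMx is needed instead.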

import pandas as pd
from scipy.sparse import lil_matrix
from functools import partial


def writeMx(result, row):
    # Feature numbers are 1-based; the matrix is zero-based, so subtract 1.
    # NaN padding makes the values floats, so cast to int before indexing.
    col_ind = row.dropna().values.astype(int) - 1
    # Assign values without duplicating row index and values
    result[row.name, col_ind] = 1


def fileToMx(f):
    # number of features
    col_n = 136
    df = pd.read_csv(f, names=list(range(0,col_n+2)),sep=' ')
    # This is the label of the binary classification
    label = df.pop(0)
    # Or get the feature number by the line below
    # But it would not be the same across different files
    # col_n = df.max().max()
    # Number of row
    row_n = len(label)
    # Generate feature matrix for one file
    result = lil_matrix((row_n, col_n))
    # Save features in matrix
    # axis=1 passes each row of df to writeMx, so row.name is the row index
    df.apply(partial(writeMx, result), axis=1)
    return result

for train in train_files:
    # result is the sparse matrix you can further save or use
    result = fileToMx(train)
    print(result.shape, result.nnz)
    # Prints the shape of the matrix and the number of nonzero values.
    # Output reported in the original answer: ((420, 136), 15)
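
As an alternative sketch (not part of the answer above), the matrix could also be built in a single pass by appending to flat lists and handing them to scipy.sparse.csr_matrix in its (data, indices, indptr) form. This avoids the repeated list concatenation in the question (`row = row + ...` copies the whole list on every line) as well as per-row assignment into a lil_matrix. The function name `lines_to_csr`, the `n_features` parameter, and the assumption that feature indices are 1-based are all illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix


def lines_to_csr(lines, n_features):
    """Build (labels, CSR feature matrix) from an iterable of text lines."""
    labels = []
    indices = []   # column index of every nonzero, all rows concatenated
    indptr = [0]   # indices[indptr[i]:indptr[i + 1]] belongs to row i
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        labels.append(int(fields[0]))
        # Assumption: feature indices in the file are 1-based.
        indices.extend(int(tok) - 1 for tok in fields[1:])
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.int8)
    mtx = csr_matrix((data, indices, indptr),
                     shape=(len(labels), n_features))
    # A feature may be listed twice in one row: merge such entries,
    # then reset the summed values back to 1.
    mtx.sum_duplicates()
    mtx.data[:] = 1
    return np.array(labels), mtx


sample = ["0 1 2 5 8 67 9 122", "1 4 5 2 5 8", "0 2 1 5 6"]
labels, mtx = lines_to_csr(sample, n_features=136)
print(mtx.shape, mtx.nnz)
```

An open file object can be passed directly as `lines`, for example `with open(train) as f: labels, mtx = lines_to_csr(f, 136)`, so the file is streamed rather than loaded as a DataFrame first. Whether this is actually faster on the real files would need to be measured.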