Reading and writing large vector data to a binary file in C++

Question


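I have a C++ program that computes the population within a given radius by reading gridded population data from an ASCII file into a large 8640x3432-element vector of doubles. Reading the ASCII data into the vector takes about 30 seconds (looping over each column and each row), while the rest of the program takes only a few seconds. I was asked to speed this up by writing the population data to a binary file, which should read in faster.

The ASCII data file has a few header lines that give data specifications such as the number of columns and rows, followed by the population data for each grid cell, formatted as 3432 rows of 8640 numbers separated by spaces. The population values come in mixed formats: 0, a decimal value (0.000685648), or a value in scientific notation (2.687768e-05).

I found a few examples of reading and writing structs containing vectors to binary files and tried to implement something similar, but I am running into problems. When I write the vector to the binary file and read it back in the same program, it seems to work and gives me all the correct values, but then it ends with either "Segmentation fault: 11" or a memory allocation error saying that a "pointer being freed was not allocated". If I only read the data in from the previously written binary file (without rewriting it in the same program run), the header variables come through fine, but I get a segmentation fault before any vector data is printed.

Any advice on what I might be doing wrong, or on a better way to do this, would be greatly appreciated! I am compiling and running on a Mac, and I currently don't have Boost or any other non-standard libraries. (Note: I am extremely new to coding and am having to learn by jumping in at the deep end, so I may be missing a lot of basic concepts and terminology. Sorry!)

Here is the code I came up with: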

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

//Define struct for population file data and initialize one struct variable for reading in ascii (A) and one for reading in binary (B)
struct popFileData
{
    int nRows, nCol;
    vector< vector<double> > popCount; //this will end up having 3432x8640 elements
} popDataA, popDataB;

int main() {

    string gridFname = "sample";

    double dum;
    vector<double> tempVector;

    //open ascii population grid file to stream
    ifstream gridFile;
    gridFile.open(gridFname + ".asc");

    int i = 0, j = 0;

    if (gridFile.is_open())
    {
        //read in header data from file
        string fileLine;
        gridFile >> fileLine >> popDataA.nCol;
        gridFile >> fileLine >> popDataA.nRows;

        popDataA.popCount.clear();

        //read in vector data, point-by-point
        for (i = 0; i < popDataA.nRows; i++)
        {
            tempVector.clear();

            for (j = 0; j<popDataA.nCol; j++)
            {
                gridFile >> dum;
                tempVector.push_back(dum);
            }
            popDataA.popCount.push_back(tempVector);
        }
        //close ascii grid file
        gridFile.close();
    }
    else
    {
        cout << "Population file read failed!" << endl;
    }

    //create/open binary file
    ofstream ofs(gridFname + ".bin", ios::trunc | ios::binary);
    if (ofs.is_open())
    {
        //write struct to binary file then close binary file
        ofs.write((char *)&popDataA, sizeof(popDataA));
        ofs.close();
    }
    else cout << "error writing to binary file" << endl;

    //read data from binary file into popDataB struct
    ifstream ifs(gridFname + ".bin", ios::binary);
    if (ifs.is_open())
    {
        ifs.read((char *)&popDataB, sizeof(popDataB));
        ifs.close();
    }
    else cout << "error reading from binary file" << endl;

    //compare results of reading in from the ascii file and reading in from the binary file
    cout << "File Header Values:\n";
    cout << "Columns (ascii vs binary): " << popDataA.nCol << " vs. " << popDataB.nCol << endl;
    cout << "Rows (ascii vs binary):" << popDataA.nRows << " vs." << popDataB.nRows << endl;

    cout << "Spot Check Vector Values: " << endl;
    cout << "Index 0,0: " << popDataA.popCount[0][0] << " vs. " << popDataB.popCount[0][0] << endl;
    cout << "Index 3431,8639: " << popDataA.popCount[3431][8639] << " vs. " << popDataB.popCount[3431][8639] << endl;
    cout << "Index 1600,4320: " << popDataA.popCount[1600][4320] << " vs. " << popDataB.popCount[1600][4320] << endl;

    return 0;
}

When I write and read the binary file in the same run, this is the output:

File Header Values:
Columns (ascii vs binary): 8640 vs. 8640
Rows (ascii vs binary):3432 vs.3432
Spot Check Vector Values: 
Index 0,0: 0 vs. 0
Index 3431,8639: 0 vs. 0
Index 1600,4320: 25.2184 vs. 25.2184
a.out(11402,0x7fff77c25310) malloc: *** error for object 0x7fde9821c000: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

If I only try to read from a pre-existing binary file, this is the output I get:

File Header Values:
Columns (binary): 8640
Rows (binary):3432
Spot Check Vector Values: 
Segmentation fault: 11

Thanks in advance for any help!

- Lorien


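You can't use sizeof to get the total size of a vector, because that only gives you the size of the underlying structure, not of any data it has allocated. - Jonathan Potter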
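To make the comment above concrete, here is a minimal sketch. The helper names `payloadBytes` and `handleBytes` are made up for illustration. It contrasts the fixed size of the vector object itself with the size of the data it owns.

```cpp
#include <cstddef>
#include <vector>

// Bytes of element data owned by the vector. This is what would
// actually need to be written to disk.
std::size_t payloadBytes(const std::vector<double>& v) {
    return v.size() * sizeof(double);
}

// sizeof is evaluated at compile time from the type alone, so it is
// the same for an empty vector and for one holding millions of values.
constexpr std::size_t handleBytes() {
    return sizeof(std::vector<double>);
}
```

On a typical 64-bit implementation `handleBytes()` is a few machine words, whereas `payloadBytes` for the full 3432x8640 grid would be roughly 237 MB.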

1 Answer

- halfflat · Accepted Answer

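When you write popDataA to the file, you are writing the binary representation of the vector of vectors. However, this really is quite a small object, consisting of a pointer to the actual data (itself a series of vectors, in this case) and some size information.

When it is read back into popDataB, it kind of works! But only because the raw pointer that was in popDataA is now in popDataB, and it points to the same data in memory. Things go wrong at the end because, when the memory for the vectors is freed, the code tries to free the data referenced by popDataA twice (once for popDataA, and once again for popDataB). The same reasoning explains the segmentation fault when only reading a previously written file: the pointer stored in the file refers to memory of an earlier process and no longer points at valid data.

In short, it is not reasonable to write a vector to a file in this fashion.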
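One way to catch this class of mistake at compile time is to check whether a type is trivially copyable before dumping its bytes. The sketch below is an illustration only: `PopFileData` mirrors the struct from the question, while `PlainHeader` and `safeForRawWrite` are hypothetical names introduced here.

```cpp
#include <type_traits>
#include <vector>

// Mirrors the struct from the question: it owns heap memory via vectors.
struct PopFileData {
    int nRows;
    int nCol;
    std::vector<std::vector<double>> popCount;
};

// Hypothetical header holding only plain values and no pointers.
struct PlainHeader {
    int nRows;
    int nCol;
};

// True only for types whose bytes can be copied to and from a buffer
// without breaking ownership (no hidden pointers to free later).
template <typename T>
constexpr bool safeForRawWrite() {
    return std::is_trivially_copyable<T>::value;
}
```

A `static_assert(safeForRawWrite<T>(), "...")` placed next to a raw `write` call would have rejected `PopFileData` while still accepting `PlainHeader` or a plain `double`.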

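So what to do? The best approach is to first decide on your data representation. Like the ASCII format, it will specify which value gets written where, and it will include information about the matrix size, so that you know how large a vector to allocate when you read the values back in.

In semi-pseudocode, writing would look like: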

int nrow=...;
int ncol=...;
ofs.write((char *)&nrow,sizeof(nrow));
ofs.write((char *)&ncol,sizeof(ncol));
for (int i=0;i<nrow;++i) {
    for (int j=0;j<ncol;++j) {
        double val=data[i][j];
        ofs.write((char *)&val,sizeof(val));
    }
}

Reading is the reverse:

ifs.read((char *)&nrow,sizeof(nrow));
ifs.read((char *)&ncol,sizeof(ncol));
// allocate data-structure of size nrow x ncol
// ...
for (int i=0;i<nrow;++i) {
    for (int j=0;j<ncol;++j) {
        double val;
        ifs.read((char *)&val,sizeof(val));
        data[i][j]=val;
    }
}
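The two loops above can be collapsed into a single bulk write and a single bulk read if the grid is stored in one contiguous buffer instead of a vector of vectors. The following is a sketch under that assumption, not the code from the question: the `Grid` struct and the `writeGrid` and `readGrid` functions are hypothetical names, and the file is only expected to be read back on the same kind of machine that wrote it.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container: one contiguous row-major buffer.
struct Grid {
    std::int32_t nRows = 0;
    std::int32_t nCols = 0;
    std::vector<double> values;  // nRows * nCols elements

    double& at(std::size_t row, std::size_t col) {
        return values[row * static_cast<std::size_t>(nCols) + col];
    }
};

// Writes the dimensions, then the element data in one call.
bool writeGrid(const std::string& path, const Grid& g) {
    if (g.nRows < 0 || g.nCols < 0) return false;
    const std::size_t expected =
        static_cast<std::size_t>(g.nRows) * static_cast<std::size_t>(g.nCols);
    if (g.values.size() != expected) return false;

    std::ofstream ofs(path, std::ios::binary | std::ios::trunc);
    if (!ofs) return false;
    ofs.write(reinterpret_cast<const char*>(&g.nRows), sizeof(g.nRows));
    ofs.write(reinterpret_cast<const char*>(&g.nCols), sizeof(g.nCols));
    // Write the elements the vector points to, not the vector object.
    ofs.write(reinterpret_cast<const char*>(g.values.data()),
              static_cast<std::streamsize>(g.values.size() * sizeof(double)));
    return static_cast<bool>(ofs);
}

// Reads the dimensions, allocates the buffer, then reads the data in one call.
bool readGrid(const std::string& path, Grid& g) {
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs) return false;
    std::int32_t nRows = 0;
    std::int32_t nCols = 0;
    ifs.read(reinterpret_cast<char*>(&nRows), sizeof(nRows));
    ifs.read(reinterpret_cast<char*>(&nCols), sizeof(nCols));
    if (!ifs || nRows < 0 || nCols < 0) return false;

    std::vector<double> values(static_cast<std::size_t>(nRows) *
                               static_cast<std::size_t>(nCols));
    ifs.read(reinterpret_cast<char*>(values.data()),
             static_cast<std::streamsize>(values.size() * sizeof(double)));
    if (!ifs) return false;  // file was shorter than the header promised

    g.nRows = nRows;
    g.nCols = nCols;
    g.values = std::move(values);
    return true;
}
```

With this layout, element (i, j) is reached through `at(i, j)` rather than `popCount[i][j]`. The bulk calls avoid millions of small stream operations, which is likely where most of the speed-up over a per-element loop would come from.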

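That said, you should consider not writing things to a binary file like this. Ad hoc binary formats of this sort tend to live on long past their anticipated usefulness, and they tend to suffer from:

Lack of documentation
Lack of extensibility
Format changes without versioning information
Issues when using saved data across different machines, including endianness problems, different default sizes for integers, etc.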
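If an ad hoc format is used anyway, some of the issues listed above can be reduced with a small self-describing header. The sketch below is one possible shape, with every name and constant invented for illustration (`kMagic`, `kVersion`, `GridHeader`, and the helper functions). It uses fixed-width integers and writes them byte by byte in a fixed order, so the header does not depend on the host's integer size or endianness. It does not address the representation of the double values themselves.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Invented for this sketch: a tag that spells "GRID" when stored least
// significant byte first, and a format version number.
constexpr std::uint32_t kMagic = 0x44495247u;
constexpr std::uint32_t kVersion = 1;

struct GridHeader {
    std::uint32_t nRows = 0;
    std::uint32_t nCols = 0;
};

// Stores a 32-bit value as four bytes, least significant byte first,
// regardless of the byte order of the machine running the code.
void putU32(std::ostream& os, std::uint32_t v) {
    unsigned char bytes[4];
    for (int i = 0; i < 4; ++i) {
        bytes[i] = static_cast<unsigned char>((v >> (8 * i)) & 0xFFu);
    }
    os.write(reinterpret_cast<const char*>(bytes), 4);
}

bool getU32(std::istream& is, std::uint32_t& v) {
    unsigned char bytes[4];
    if (!is.read(reinterpret_cast<char*>(bytes), 4)) return false;
    v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<std::uint32_t>(bytes[i]) << (8 * i);
    }
    return true;
}

void writeHeader(std::ostream& os, const GridHeader& h) {
    putU32(os, kMagic);
    putU32(os, kVersion);
    putU32(os, h.nRows);
    putU32(os, h.nCols);
}

// Rejects truncated input, foreign files, and unknown format versions.
bool readHeader(std::istream& is, GridHeader& h) {
    std::uint32_t magic = 0;
    std::uint32_t version = 0;
    GridHeader tmp;
    if (!getU32(is, magic) || !getU32(is, version) ||
        !getU32(is, tmp.nRows) || !getU32(is, tmp.nCols)) {
        return false;
    }
    if (magic != kMagic || version != kVersion) return false;
    h = tmp;
    return true;
}

// Convenience helper: the encoded header as a byte string.
std::string headerBytes(const GridHeader& h) {
    std::ostringstream os;
    writeHeader(os, h);
    return os.str();
}
```

The same functions accept a `std::ofstream` or `std::ifstream` opened with `std::ios::binary`. Bumping `kVersion` whenever the layout changes lets an older reader refuse a file instead of misinterpreting it.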

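Consequently, I would strongly recommend using a third-party library. For scientific data, HDF5 and netcdf4 are good choices: they address all of the above issues for you and come with tools that can inspect the data without knowing anything about your particular program.

Lighter-weight options include the Boost serialization library and Google's protocol buffers, but these address only some of the issues listed above.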