Fast method to read the nth line from a file

Question

5

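Introduction

I have a C++ process called MyProcess that I call nbLines times, where nbLines is the number of lines of input data in a large file, InputDataFile.txt. For example, the call

./MyProcess InputDataFile.txt 142

tells MyProcess that its input data is on line 142 of InputDataFile.txt.

Problem

The problem is that InputDataFile.txt is so large (~150 GB) that the time spent searching for the right line is not negligible. Inspired by this post, here is my (possibly not optimal) code: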

int line = 142;
int N = line - 1;
std::ifstream inputDataFile(filename.c_str());
std::string inputData;
for(int i = 0; i < N; ++i)
    std::getline(inputDataFile, inputData);

std::getline(inputDataFile,inputData);

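Goal

My goal is to make the search for inputData in MyProcess faster.

Possible solution

In bash, build a mapping from each line number to the index of that line's first character. Then, instead of passing 142 to MyProcess, I could pass the index of the first character of interest directly. MyProcess could jump straight to that position without searching for and counting '\n' characters, and then read data until it hits a '\n'. Is this approach feasible? How could it be implemented?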
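On the C++ side, the jump itself could look roughly like the sketch below. It is only an illustration: readLineAt is a hypothetical helper name, and the offset is assumed to be the 0-based byte position of the line's first character.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the line that starts at byte `offset` of `path`.
// Assumes `offset` is a 0-based byte position pointing at the first character of a line.
std::string readLineAt(const std::string& path, std::uint64_t offset) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    // Jump straight to the line instead of counting '\n' characters.
    in.seekg(static_cast<std::streamoff>(offset));
    std::string line;
    if (!std::getline(in, line)) {
        throw std::out_of_range("no line starts at the given offset");
    }
    return line;  // getline strips the trailing '\n'
}
```

MyProcess would then parse the offset from its command line (for instance with std::stoull) and call readLineAt(filename, offset).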

Of course, I welcome any other solution that reduces the overall computation time spent importing the input data.

- Remi.b

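Possible duplicate: In C++ is there a way to go to a specific line in a text file? - Thomas Matthews

Possible duplicate: Getting the nth line of a text file in C++ - Thomas Matthews

Is there a reason the input data has to be stored as plain text? Why not use a more searchable storage method? Does the file change often, or does it always stay the same? - Alex Zywicki

The file does not change in size. I don't think there is any specific reason the input data has to be stored as plain text; I just didn't know whether there was another solution. - Remi.b

@anubhava Because the possible solution suggests using bash, but if you think it is unreasonable, I can remove the tag. - Remi.b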

3 Answers

1

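There is no "fast" way to read the Nth line of text from a file.

A text file contains variable-length records, each terminated by a newline. The text must be read character by character until a newline is found. That could be 1 character or 245 characters; there is no standard size.

The usual approach is to read each line and discard it until you reach the line you want.

If you frequently need to go to specific lines in the file, you can maintain a map of line numbers to their file positions.

Otherwise, you can try reading blocks or chunks into a buffer and scanning the buffer. This will speed up your program, but you have to account for text lines that span a buffer boundary. Remember, input is most efficient when it keeps streaming (think of a river of data).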
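A minimal sketch of such a chunked scan, assuming '\n' line endings. The name findLineStart and the 1 MiB default chunk size are illustrative choices, not part of the answer. Because this version only counts newline bytes to find where a line starts, a line that spans two chunks needs no special handling; that concern applies once you also collect line contents while scanning.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the 0-based byte offset at which the 1-based line `lineNumber` starts.
std::uint64_t findLineStart(const std::string& path, std::uint64_t lineNumber,
                            std::size_t chunkSize = 1 << 20) {
    if (lineNumber == 0) {
        throw std::invalid_argument("line numbers start at 1");
    }
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::uint64_t newlinesToSkip = lineNumber - 1;
    if (newlinesToSkip == 0) {
        return 0;  // the first line starts at the beginning of the file
    }
    std::vector<char> buf(chunkSize);
    std::uint64_t base = 0;  // file offset of buf[0]
    while (true) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) {
            break;  // end of file reached
        }
        const char* p = buf.data();
        const char* const end = p + got;
        // memchr scans the chunk for the next newline byte.
        while ((p = static_cast<const char*>(
                    std::memchr(p, '\n', static_cast<std::size_t>(end - p)))) != nullptr) {
            ++p;  // the next line starts right after the newline
            if (--newlinesToSkip == 0) {
                return base + static_cast<std::uint64_t>(p - buf.data());
            }
        }
        base += got;
    }
    throw std::out_of_range("file has fewer lines than requested");
}
```

The returned offset can be passed to seekg, followed by a single std::getline, to fetch the line itself.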

- Thomas Matthews

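Thanks for your answer. I'm not sure I understood your sentence "a map of line numbers to their file positions" correctly, but it sounds a lot like the possible solution I proposed in my post. I could build this map once in bash for every line and then pass it to ./MyProcess. I don't know how to implement that, though. - Remi.b

Maintain a database of files, including their modification dates and an array of offsets for each newline. When parsing a file, if you have a record for that file and the date matches the file's modification date, jump to the right line. If not, read the file line by line and record the file offset of every newline. Update the record for that file, along with its latest modification date, in your database. For performance, the database could use a hash lookup on the file name. - blackghost

Since you tagged your post as C++, you can use a std::map<unsigned int, std::streampos> to hold the line numbers and file positions (respectively). Before reading a line, add the line number and file position pair to the map. - Thomas Matthews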
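The std::map suggestion in the last comment might be sketched as follows. buildLineMap is a hypothetical name, line numbers are assumed to be 1-based, and the map lives only in memory, so it would still have to be rebuilt or saved to disk between runs of MyProcess.

```cpp
#include <fstream>
#include <map>
#include <stdexcept>
#include <string>

// Builds an in-memory map from 1-based line number to the stream position of that line.
std::map<unsigned int, std::streampos> buildLineMap(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::map<unsigned int, std::streampos> positions;
    std::string line;
    unsigned int lineNumber = 1;
    std::streampos pos = in.tellg();  // position before the first line
    while (std::getline(in, line)) {
        positions[lineNumber++] = pos;  // record where the line just read began
        pos = in.tellg();               // position before the next line
    }
    return positions;
}
```

To read line n afterwards, clear the stream state if needed, call seekg(positions.at(n)), and then std::getline once.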

0

Since this is tagged bash, here is a simple function using sed

Definition

getline() { sed "${2}q;d" "$1"; }

Usage

getline InputData.txt 142

- karakfa

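Thanks. I know how to get a line in Bash. My problem is that every call to MyProcess has to find the right line itself, so I thought that in Bash we could create a map of InputData.txt and hand it to MyProcess, to speed up the search for the data on a specific line. Let me know if the question is unclear. - Remi.b

The best solution depends on the context. Do you know the usage pattern? How many of these lines will be accessed, and in what order? You can reduce the linear scan time by splitting the file into N segments and implementing two-level access. - karakfa

All lines will be accessed, but at very different times (say, weeks apart). I will eventually call MyProcess ${dataFile} ${LineNumber} for every line. Since MyProcess is too slow (because each MyProcess call currently has to search the file for the right line independently), I am thinking of computing a map once (which requires only a single pass through the whole file), storing the map on disk, and passing it to MyProcess on each call (MyProcess ${dataFile} ${LineNumber} ${map}). Does my question make sense? - Remi.b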

- Alex Zywicki · Accepted Answer

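As the other answers suggest, building a map of the file is probably a good idea. Here is how I would do it, in pseudocode:

let offset be an unsigned 64-bit int = 0

for each line in the file
    read the line
    write offset to a binary file (as 8 raw bytes rather than as chars)
    offset += length of line in bytes (including its newline)

Now you have a "map" file containing a list of 64-bit integers (one per line of the data file). To read the map, just compute where the entry for the desired line sits in the map:

offset = desired_line_number * 8 // where line number starts at 0
offset2 = (desired_line_number + 1) * 8

data_position1 = load bytes [offset, offset + 8) as a 64-bit int from map
data_position2 = load bytes [offset2, offset2 + 8) as a 64-bit int from map

data = load bytes [data_position1 through data_position2 - 1] as a string from data.

The idea is that you read through the data file once, recording the byte offset at which each line starts, and store those offsets sequentially in a binary file using a fixed-size integer type. The map file should then be number_of_lines * sizeof(integer_type_used) bytes in size. To find a line, you seek into the map file at the computed position of that line's entry, read that offset, and read the next line's offset as well. From there you have the byte range in which your data lives. For the last line there is no next entry, so read to the end of the data file instead.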
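One way to sketch the map-building pass in C++ is shown below. The name buildIndex is hypothetical. As a deliberate deviation from the description above, this sketch appends one extra entry holding the end of the data (the file size), so that a "next offset" exists even for the last line; the resulting file therefore holds number_of_lines + 1 entries. Offsets are written in the machine's native byte order, so the map is assumed to be read back on the same kind of machine.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Writes one 64-bit start offset per line of dataPath into indexPath,
// followed by one final entry equal to the size of the data.
// Returns the number of lines indexed.
std::uint64_t buildIndex(const std::string& dataPath, const std::string& indexPath) {
    std::ifstream data(dataPath, std::ios::binary);
    std::ofstream index(indexPath, std::ios::binary | std::ios::trunc);
    if (!data || !index) {
        throw std::runtime_error("cannot open data or index file");
    }
    std::string line;
    std::uint64_t offset = 0;
    std::uint64_t count = 0;
    while (std::getline(data, line)) {
        index.write(reinterpret_cast<const char*>(&offset), sizeof(offset));
        ++count;
        // getline consumed the line plus its '\n', unless the line ended at end of file.
        offset += line.size() + (data.eof() ? 0 : 1);
    }
    // Extra entry (an assumption of this sketch): where the last line ends.
    index.write(reinterpret_cast<const char*>(&offset), sizeof(offset));
    return count;
}
```

The pass reads the data file once from start to finish, which matches the requirement that the expensive scan happens a single time.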

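Example:

Data:

hello\n
world\n
(\n newline at end of file)

Create the map.

Map: each [number] group represents 8 bytes in the map file.

[0][6][12]
//or in binary
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000110
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00001100

Now suppose I want the second line:

line offset = (2 - 1) * 8 // offset is 8

Because offsets are 0-based, offset 8 is the 9th byte of the map file. Our number therefore occupies the 9th through 16th bytes, which are:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000110
//or as decimal
6

Now we know that our line starts at offset 6 in the data file (0-based; counting from 1 it is the 7th byte), because hello\n occupies offsets 0-5.

We then repeat the same process to get the starting offset of the next line, which is 12.

Finally, we read the byte range 6-11 (0-based; bytes 7-12 counting from 1), store it as a string, and get world\n.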

C++ implementation:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

int main(int argc, const char * argv[]) {
    const std::string filename = "path/to/input.txt";
    const std::string mapFilename = "path/to/map/file.bin";

    std::ifstream inputFile(filename.c_str(), std::ios::binary);
    std::ofstream outfile(mapFilename.c_str(), std::ios::binary | std::ios::trunc);

    if (!inputFile.is_open() || !outfile.is_open()) {
        //use better error handling than this
        throw std::runtime_error("Error opening files");
    }

    std::string inputData;
    std::uint64_t offset = 0; //fixed 8-byte type, so every map entry has the same size
    while (std::getline(inputFile, inputData)) {
        //write the offset as binary
        outfile.write(reinterpret_cast<const char*>(&offset), sizeof(offset));
        //advance to the start of the next line:
        //add one because getline strips the \n
        offset += inputData.length() + 1;
    }
    outfile.close();

    offset = 0;

    //from here on we are reading the map
    std::ifstream inmap(mapFilename.c_str(), std::ios::binary);
    std::size_t line = 2; //your chosen line number (1-based)
    std::size_t idx = (line - 1) * sizeof(offset); //position of that line's entry in the map
    //seek into the map
    inmap.seekg(idx);
    //read the binary at that location
    inmap.read(reinterpret_cast<char*>(&offset), sizeof(offset));
    std::cout << offset << std::endl;

    //from here you just need to look up the line in the data file in the same manner

    return 0;
}
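The final lookup that the last comment leaves open could be sketched like this. readIndexedLine is a hypothetical name; the map file is assumed to hold one native-endian 64-bit start offset per line, and line numbers are assumed to be 1-based. Instead of reading the next line's offset, this variant lets std::getline stop at the newline, which also works for the last line.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Looks up the start offset of 1-based line `lineNumber` in the map file,
// then reads that single line from the data file.
std::string readIndexedLine(const std::string& dataPath, const std::string& mapPath,
                            std::uint64_t lineNumber) {
    if (lineNumber == 0) {
        throw std::invalid_argument("line numbers start at 1");
    }
    std::ifstream map(mapPath, std::ios::binary);
    std::ifstream data(dataPath, std::ios::binary);
    if (!map || !data) {
        throw std::runtime_error("cannot open data or map file");
    }
    std::uint64_t offset = 0;
    // Each map entry is sizeof(offset) bytes, so entry k starts at byte k * 8.
    map.seekg(static_cast<std::streamoff>((lineNumber - 1) * sizeof(offset)));
    if (!map.read(reinterpret_cast<char*>(&offset), sizeof(offset))) {
        throw std::out_of_range("line number is not in the map");
    }
    data.seekg(static_cast<std::streamoff>(offset));
    std::string line;
    if (!std::getline(data, line)) {
        throw std::out_of_range("map entry points past the end of the data");
    }
    return line;  // without the trailing '\n'
}
```

Each call touches only 8 bytes of the map and one line of the data file, so the cost no longer depends on how far into the file the line sits.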