How do I read an entire file into a std::string in C++?

Question

261

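How do I read a file into a std::string, i.e., read the whole file at once?

Text or binary mode should be specified by the caller. The solution should be standard-compliant, portable, and efficient. It should not needlessly copy the string's data, and it should avoid reallocating memory while reading into the string.

One way to do this would be to stat the file size, resize the std::string, and fread() into the std::string's const_cast<char*>()'ed data(). This requires the std::string's data to be contiguous, which the standard does not require, although it appears to hold for all known implementations. Worse, if the file is read in text mode, the std::string's size may not equal the file's size.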

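A fully correct, standard-compliant, and portable solution could be built by reading std::ifstream's rdbuf() into a std::ostringstream and from there into a std::string. However, this could copy the string data and/or needlessly reallocate memory.

Are all relevant standard library implementations smart enough to avoid all unnecessary overhead?
Is there another way to do it?
Did I miss some hidden Boost function that already provides the desired functionality?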

void slurp(std::string& data, bool is_binary)

- wilbur_m

1

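Text and binary mode are MSDOS- and Windows-specific hacks that try to get around the fact that newlines are represented by two characters in Windows (CR/LF). In text mode, they are treated as one character ('\n'). - Ferruccio

2

Although not quite an exact duplicate, this is closely related to: How to pre-allocate memory for a std::string object? (which, contrary to Konrad's statement above, includes code to do this, reading the file directly into the destination without an extra copy). - Jerry Coffin

2

"Contiguity is not required by the standard": in fact, in a roundabout way, it is. As soon as you use op[] on the string, it must be coalesced into a contiguous writable buffer, so it is guaranteed safe to write to &str[0] if you .resize() it large enough first. And in C++11, strings are always contiguous. - Tino Didriksen

4

Related link: How to read a file in C++? It benchmarks and discusses the various approaches. Also, the rdbuf approach in the accepted answer is not the fastest; read is. - legends2k

2

All of these solutions will produce a malformed string if the file's encoding/interpretation is wrong. I kept running into a strange problem when serializing a JSON file into a string until I manually converted it to UTF-8; no matter which solution I tried, I only ever got the first character! Just a gotcha to watch out for! :) - kayleeFrye_onDeck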

24 Answers

89

The shortest variant (live on Coliru):

std::string str(std::istreambuf_iterator<char>{ifs}, {});

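It requires the header <iterator>.

There were some reports that this method is slower than preallocating the string and using std::istream::read. However, on a modern compiler with optimisations enabled this no longer seems to be the case, though the relative performance of the various methods appears to be highly compiler-dependent.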
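As a rough sketch, the one-liner can be wrapped into a complete function as follows. The name read_with_iterators, the binary open mode, and the exception thrown on open failure are all assumptions added for illustration; they are not part of the original answer.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical wrapper around the one-liner; the name is illustrative only.
std::string read_with_iterators(const std::string& path) {
    // Binary mode is an assumption here; drop std::ios::binary for text mode.
    std::ifstream ifs(path, std::ios::in | std::ios::binary);
    if (!ifs) {
        throw std::runtime_error("cannot open " + path);
    }
    // A default-constructed istreambuf_iterator marks end-of-stream.
    return std::string(std::istreambuf_iterator<char>(ifs),
                       std::istreambuf_iterator<char>());
}
```

Both iterators are spelled out here instead of using {} for the end iterator; that is a readability choice and should behave the same way.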

- Konrad Rudolph

10

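Could you expand on this answer? How efficient is it? Does it read the file one character at a time? Is there any way to preallocate the string's memory? - Martin Beckett

@M.M As I read it, this method is slower than the pure C++ approach of reading into a preallocated buffer. - Konrad Rudolph

You're right, it's a case of the title being below the code sample rather than above it :) - M.M

@coincheung Unfortunately, yes. If you want to avoid memory allocations, you need to buffer the reads manually. C++ IO streams are pretty bad. - Konrad Rudolph

1

@coincheung It should avoid repeated allocations, but it stupidly doesn't. The "canonical" way to read an entire file in C++17 is https://gist.github.com/klmr/849cbb0c6e872dff0fdcc54787a66103. Unfortunately, it is very verbose. - Konrad Rudolph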

56

See this answer to a similar question.

For your convenience, I'm reposting CTT's solution:

string readFile2(const string &fileName)
{
    ifstream ifs(fileName.c_str(), ios::in | ios::binary | ios::ate);

    ifstream::pos_type fileSize = ifs.tellg();
    ifs.seekg(0, ios::beg);

    vector<char> bytes(fileSize);
    ifs.read(bytes.data(), fileSize);

    return string(bytes.data(), fileSize);
}

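Averaged over 100 runs against the text of Moby Dick (1.3M), this solution ran about 20% faster than the other answers presented here. Not bad for a portable C++ solution; I would like to see the results of mmap'ing the file ;)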

- ceretullis

3

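Related: a timing comparison of various methods: Reading in an entire file at once in C++ - jfs

24

Until today, I had never seen tellg() report anything other than the file size. It took me hours to find the source of the bug. Please do not use tellg() to get the file size. - Puzomor Croatia

1

You also need to check for empty files, since &bytes[0] would dereference a nullptr. - Andriy Tylychko

1

@paxos1977> Stating which systems your program is defined to be correct on is up to you. As is, it relies on guarantees that C++ does not provide, and as such it is wrong. If it works on a known set of implementations that do provide such guarantees (as in: documented as guarantees, not merely "it happens to look okay on the version I have around today"), then make that explicit; otherwise it is misleading. - spectras

1

As an experienced professional, you have probably heard about plenty of "this shouldn't happen" bugs, especially when working in long-lived codebases. You learn that claims such as "this shouldn't happen", "we will never target another platform", or "there will never be a new compiler version" do not hold up in a company's real production codebase, because the product lives on after it ships. So when you do have to rely on such claims, the least you can do is mark it clearly and write a unit test for it. - spectras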

50

If you are using C++17 (std::filesystem), there is also this way, which gets the file's size through std::filesystem::file_size instead of seekg and tellg:

#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

std::string readFile(fs::path path)
{
    // Open the stream to 'lock' the file.
    std::ifstream f(path, std::ios::in | std::ios::binary);

    // Obtain the size of the file.
    const auto sz = fs::file_size(path);

    // Create a buffer.
    std::string result(sz, '\0');

    // Read the whole file into the buffer.
    f.read(result.data(), sz);

    return result;
}

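Note: you may need to use <experimental/filesystem> and std::experimental::filesystem if your standard library doesn't yet fully support C++17. You might also need to replace result.data() with &result[0] if it doesn't support non-const std::basic_string data.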
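A variant of the function above that treats the reported size as a hint only might look like the sketch below. The name read_file_checked and the choice of exceptions are assumptions for illustration. It checks that the file opened, uses the error_code overload of file_size, and shrinks the string to the number of bytes actually read in case the file changed between the two calls.

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical variant: same approach, plus basic error handling.
std::string read_file_checked(const fs::path& path) {
    std::ifstream f(path, std::ios::in | std::ios::binary);
    if (!f) {
        throw std::runtime_error("cannot open " + path.string());
    }

    // The error_code overload reports failure through `ec` instead of throwing.
    std::error_code ec;
    const std::uintmax_t sz = fs::file_size(path, ec);
    if (ec) {
        throw std::system_error(ec, "cannot get size of " + path.string());
    }

    std::string result(static_cast<std::size_t>(sz), '\0');
    f.read(result.data(), static_cast<std::streamsize>(result.size()));

    // Keep only what was actually read; the file may have shrunk meanwhile.
    result.resize(static_cast<std::size_t>(f.gcount()));
    return result;
}
```

If the file grows after file_size is called, this sketch silently drops the extra bytes, so it narrows the window for inconsistency without removing it.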

- Gabriel Majeri

6

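This may cause undefined behaviour: on some operating systems, opening a file in text mode yields a stream that differs from the file on disk. - M.M

1

It was originally developed as boost::filesystem, so you can also use Boost if you don't have C++17. - Gerhard Burger

13

Opening a file with one API and getting its size with another seems to be asking for inconsistency and race conditions. - Arthur Tacca

1

What is the advantage of using std::filesystem::file_size over seekg and tellg? - starriet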

28

#include <iostream>
#include <sstream>
#include <fstream>

int main()
{
  std::ifstream input("file.txt");
  std::stringstream sstr;

  while(input >> sstr.rdbuf());

  std::cout << sstr.str() << std::endl;
}

Or something very close. I don't have a stdlib reference open to double-check.

Yes, I realize I didn't write the slurp function as asked.

- Ben Collins

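This looks nice, but it doesn't compile. The changes needed to make it compile reduce it to other answers on this page. http://ideone.com/EyhfWm - JDiMatteo

6

Why the while loop? - Zitrax

2

Agreed. When operator>> reads into a std::basic_streambuf, it consumes (what is left of) the input stream, so the loop is unnecessary. - Remy Lebeau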

19

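I do not have enough reputation to comment directly on the answers that use tellg().

Please be aware that tellg() can return -1 on error. If you pass the result of tellg() as an allocation parameter, you should sanity-check the result first.

An example of the problem: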

...
std::streamsize size = file.tellg();
std::vector<char> buffer(size);
...

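In the example above, if tellg() encounters an error it returns -1. The implicit conversion between a signed type (the result of tellg()) and an unsigned type (the argument to the vector<char> constructor) makes your vector erroneously try to allocate a very large number of bytes (probably 4294967295 bytes, or 4GB).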

Modifying paxos1977's answer to account for the above:

string readFile2(const string &fileName)
{
    ifstream ifs(fileName.c_str(), ios::in | ios::binary | ios::ate);

    ifstream::pos_type fileSize = ifs.tellg();
    if (fileSize < 0)                             // <--- ADDED
        return std::string();                     // <--- ADDED

    ifs.seekg(0, ios::beg);

    vector<char> bytes(fileSize);
    ifs.read(&bytes[0], fileSize);

    return string(&bytes[0], fileSize);
}

- Rick Ramstetter

1

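Not only that, tellg() does not return the size but a token. Many systems use a byte offset as the token, but this is not guaranteed, and some systems do not. See this answer for an example. - spectras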

10

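Since this seems like a widely used utility, my approach would be to search for and prefer an existing library over a hand-made solution, especially if the Boost libraries are already linked in your project (linker flags -lboost_system -lboost_filesystem). There (and in older Boost versions too), Boost provides a load_string_file utility: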

#include <iostream>
#include <string>
#include <boost/filesystem/string_file.hpp>

int main() {
    std::string result;
    boost::filesystem::load_string_file("aFileName.xyz", result);
    std::cout << result.size() << std::endl;
}

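As an advantage, this function doesn't seek through the entire file to determine its size; it uses stat() internally instead. A possibly negligible disadvantage, which is easy to infer from inspecting the source code, is that the string is needlessly resized with '\0' characters that are then overwritten by the file contents.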

- b.g.

8

This solution adds error checking to the rdbuf()-based method.

std::string file_to_string(const std::string& file_name)
{
    std::ifstream file_stream{file_name};

    if (file_stream.fail())
    {
        // Error opening file.
    }

    std::ostringstream str_stream{};
    file_stream >> str_stream.rdbuf();  // NOT str_stream << file_stream.rdbuf()

    if (file_stream.fail() && !file_stream.eof())
    {
        // Error reading file.
    }

    return str_stream.str();
}

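I'm adding this answer because adding error checking to the original method is not as trivial as you would expect. The original method uses stringstream's insertion operator (str_stream << file_stream.rdbuf()). The problem is that this sets the stringstream's failbit when no characters are inserted. That can happen because of an error, or because the file is empty. If you check for failures by inspecting the failbit, you get a false positive when you read an empty file. How do you tell a legitimate failure to insert any characters apart from a "failure" to insert any characters because the file is empty?

You might think to check explicitly for an empty file, but that means more code and the associated error checking.

Checking for the failure condition str_stream.fail() && !str_stream.eof() doesn't work, because the insertion operation doesn't set the eofbit (on the ostringstream or the ifstream).

So the solution is to change the operation. Instead of using ostringstream's insertion operator (<<), use ifstream's extraction operator (>>), which does set the eofbit. Then check for the failure condition file_stream.fail() && !file_stream.eof().

Importantly, when file_stream >> str_stream.rdbuf() encounters a legitimate failure, it shouldn't ever set eofbit (according to my understanding of the specification). That means the above check is sufficient to detect legitimate failures.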
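The difference between the two operations can be observed with a small sketch like the one below. The two function names are hypothetical, and the behaviour described is what the answer's reasoning predicts rather than something verified on every standard library. For an empty file, the insertion form should leave failbit set on the ostringstream, while the extraction form should set eofbit on the ifstream and so not be reported as an error.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// True if `str_stream << file.rdbuf()` leaves failbit set on the ostringstream.
bool insertion_sets_failbit(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    std::ostringstream str_stream;
    str_stream << file.rdbuf();
    return str_stream.fail();
}

// True only when the check from the answer would report a read error.
bool extraction_reports_error(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    std::ostringstream str_stream;
    file >> str_stream.rdbuf();
    return file.fail() && !file.eof();
}
```

With a file that could not be opened at all, the second function should report an error, because failbit is set and eofbit is not.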

- tgnottingham

6

Something like this shouldn't be too bad:

void slurp(std::string& data, const std::string& filename, bool is_binary)
{
    std::ios_base::openmode openmode = ios::ate | ios::in;
    if (is_binary)
        openmode |= ios::binary;
    ifstream file(filename.c_str(), openmode);
    data.clear();
    data.reserve(file.tellg());
    file.seekg(0, ios::beg);
    data.append(istreambuf_iterator<char>(file.rdbuf()), 
                istreambuf_iterator<char>());
}

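The advantage here is that we reserve first, so the string doesn't have to grow as we read. The disadvantage is that we process the file char by char. A smarter version could grab the whole read buffer and then call underflow.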
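One possible way to avoid the per-character processing is sketched below. It does not implement the underflow idea itself; it swaps in block reads through the stream buffer's sgetn member, which is a different technique with the same goal. The name slurp_chunked, the 4096-byte chunk size, and the bool return value are assumptions.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical variant of slurp: reserve once, then copy in fixed-size blocks.
bool slurp_chunked(std::string& data, const std::string& filename, bool is_binary) {
    std::ios_base::openmode mode = std::ios::in | std::ios::ate;
    if (is_binary) mode |= std::ios::binary;
    std::ifstream file(filename, mode);
    if (!file) return false;

    data.clear();
    // tellg() may return -1, so use it only as a hint when it is positive.
    const std::streamoff hint = file.tellg();
    if (hint > 0) data.reserve(static_cast<std::size_t>(hint));
    file.seekg(0, std::ios::beg);

    char chunk[4096];
    std::streamsize got;
    // sgetn copies up to sizeof chunk characters per call; 0 means end of input.
    while ((got = file.rdbuf()->sgetn(chunk, sizeof chunk)) > 0) {
        data.append(chunk, static_cast<std::size_t>(got));
    }
    return true;
}
```

This still copies each block once through the temporary array, so it trades the per-character loop for one extra block copy.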

- Matt Price

1

You should check out the version of this code that uses std::vector for the initial read instead of a string. It is much, much faster. - oz10

6

Here is a version that uses the new filesystem library and has reasonably robust error checking:

#include <cstdint>
#include <exception>
#include <stdexcept>
#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

std::string loadFile(const char *const name);
std::string loadFile(const std::string &name);

std::string loadFile(const char *const name) {
  fs::path filepath(fs::absolute(fs::path(name)));

  std::uintmax_t fsize;

  if (fs::exists(filepath)) {
    fsize = fs::file_size(filepath);
  } else {
    throw(std::invalid_argument("File not found: " + filepath.string()));
  }

  std::ifstream infile;
  infile.exceptions(std::ifstream::failbit | std::ifstream::badbit);
  try {
    infile.open(filepath.c_str(), std::ios::in | std::ifstream::binary);
  } catch (...) {
    std::throw_with_nested(std::runtime_error("Can't open input file " + filepath.string()));
  }

  std::string fileStr;

  try {
    fileStr.resize(fsize);
  } catch (...) {
    std::stringstream err;
    err << "Can't resize to " << fsize << " bytes";
    std::throw_with_nested(std::runtime_error(err.str()));
  }

  infile.read(fileStr.data(), fsize);
  infile.close();

  return fileStr;
}

std::string loadFile(const std::string &name) { return loadFile(name.c_str()); };

- David G

infile.open can also accept std::string without converting with .c_str() - Matt Eding

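filepath isn't a std::string, it's a std::filesystem::path. It turns out std::ifstream::open can accept one of those as well. - David G

@DavidG, std::filesystem::path is implicitly convertible to std::string. - Jeffrey Cash

According to cppreference.com, the ::open member function on std::ifstream that accepts a std::filesystem::path operates as if the ::c_str() method were called on the path. The underlying ::value_type of paths is char under POSIX. - David G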

- Konrad Rudolph · Accepted Answer

One way is to flush the stream buffer into a separate memory stream and then convert that to std::string (error handling omitted):

std::string slurp(std::ifstream& in) {
    std::ostringstream sstr;
    sstr << in.rdbuf();
    return sstr.str();
}

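This is nicely concise. However, as noted in the question, it performs a redundant copy, and unfortunately there is fundamentally no way of eliding this copy.

The only real solution that avoids redundant copies is, unfortunately, to do the reading manually in a loop. Since C++ now guarantees contiguous strings, one could write the following (≥C++17, error handling included):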

auto read_file(std::string_view path) -> std::string {
    constexpr auto read_size = std::size_t(4096);
    auto stream = std::ifstream(path.data());
    stream.exceptions(std::ios_base::badbit);

    if (not stream) {
        throw std::ios_base::failure("file does not exist");
    }
    
    auto out = std::string();
    auto buf = std::string(read_size, '\0');
    while (stream.read(& buf[0], read_size)) {
        out.append(buf, 0, stream.gcount());
    }
    out.append(buf, 0, stream.gcount());
    return out;
}