What is the most efficient way to read a file into a std::string?

Question

3

I currently do this, and the conversion to std::string at the end takes 98% of the execution time. There must be a better way!

std::string
file2string(std::string filename)
{
    std::ifstream file(filename.c_str());
    if(!file.is_open()){
        // If they passed a bad file name, or one we have no read access to,
        // we pass back an empty string.
        return "";
    }
    // find out how much data there is
    file.seekg(0,std::ios::end);
    std::streampos length = file.tellg();
    file.seekg(0,std::ios::beg);
    // Get a vector that size and
    std::vector<char> buf(length);
    // Fill the buffer with the size
    file.read(&buf[0],length);
    file.close();
    // return buffer as string
    std::string s(buf.begin(),buf.end());
    return s;
}

- phorgan1

1

Why not just read into a char* and then use the string(const char* s, size_t n) constructor? - akappa

2

If you want efficient access to a large file as a string (a char* in the mmap case), you may want to take a look at mmap. - Aaron McDaid

1

Possible duplicate of Read whole ASCII file into C++ std::string. - Luc Touraille

I added another version, please try it in your benchmark :-) It is the same as the accepted answer in Luc's link. - Kerrek SB

Possible duplicate of What is the best way to read an entire file into a std::string in C++? and Read whole ASCII file into C++ std::string. - jww

4 Answers

5

You can try this:

#include <fstream>
#include <sstream>
#include <string>

int main()
{
  std::ostringstream oss;
  std::string s;
  std::string filename = get_file_name();

  if (oss << std::ifstream(filename, std::ios::binary).rdbuf())
  {
    s = oss.str();
  }
  else
  {
    // error
  }

  // now s contains your file     
}

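You can also use oss.str() directly if you like; just make sure there is some kind of error checking somewhere.

There is no guarantee that this is the most efficient way; you probably can't beat <cstdio> and fread. As @Benjamin pointed out, the string stream only exposes its data by copy, so you could instead read directly into the target string: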

#include <string>
#include <cstdio>

std::FILE * fp = std::fopen("file.bin", "rb");
std::fseek(fp, 0L, SEEK_END);
unsigned int fsize = std::ftell(fp);
std::rewind(fp);

std::string s(fsize, 0);
if (fsize != std::fread(static_cast<void*>(&s[0]), 1, fsize, fp))
{
   // error
}

std::fclose(fp);

You might want to use an RAII wrapper for the FILE*.

Edit: The fstream analogue of the second version looks like this:

#include <string>
#include <fstream>

std::ifstream infile("file.bin", std::ios::binary);
infile.seekg(0, std::ios::end);
unsigned int fsize = infile.tellg();
infile.seekg(0, std::ios::beg);

std::string s(fsize, 0);

if (!infile.read(&s[0], fsize))
{
   // error
}

Edit: Another version, using stream buffer iterators:

std::ifstream thefile(filename, std::ios::binary);
std::string s((std::istreambuf_iterator<char>(thefile)), std::istreambuf_iterator<char>());

(Note the extra parentheses, which are needed to get the correct parse.)
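To spell out why the parentheses matter: without them, the declaration is parsed as a function declaration (the "most vexing parse") instead of a string object. The sketch below wraps the iterator version in a function named slurp, a name chosen only for this example, and shows the problematic form in a comment.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical wrapper around the stream buffer iterator version.
std::string slurp(const char* path)
{
    std::ifstream in(path, std::ios::binary);

    // Without the extra parentheses around the first argument, the line
    //   std::string s(std::istreambuf_iterator<char>(in),
    //                 std::istreambuf_iterator<char>());
    // declares a function named `s` that returns std::string, with a
    // parameter called `in` and an unnamed function-pointer parameter.
    std::string s((std::istreambuf_iterator<char>(in)),
                  std::istreambuf_iterator<char>());
    return s;
}
```

If the file cannot be opened, this sketch simply returns an empty string; a caller that needs to tell the two cases apart would have to check the stream state separately.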

- Kerrek SB

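I'm pretty sure that move isn't buying you anything. ostringstream::str() returns by value. - Benjamin Lindley

@BenjaminLindley: Oh, good point. Just use oss.str() directly then. - Kerrek SB

I made a harness that calls each of the functions 1000 times to read a 1.3M jpeg. Kerrek's first version took 19 seconds and his second took 6 seconds. Mine took 14 seconds, and David's took 2 minutes 21 seconds. Can C++ do file I/O efficiently with standard template library elements? - phorgan1

@Patrick: If it's any consolation, the C library is part of the C++ standard library, so there is no shame in using <cstdio>. But let me post another version that uses <fstream>. Stay tuned. - Kerrek SB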

1

I don't know how efficient it is, but here is a simple (easy to read) way: just set EOF as the delimiter:

string buffer;

ifstream fin;
fin.open("filename.txt");

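if(fin.is_open()) {
    // '\x1A' is Ctrl-Z, the old DOS end-of-file marker; reading stops
    // early if the file happens to contain that byte
    getline(fin,buffer,'\x1A');

    fin.close();
}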

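Obviously, this depends on what goes on inside the getline algorithm, so you may want to look at the code in your standard library to see how it works.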

- derpface

1

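Ironically, the example for string::reserve is reading a file into a string. You don't want to read the file into one buffer and then have to allocate and copy into another one.

Here is the example code: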

// string::reserve
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main ()
{
  string str;
  size_t filesize;

  ifstream file ("test.txt",ios::in|ios::ate);
  filesize=file.tellg();

  str.reserve(filesize); // allocate space in the string

  file.seekg(0);
  for (char c; file.get(c); )
  {
    str += c;
  }
  cout << str;
  return 0;
}

- David Schwartz

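I agree, I'm not sure why they chose to do it that way. The important thing is to use str.reserve so that there is only a single allocation, and then read into the string. - David Schwartz

It's also important to have a correct example, isn't it? I hope you don't mind the edit. - Benjamin Lindley

I made a harness that calls each of the functions 1000 times to read a 1.3M jpeg. Kerrek's first version took 19 seconds and his second took 6 seconds. Mine took 14 seconds, and David's took 2 minutes 21 seconds. Can C++ do file I/O efficiently with standard template library elements? - phorgan1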

- Dietmar Kühl · Accepted Answer

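As a big fan of the C++ iterator abstraction and the algorithms, I would love the following to be the fastest way to read a file (or any other input stream) into a std::string (and then print the content):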

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    std::string s(std::istreambuf_iterator<char>(std::ifstream("file")
                                                 >> std::skipws),
                  std::istreambuf_iterator<char>());
    std::cout << "file='" << s << "'\n";
}

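This certainly is fast for my own implementation of IOStreams, but it takes a lot of trickery to actually make it fast. Primarily, it requires optimizing the algorithms to cope with segmented sequences: a stream can be seen as a sequence of input buffers. I'm not aware of any STL implementation that consistently does this optimization. The odd use of std::skipws is just a way to get a reference to the just-created stream: std::istreambuf_iterator<char> expects a reference, and the temporary file stream wouldn't bind to one.

Since this probably isn't the fastest approach, I would be inclined to use std::getline() with a particular "newline" character, i.e. one that does not occur in the file: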

std::string s;
// optionally reserve space, although I wouldn't be too fussed about the
// reallocations because the reads probably dominate the performance
std::getline(std::ifstream("file") >> std::skipws, s, 0);

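This assumes that the file doesn't contain a null character; any other character that does not occur in the file would do as well. Unfortunately, std::getline() takes a char_type as its delimiter argument rather than an int_type, which is what the member std::istream::getline() takes for the delimiter: with an int_type you could use eof() as a character that never occurs (char_type, int_type, and eof() refer to the respective members of char_traits<char>). The member version, in turn, can't be used because you would need to know ahead of time how many characters are in the file.

BTW, I saw some attempts to use seeking to determine the size of the file. This is bound not to work too well. The problem is that the code conversion done in std::ifstream (well, actually in std::filebuf) can produce a different number of characters than there are bytes in the file. Admittedly, this isn't the case when using the default C locale, and it is possible to detect that no conversion is being done. Otherwise, the best bet for the stream would be to run over the file and determine the number of characters being produced. I actually think this is what would need to be done when the code conversion could do something interesting, although I don't think it actually is done. However, none of the examples explicitly sets up the C locale, e.g. using std::locale::global(std::locale("C"));. Even with this, it is also necessary to open the file in std::ios_base::binary mode, because otherwise end-of-line sequences may be replaced by a single character when reading. Admittedly, this would only make the result shorter, never longer.

The other approaches that extract from a std::streambuf* (i.e. those involving rdbuf()) all require that the resulting content is copied at some point. Given that the file may actually be very large, this may not be an option. Without the copy, this could very well be the fastest approach, however. To avoid the copy, it is possible to create a simple custom stream buffer that takes a reference to a std::string as its constructor argument and appends directly to that std::string: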
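If you still want to size the string by seeking, one defensive sketch is to treat the seek result as an upper bound and shrink the string to what was actually read, using gcount(). This is an illustration under assumptions rather than code from the thread: the name read_sized is invented, the file is opened in binary mode as recommended above, and no locale with a converting codecvt facet is assumed to be installed.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: sizes the string from the file length, then trims it
// to the number of characters the stream really delivered.
bool read_sized(const char* path, std::string& out)
{
    // std::ios::ate opens the file positioned at the end, so tellg()
    // reports the length without a separate seekg call.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return false;

    const std::streamoff size = in.tellg();
    if (size < 0) return false;
    in.seekg(0, std::ios::beg);

    out.assign(static_cast<std::string::size_type>(size), '\0');
    if (size > 0)
        in.read(&out[0], static_cast<std::streamsize>(size));

    // If fewer characters arrived than the seek suggested, keep only those.
    out.resize(static_cast<std::string::size_type>(in.gcount()));
    return !in.bad();
}
```

This covers the case described above where the result is shorter than the byte count; it does nothing for a conversion that would produce more characters than bytes.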

#include <fstream>
#include <iostream>
#include <string>

class custombuf:
    public std::streambuf
{
public:
    custombuf(std::string& target): target_(target) {
        this->setp(this->buffer_, this->buffer_ + bufsize - 1);
    }

private:
    std::string& target_;
    enum { bufsize = 8192 };
    char buffer_[bufsize];
    int overflow(int c) {
        if (!traits_type::eq_int_type(c, traits_type::eof()))
        {
            *this->pptr() = traits_type::to_char_type(c);
            this->pbump(1);
        }
        this->target_.append(this->pbase(), this->pptr() - this->pbase());
        this->setp(this->buffer_, this->buffer_ + bufsize - 1);
        return traits_type::not_eof(c);
    }
    int sync() { this->overflow(traits_type::eof()); return 0; }
};

int main()
{
    std::string s;
    custombuf   sbuf(s);
    if (std::ostream(&sbuf)
        << std::ifstream("readfile.cpp").rdbuf()
        << std::flush) {
        std::cout << "file='" << s << "'\n";
    }
    else {
        std::cout << "failed to read file\n";
    }
}

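With a suitably chosen buffer, I would expect this version to be fairly fast. Which version is the fastest will certainly depend on the system, the standard C++ library being used, and probably a number of other factors, i.e. you will want to measure the performance.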
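A measurement along those lines could be sketched as follows. The harness name time_reader, its signature, and the choice of std::chrono::steady_clock are assumptions made for this example; read_with_rdbuf is the rdbuf() approach from the discussion wrapped in a function so it can be passed in.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// One candidate to measure: extraction through rdbuf() into a string stream.
std::string read_with_rdbuf(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    std::ostringstream oss;
    oss << in.rdbuf();
    return oss.str();
}

// Hypothetical harness: calls reader(path) `runs` times and returns the
// elapsed wall-clock time in seconds. The total number of bytes read is
// reported through `bytes_read` so the work cannot be optimized away and
// the caller can sanity-check the result.
template <typename Reader>
double time_reader(Reader reader, const std::string& path, int runs,
                   std::size_t& bytes_read)
{
    bytes_read = 0;
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
        bytes_read += reader(path).size();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Repeated reads of the same file are usually served from the operating system's cache, so numbers from a loop like this mostly compare the library overhead of each approach rather than disk speed.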