多个分隔符的快速字符串分割

Question

多个分隔符的快速字符串分割

28

我在StackOverflow上调查了一些时间，寻找将带有多个分隔符的字符串拆分成vector< string >的好算法。我还找到了一些方法：

Boost方法：

boost::split(vector, string, boost::is_any_of(" \t"));

the getline method:

std::stringstream ss(string);
std::string item;
while(std::getline(ss, item, ' ')) {
    vector.push_back(item);
}

Boost的分词方式：

char_separator<char> sep(" \t");
tokenizer<char_separator<char>> tokens(string, sep);
BOOST_FOREACH(string t, tokens)
{
   vector.push_back(t);
}

还有一种很酷的STL方式：

     istringstream iss(string);
     copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     back_inserter<vector<string> >(vector));

这些方法大多来自于此主题，但是很遗憾它们都不能解决我的问题：

Boost的split使用简单，但在处理大数据（最好情况下约1.5*10^6个单独元素）和使用约10个分隔符时，速度非常慢。
getline、STL和Shadow2531的方法存在一个问题，即我只能使用一个单一字符作为分隔符，而我需要更多。
Boost的tokenize在速度方面更加缓慢。使用10个分隔符将字符串拆分成1.5*10^6个元素需要11秒钟。

所以我不知道该怎么办：我想要一个真正快速的字符串拆分算法，并且可以使用多个分隔符。

Boost的split是否已达到极限？还是有更快的方法？

注：请保留HTML标签。

- Paul

你尝试过多线程吗？例如，将你的分隔符(delimiter)分成N组，然后为每个组运行一个线程来按照该分隔符组拆分字符串并填充到一个列表中，然后再将这些列表重新合并起来。 - Nick Banks

肯定有更快的方法。这个命令：cat loremipsum_big.txt | ruby -e "ary = Array.new; ARGF.each {|x| ary << x.split(/a| /)}; puts ary" | wc -l 在虚拟化的 Ubuntu 上仅用了3.74秒就创建了2,516,715个元素，并将它们推入了一个数组中。 - karatedog

5

以下是来自http://www.codeproject.com/KB/recipes/Tokenizer.aspx的一些示例，它们非常高效。String Toolkit Library使得C++中复杂的字符串处理变得简单易行。 - user760162

3个回答

2

为了结合Pablo和larsmans的回答的优点，使用(offset, size)对存储子字符串，并使用strcspn获取每个条目的范围。

- MSN

1

在处理如此大的字符串时，使用Ropes可能会更加划算。或者像Pablo推荐的那样使用字符串视图：(char const*,size_t) 对。如果你有一个好的 strpbrk 实现，那么 bitset 技巧就不再必要了。

- Fred Foo

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Pablo · Accepted Answer

有两个建议：

使用字符串视图代替字符串作为分割结果，可以节省大量的内存分配。
如果您知道只会处理字符（在[0,255]范围内），尝试使用位集来测试成员资格，而不是查找分隔符字符。

这里是应用这些想法的快速尝试：

#include <vector>
#include <bitset>
#include <iostream>
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>
#include <boost/timer.hpp>

using namespace std;
size_t const N = 10000000;

template<typename C>
void test_custom(string const& s, char const* d, C& ret)
{
  C output;

  bitset<255> delims;
  while( *d )
  {
    unsigned char code = *d++;
    delims[code] = true;
  }
  typedef string::const_iterator iter;
  iter beg;
  bool in_token = false;
  for( string::const_iterator it = s.begin(), end = s.end();
    it != end; ++it )
  {
    if( delims[*it] )
    {
      if( in_token )
      {
        output.push_back(typename C::value_type(beg, it));
        in_token = false;
      }
    }
    else if( !in_token )
    {
      beg = it;
      in_token = true;
    }
  }
  if( in_token )
    output.push_back(typename C::value_type(beg, s.end()));
  output.swap(ret);
}

template<typename C>
void test_strpbrk(string const& s, char const* delims, C& ret)
{
  C output;

  char const* p = s.c_str();
  char const* q = strpbrk(p+1, delims);
  for( ; q != NULL; q = strpbrk(p, delims) )
  {
    output.push_back(typename C::value_type(p, q));
    p = q + 1;
  }

  output.swap(ret);
}

template<typename C>
void test_boost(string const& s, char const* delims)
{
  C output;
  boost::split(output, s, boost::is_any_of(delims));
}

int main()
{
  // Generate random text
  string text(N, ' ');
  for( size_t i = 0; i != N; ++i )
    text[i] = (i % 2 == 0)?('a'+(i/2)%26):((i/2)%2?' ':'\t');

  char const* delims = " \t[],-'/\\!\"§$%&=()<>?";

  // Output strings
  boost::timer timer;
  test_boost<vector<string> >(text, delims);
  cout << "Time: " << timer.elapsed() << endl;

  // Output string views
  typedef string::const_iterator iter;
  typedef boost::iterator_range<iter> string_view;
  timer.restart();
  test_boost<vector<string_view> >(text, delims);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string> vs;
  test_custom(text, delims, vs);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string_view> vsv;
  test_custom(text, delims, vsv);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string> vsp;
  test_strpbrk(text, delims, vsp);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string_view> vsvp;
  test_strpbrk(text, delims, vsvp);
  cout << "Time: " << timer.elapsed() << endl;

  return 0;
}

使用启用-O4选项的GCC 4.5.1编译Boost 1.46.1时，我得到以下结果：

时间：5.951秒（Boost.Split + vector）
时间：3.728秒（Boost.Split + vector）
时间：1.662秒（自定义split + vector）
时间：0.144秒（自定义split + vector）
时间：2.13秒（Strpbrk + vector）
时间：0.527秒（Strpbrk + vector）

注意：由于我的自定义函数会删除空令牌，因此输出略有差异。但是如果您决定使用它，可以根据自己的需求调整代码。