如何提高boost interval_map查找的性能

4
我正在使用 boost::icl::interval_map 将字节范围映射到一组字符串。该映射是从(有序的)磁盘文件加载的,然后我使用下面的代码进行查找。
问题在于查找非常缓慢。
在我的测试中,我将66425个范围插入了该映射中。我对代码进行了分析,基本上 > 99.9% 的时间都花在各种 Boost 函数中(没有特别慢的函数,很多时间分布在许多函数上)。
有什么方法可以使其更快?
我的树是否不平衡(如何找出?如何重新平衡?)
使用 set<string> 是否存在问题?
计算地图和窗口的交集是否存在问题?(虽然这是我需要的,所以我看不出还有其他方法可以做到)。
using namespace std;
typedef set<string> objects;
typedef boost::icl::interval_map<uint64_t, objects> ranges;
void
find_range (const ranges *map, uint64_t start, uint64_t end,
            void (*f) (uint64_t start, uint64_t end, const char *object,
                       void *opaque),
            void *opaque)
{
  boost::icl::interval<uint64_t>::type window;
  window = boost::icl::interval<uint64_t>::right_open (start, end);

  ranges r = *map & window;

  ranges::iterator iter = r.begin ();
  while (iter != r.end ()) {
    boost::icl::interval<uint64_t>::type range = iter->first;
    uint64_t start = range.lower ();
    uint64_t end = range.upper ();

    objects obj_set = iter->second;
    objects::iterator iter2 = obj_set.begin ();
    while (iter2 != obj_set.end ()) {
      f (start, end, iter2->c_str (), opaque);
      iter2++;
    }
    iter++;
  }
}

前几个配置文件条目:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
  9.77      0.13     0.13 21866814     0.01           boost::icl::interval_bounds::interval_bounds(unsigned char)
  6.02      0.21     0.08  9132134     0.01           boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::lower(boost::icl::discrete_interval<unsigned long, std::less> const&)
  6.02      0.29     0.08  6004967     0.01           boost::enable_if<boost::icl::is_discrete_interval<boost::icl::discrete_interval<unsigned long, std::less> >, bool>::type boost::icl::is_empty<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::discrete_interval<unsigned long, std::less> const&)
  5.26      0.36     0.07 21210093     0.00           boost::icl::discrete_interval<unsigned long, std::less>::bounds() const
  5.26      0.43     0.07 11964109     0.01           std::less<unsigned long>::operator()(unsigned long const&, unsigned long const&) const
  4.51      0.49     0.06 35761849     0.00           std::_Rb_tree<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::_Identity<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_S_left(std::_Rb_tree_node_base const*)
  4.51      0.55     0.06 12009934     0.00           boost::icl::operator==(boost::icl::interval_bounds, boost::icl::interval_bounds)
  3.76      0.60     0.05 12078493     0.00           boost::icl::discrete_interval<unsigned long, std::less>::upper() const
  3.76      0.65     0.05 12077959     0.00           boost::enable_if<boost::icl::is_interval<boost::icl::discrete_interval<unsigned long, std::less> >, boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type>::type boost::icl::upper<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::discrete_interval<unsigned long, std::less> const&)
  3.76      0.70     0.05  8837475     0.01           boost::icl::interval_bounds::bits() const
  3.76      0.75     0.05  6004967     0.01           boost::enable_if<boost::icl::is_interval<boost::icl::discrete_interval<unsigned long, std::less> >, bool>::type boost::icl::domain_less_equal<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type const&, boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type const&)
  3.01      0.79     0.04  5891650     0.01           boost::icl::is_right_closed(boost::icl::interval_bounds)

数据集:http://oirase.annexia.org/tmp/bmap.txt
完整代码:http://git.annexia.org/?p=virt-bmap.git;a=tree


1
如果你想在这里得到认真的帮助,你需要发布一个自包含的样本和数据集,以便暴露出性能问题。 - sehe
这段代码对你做了什么?boost::icl::interval<uint64_t>::type range = iter->first; 将一个集合变成范围? - Surt
交集(map & window)创建了另一个区间映射(r)。迭代遍历这个映射。iter->first返回迭代器的第一部分(==键== uint64_t区间)。 - Rich
2个回答

16
在这个答案中,我提出了三个优化:
  1. 使用boost::container::flat_set替换对象std::set,以改善局部性(并且很可能减少重新分配成本,因为大多数对象集都是<4个元素)。
  2. 使用string_atom替换对象idstd::string(在技术上是char const*但在逻辑上是唯一的)。这种技术类似于各种VM实现(如Java/.NET)中的内部字符串。
  3. 使用equal_range调用替换映射交集,通过避免复制(大量的)数据来节省大量时间。

更新: 在下面的最终版本中,将boost::container::flat_map简单地替换回std::set会降低性能,仅find_range的因子约为2倍,而find_range_ex则约为4倍,在我的测试系统上。

注意: make_atom的当前实现过于简单,并且永远不会释放原子。您可能希望将字符串备份到deque中,使用Boost Flyweights、池分配器或其中某些组合使其更智能化。但是,是否需要这样做取决于您的用例。

更新: 当仅应用此优化时,速度提升已经达到了26~30倍。请注意,内存使用量大约是另外两个优化的两倍,为20MiB左右。

容量和数据效率

在开始之前,我想了解数据的样式。因此,编写一些代码来解析 bmap.txt 输入,我们可以得到:

在 Coliru 上查看源代码

Parsed ok
Histogram of 66425 input lines
d: 3367
f: 20613
p: 21222
v: 21223
ranges size:            6442450944
ranges iterative size:  21223
Min object set:         1.000000
Max object set:         234.000000
Average object set:     3.129859
Min interval width:     1024.000000
Max interval width:     2526265344.000000
Average interval width: 296.445177k
First:                  [0,1048576)
Last:                   [3916185600,6442450944)
String atoms:           23904 unique in 66425 total
Atom efficiency:        35.986451%

正如您所看到的,这些集合通常包含约3个项目,并且有许多重复项。

使用make_atom对象命名方法和boost::flat_set可以将内存分配从~15GiB降至~10Gib

此优化还将字符串比较减少为指针比较,用于集合插入和interval_map的组合策略,因此对于更大的数据集,这具有潜在的加速作用。

查询效率

部分复制输入确实严重影响了查询性能。

不要复制,而是通过替换来查看重叠范围:

  const ranges r = *map & window;
  ranges::const_iterator iter = r.begin ();
  while (iter != r.end ()) {

  auto r = map->equal_range(window);
  ranges::const_iterator iter = r.first;
  while (iter != r.second) {

在我的系统上,使用两个版本运行10000个相同的随机查询,结果表明速度提升了32倍。
10000 'random' OLD lookups resulted in 156729884 callbacks in 29148ms
10000 'random' NEW lookups resulted in 156729884 callbacks in 897ms

real    0m31.715s
user    0m31.664s
sys 0m0.012s

现在运行时主要由解析/统计控制。完整的基准测试代码在这里:On Coliru

#define BOOST_RESULT_OF_USE_DECTYPE
#define BOOST_SPIRIT_USE_PHOENIX_V3

/* virt-bmap examiner plugin
 * Copyright (C) 2014 Red Hat Inc.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <assert.h>

#include <boost/icl/interval.hpp>
#include <boost/icl/interval_set.hpp>
#include <boost/icl/interval_map.hpp>
#include <boost/container/flat_set.hpp>

using namespace std;

/* Maps intervals (uint64_t, uint64_t) to a set of strings, where each
 * string represents an object that covers that range.
 */

static size_t atoms_requested = 0;
static size_t atoms_unique_created = 0;

using string_atom = const char*;
string_atom make_atom(std::string&& s)
{
    static std::set<std::string> s_atoms;
    atoms_requested += 1;

    auto it = s_atoms.find(s);
    if (it != s_atoms.end())
        return it->c_str();

    atoms_unique_created += 1;
    return s_atoms.insert(std::move(s)).first->c_str();
}

typedef boost::container::flat_set<string_atom> objects;
typedef boost::icl::interval_map<uint64_t, objects> ranges;

ranges*
new_ranges (void)
{
  return new ranges ();
}

void
free_ranges (ranges *mapv)
{
  ranges *map = (ranges *) mapv;
  delete map;
}

extern "C" void
insert_range (void *mapv, uint64_t start, uint64_t end, const char *object)
{
  ranges *map = (ranges *) mapv;
  objects obj_set;
  obj_set.insert (obj_set.end(), object);
  map->add (std::make_pair (boost::icl::interval<uint64_t>::right_open (start, end), // SEHE added std::
                       obj_set));
}

extern "C" void
iter_range (void *mapv, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
{
  ranges *map = (ranges *) mapv;
  ranges::iterator iter = map->begin ();
  while (iter != map->end ()) {
    boost::icl::interval<uint64_t>::type range = iter->first;
    uint64_t start = range.lower ();
    uint64_t end = range.upper ();

    objects obj_set = iter->second;
    objects::iterator iter2 = obj_set.begin ();
    while (iter2 != obj_set.end ()) {
      f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
      iter2++;
    }
    iter++;
  }
}

extern "C" void
find_range (void const *mapv, uint64_t start, uint64_t end, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
{
  const ranges *map = (const ranges *) mapv;

  boost::icl::interval<uint64_t>::type window;
  window = boost::icl::interval<uint64_t>::right_open (start, end);

  const ranges r = *map & window;

  ranges::const_iterator iter = r.begin ();
  while (iter != r.end ()) {
    boost::icl::interval<uint64_t>::type range = iter->first;
    uint64_t start = range.lower ();
    uint64_t end = range.upper ();

    objects obj_set = iter->second;
    objects::iterator iter2 = obj_set.begin ();
    while (iter2 != obj_set.end ()) {
      f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
      iter2++;
    }
    iter++;
  }
}

extern "C" void
find_range_ex (void const *mapv, uint64_t start, uint64_t end, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
{
  const ranges *map = (const ranges *) mapv;

  boost::icl::interval<uint64_t>::type window;
  window = boost::icl::interval<uint64_t>::right_open (start, end);

#if 0
  const ranges r = *map & window;
  ranges::const_iterator iter = r.begin ();
  while (iter != r.end ()) {
#else
  auto r = map->equal_range(window);
  ranges::const_iterator iter = r.first;
  while (iter != r.second) {
#endif

    boost::icl::interval<uint64_t>::type range = iter->first;
    uint64_t start = range.lower ();
    uint64_t end = range.upper ();

    objects obj_set = iter->second;
    objects::iterator iter2 = obj_set.begin ();
    while (iter2 != obj_set.end ()) {
      f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
      iter2++;
    }
    iter++;
  }
}

#include <memory>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>
#include <fstream>
#include <chrono>

std::map<char, size_t> histo;

bool insert_line_of_input(ranges& bmap_data, uint64_t b, uint64_t e, char type, std::string& object) {
    if (!object.empty())
        histo[type]++;
    //std::cout << std::hex << b << " " << e << " " << type << " " << object << "\n";

#if 0
    object.insert(object.begin(), ':');
    object.insert(object.begin(), type);
#endif
    insert_range(&bmap_data, b, e, make_atom(std::move(object)));
    return true;
}

std::vector<std::pair<uint64_t, uint64_t> > generate_test_queries(ranges const& bmap_data, size_t n) {
    std::vector<std::pair<uint64_t, uint64_t> > queries;
    queries.reserve(n);

    for (size_t i = 0; i < n; ++i)
    {
        auto start = (static_cast<uint64_t>(rand()) * rand()) % bmap_data.size();
        auto end   = start + rand();

        queries.emplace_back(start,end);
    }

    return queries;
}

ranges read_mapfile(const char* fname) {
    std::ifstream ifs(fname);
    boost::spirit::istream_iterator f(ifs >> std::noskipws), l;

    ranges bmap_data;

    namespace phx = boost::phoenix;
    using namespace boost::spirit::qi;
    uint_parser<uint64_t, 16> offset;
    if (!phrase_parse(f,l,
                ("1 " >> offset >> offset >> char_("pvdf") >> as_string[lexeme[+graph] >> attr('/') >> lexeme[*~char_("\r\n")]]) 
                [ _pass = phx::bind(insert_line_of_input, phx::ref(bmap_data), _1, _2, _3, _4) ] % eol >> *eol, 
                blank))
    {
        exit(255);
    } else
    {
        std::cout << "Parsed ok\n";
    }

    if (f!=l)
        std::cout << "Warning: remaining input '" << std::string(f,l) << "\n";

    return bmap_data;
}

void report_statistics(ranges const& bmap_data) {
    size_t total = 0;
    for (auto e : histo) total += e.second;

    std::cout << "Histogram of " << total << " input lines\n";

    for (auto e : histo)
        std::cout << e.first << ": " << e.second << "\n";

    namespace ba = boost::accumulators;
    ba::accumulator_set<double, ba::stats<ba::tag::mean, ba::tag::max, ba::tag::min> > 
        object_sets, interval_widths;

    for (auto const& r : bmap_data)
    {
        auto width = r.first.upper() - r.first.lower();
        assert(width % 1024 == 0);

        interval_widths(width);
        object_sets(r.second.size());
    }

    std::cout << std::fixed;
    std::cout << "ranges size:            " << bmap_data.size()                 << "\n";
    std::cout << "ranges iterative size:  " << bmap_data.iterative_size()       << "\n";

    std::cout << "Min object set:         " << ba::min(object_sets)             << "\n" ;
    std::cout << "Max object set:         " << ba::max(object_sets)             << "\n" ;
    std::cout << "Average object set:     " << ba::mean(object_sets)            << "\n" ;
    std::cout << "Min interval width:     " << ba::min(interval_widths)         << "\n" ;
    std::cout << "Max interval width:     " << ba::max(interval_widths)         << "\n" ;
    std::cout << "Average interval width: " << ba::mean(interval_widths)/1024.0 << "k\n" ;
    std::cout << "First:                  " << bmap_data.begin()->first         << "\n" ;
    std::cout << "Last:                   " << bmap_data.rbegin()->first        << "\n" ;

    std::cout << "String atoms:           " << atoms_unique_created << " unique in " << atoms_requested << " total\n";
    std::cout << "Atom efficiency:        " << (atoms_unique_created*100.0/atoms_requested) << "%\n";
}

void perform_comparative_benchmarks(ranges const& bmap_data, size_t number_of_queries) {
    srand(42);
    auto const queries = generate_test_queries(bmap_data, number_of_queries);

    using hrc = std::chrono::high_resolution_clock;
    {
        auto start = hrc::now();
        size_t callbacks = 0;

        for (auto const& q: queries)
        {
            find_range(&bmap_data, q.first, q.second, 
                    [](uint64_t start, uint64_t end, const char *object, void *opaque) {
                    ++(*static_cast<size_t*>(opaque));
                    }, &callbacks);
        }
        std::cout << number_of_queries << " 'random' OLD lookups resulted in " << callbacks 
                  << " callbacks in " << std::chrono::duration_cast<std::chrono::milliseconds>((hrc::now()-start)).count() << "ms\n";
    }

    {
        auto start = hrc::now();
        size_t callbacks = 0;

        for (auto const& q: queries)
        {
            find_range_ex(&bmap_data, q.first, q.second, 
                    [](uint64_t start, uint64_t end, const char *object, void *opaque) {
                    ++(*static_cast<size_t*>(opaque));
                    }, &callbacks);
        }
        std::cout << number_of_queries << " 'random' NEW lookups resulted in " << callbacks 
                  << " callbacks in " << std::chrono::duration_cast<std::chrono::milliseconds>((hrc::now()-start)).count() << "ms\n";
    }
}

int main() {
    auto bmap = read_mapfile("bmap.txt");

    report_statistics(bmap);

    perform_comparative_benchmarks(bmap, 1000);

#if 0 // to dump ranges to console
    for (auto const& r : bmap)
    {
        std::cout << r.first << "\t" << r.second.size() << "\t";
        std::copy(r.second.begin(), r.second.end(), std::ostream_iterator<std::string>(std::cout, "\t"));
        std::cout << "\n";
    }
#endif
}

@Chris 我已经将与各种优化相关的更改以及基准驱动程序代码提交到了一个**公共github分支**,以便轻松审查。Makefile包含在q27152834.mak中。 - sehe
根据此链接,在 make_atom 中使用 std::unordered_set 可以使该函数的速度再提升一个数量级。 - Surt
@Surt 一切都取决于情况。无论如何,优化 make_atom 并不重要,它与程序运行时间无关,当然也与 find_rages_times 无关 :) (你最好考虑是否需要修复资源泄漏问题)。始终遵循分析器的建议! - sehe
这只是因为我真的很喜欢只有一个常量字符串副本的想法,所以我立即采用了set :) 为了防止泄漏,你可以像这样制作using strset = std :: unordered_set <std :: string>; strset&GetStrSet(){static strset theOne; return theOne;} ,并将make_atom 更改为 static strset&s_atoms = GetStrSet();,当您想停止使用它时,请调用 GetStrSet().clear(); - Surt
明白了 :) 无论如何,我提出了几种类似的方法。如果你问我,这取决于OP。 - sehe

1

这段代码存在许多问题。

  1. 内存使用。
    由于有66425个范围,您的L1缓存已经超载,并且包含的字符串集也会使L2D超载,甚至可能超过L3。这意味着每次数据访问都会有50-200个CPU周期的延迟,而您的乱序执行仅涵盖约32个周期,这意味着CPU基本上会一直停滞。如果通过硬件预取器顺序访问所有内存,则可以大大缓解这种情况。

  2. 通过映射进行指针跟踪。
    映射和集通常作为带指针的rb_tree实现。(interval_map可能不同?) 为了进一步增加问题,指针的访问将确保数据不是按顺序访问的,这意味着您将受到高延迟的影响。

  3. 调用函数指针/虚拟函数。
    令人惊讶的是,除非在f内使用更多的interval函数,否则这不会出现在前12位。稍后当您解决了其他问题时,您可能会发现对该函数的每次调用都会引入X个周期的延迟,其中X是CPU流水线的长度。

如果您使用perf来获取性能数据,请添加一个带有perf stat -d运行结果的结果。这将显示上述问题,包括大量缓存未命中和空闲CPU。
使用set<string>是不好的,因为它需要指针追踪,您应该改用vector<string>,但需要自己保持排序。这应该加快对f的访问速度,但不能解决其他问题。
interval_map中添加分配器,实现arena,可能会加快访问速度,因为数据应该有更好的本地化。

好的,虽然66K数据范围似乎不是太多的数据量。也许有其他的结构可以更好地处理这个问题? - Rich
66K * sizeof(node) * averageSize(set<string>) * averageLength(string) = 虽然超过L3的限制,但并不是致命问题。64K323216=2^162^52^52^4=2^30或一个GB,这个估计合理吗?为了解决这个问题,您需要重新考虑整个问题,也许可以对所有结构使用向量。 - Surt
1
输入文件的大小为3285763字节。(当然,输入文件不是内存数据结构,但我希望内存数据不会大一个数量级)。我的机器有4 MB的L3缓存。然而,访问具有很多局部性。我设法通过手动缓存相同查找重复两次的情况来使代码运行得更快。这对我来说听起来不像是缓存问题。 - Rich
@Rich,请尝试运行 perf stat -d - Surt
@Rich好消息,我的答案有三个简单的调整,结合起来可以创造出约32倍的性能提升 - sehe

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接