Aligning Heap Arrays in C and C++ to Aid Compiler (GCC) Vectorization

Question


c++ gcc dynamic-memory-allocation vectorization memory-alignment

8

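I'm designing a container wrapper template class for std::vector that automatically creates a multi-resolution pyramid of its elements.

The key issue now is that I want the creation of the pyramid to be auto-vectorizable by GCC.

All data arrays stored internally in the std::vector and in my resolution pyramid are created on the heap, using either the standard new or the allocator template parameter. Is there some way I can help the compiler by forcing a specific alignment on my data, so that the vectorization can operate on (blocks of) array elements with the optimal alignment (typically 16)?

I'm therefore using my custom allocator AlignmentAllocator, but the GCC auto-vectorization output still reports unaligned memory accesses on line 144 of multi_resolution.hpp, inside std::mr_vector::construct_pyramid, which contains the expression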

for (size_t s = 1; s < snum; s++) { // for each cached scale
...
}

as follows:

tests/../multi_resolution.hpp:144: note: Detected interleaving *D.3088_68 and MEM[(const value_type &)D.3087_61]
tests/../multi_resolution.hpp:144: note: versioning for alias required: can't determine dependence between *D.3088_68 and *D.3082_53
tests/../multi_resolution.hpp:144: note: mark for run-time aliasing test between *D.3088_68 and *D.3082_53
tests/../multi_resolution.hpp:144: note: versioning for alias required: can't determine dependence between MEM[(const value_type &)D.3087_61] and *D.3082_53
tests/../multi_resolution.hpp:144: note: mark for run-time aliasing test between MEM[(const value_type &)D.3087_61] and *D.3082_53
tests/../multi_resolution.hpp:144: note: found equal ranges MEM[(const value_type &)D.3087_61], *D.3082_53 and *D.3088_68, *D.3082_53
tests/../multi_resolution.hpp:144: note: Vectorizing an unaligned access.
tests/../multi_resolution.hpp:144: note: Vectorizing an unaligned access.
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: strided group_size = 2 .
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: unaligned supported by hardware.
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: inside_cost = 4, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: unaligned supported by hardware.
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
tests/../multi_resolution.hpp:144: note: vect_model_store_cost: unaligned supported by hardware.
tests/../multi_resolution.hpp:144: note: vect_model_store_cost: inside_cost = 2, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: cost model: Adding cost of checks for loop versioning aliasing.

tests/../multi_resolution.hpp:144: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
tests/../multi_resolution.hpp:144: note: Cost model analysis: 
  Vector inside of loop cost: 10
  Vector outside of loop cost: 21
  Scalar iteration cost: 5
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7

tests/../multi_resolution.hpp:144: note:   Profitability threshold = 6

tests/../multi_resolution.hpp:144: note: Profitability threshold is 6 loop iterations.
tests/../multi_resolution.hpp:144: note: create runtime check for data references *D.3088_68 and *D.3082_53
tests/../multi_resolution.hpp:144: note: created 1 versioning for alias checks.

tests/../multi_resolution.hpp:144: note: LOOP VECTORIZED.

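Can I somehow (strongly) type-annotate the pointer value returned from memalign with its alignment, so that GCC can tell that the region pointed to by data() has the required alignment (16 in this case)?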
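One way to pass that alignment promise to the optimizer is the `__builtin_assume_aligned` builtin, which GCC provides from version 4.7 onward (Clang accepts it as well), so it may not exist in the compiler that produced the log above. The following is only a sketch under stated assumptions: `alloc_floats16` and `halve_level` are hypothetical names, `float` stands in for `value_type`, a plain average stands in for `pnw::arithmetic_mean`, and `posix_memalign` replaces `memalign`. The GNU `__restrict__` qualifier is added because the log also reports run-time alias checks between the source and destination levels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocate n floats on a 16-byte boundary (POSIX only).
inline float* alloc_floats16(std::size_t n)
{
    void* p = nullptr;
    if (posix_memalign(&p, 16, n * sizeof(float)) != 0) return nullptr;
    return static_cast<float*>(p);   // release with std::free()
}

// Hypothetical stand-in for one dyadic reduction step of the pyramid:
// dst[i] = mean(src[2i], src[2i+1]) for i in [0, n_dst).
inline void halve_level(float* __restrict__ dst_raw,
                        const float* __restrict__ src_raw,
                        std::size_t n_dst)
{
    // Debug-build check of the promise made below.
    assert(reinterpret_cast<std::uintptr_t>(dst_raw) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src_raw) % 16 == 0);

    // Tell the optimizer both buffers start on a 16-byte boundary.
    // Passing a pointer that is not so aligned is undefined behavior.
    float* dst = static_cast<float*>(__builtin_assume_aligned(dst_raw, 16));
    const float* src =
        static_cast<const float*>(__builtin_assume_aligned(src_raw, 16));

    for (std::size_t i = 0; i < n_dst; ++i)
        dst[i] = 0.5f * (src[2 * i] + src[2 * i + 1]);
}
```

The builtin only affects what the compiler may assume; it does not align anything itself, so the pointers still have to come from an aligned allocation such as the one the allocator in this question performs.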

/ Per

The code for the mr_vector template class in multi_resolution.hpp:

/*!
 * @file: multi_resolution.hpp
 * @brief: Multi-Resolution Containers.
 * @author: Copyright (C) 2011 Per Nordlöw (per.nordlow@gmail.com)
 * @date: 2011-06-29 12:22
 */

#pragma once

#include <vector>
#include <algorithm>
#include "bitwise.hpp"
#include "mean.hpp"
#include "allocators.hpp"
#include "ostream_x.hpp"

namespace std
{

/*! Multi-Resolution Vector with Allocator Alignment for each Level. */
//template<typename _Tp, typename _Alloc = std::allocator<_Tp> >
template<typename _Tp, std::size_t _Alignment = 16>
class mr_vector
{
    // Concept requirements.
    typedef AlignmentAllocator<_Tp, _Alignment> _Alloc;
    typedef typename _Alloc::value_type                _Alloc_value_type;
    __glibcxx_class_requires(_Tp, _SGIAssignableConcept)
    __glibcxx_class_requires2(_Tp, _Alloc_value_type, _SameTypeConcept)

    typedef _Vector_base<_Tp, _Alloc>            _Base;
    typedef typename _Base::_Tp_alloc_type       _Tp_alloc_type;
public:
    typedef _Tp                                      value_type;
    typedef typename _Tp_alloc_type::pointer         pointer;
    typedef typename _Tp_alloc_type::const_pointer   const_pointer;
    typedef typename _Tp_alloc_type::reference       reference;
    typedef typename _Tp_alloc_type::const_reference const_reference;
    typedef size_t                                   size_type;
    typedef ptrdiff_t                                difference_type;
    typedef _Alloc                                   allocator_type;

protected:
    // using _Base::_M_allocate;
    // using _Base::_M_deallocate;
    // using _Base::_M_impl;
    // using _Base::_M_get_Tp_allocator;

public:
    mr_vector(size_t n)
        : m_bot(n), m_datas(nullptr), m_sizes(nullptr) { construct_pyramid(); }
    mr_vector(size_t n, value_type value)
        : m_bot(n, value), m_datas(nullptr), m_sizes(nullptr) { construct_pyramid(); }
    mr_vector(const mr_vector & in)
        : m_bot(in.m_bot), m_datas(nullptr), m_sizes(nullptr) { construct_pyramid(); }

    mr_vector & operator = (const mr_vector & in) {
        if (this != &in) {
            delete_pyramid();
            m_bot = in.m_bot;
            construct_pyramid();
        }
        return *this;
    }

    ~mr_vector() { delete_pyramid(); }

    // Get Standard Scale Size.
    size_type size() const { return m_bot.size(); }
    // Get Normal Scale Data.
    value_type*       data() { return m_bot.data(); }
    const value_type* data() const { return m_bot.data(); }

    // Get Size at scale @p scale.
    size_type size(size_t scale) const { return m_sizes[scale]; }

    // Get Data at scale @p scale.
    value_type*       data(size_t scale) { return m_datas[scale]; }
    const value_type* data(size_t scale) const { return m_datas[scale]; }

    // Get Standard Element at index @p i.
    value_type& operator[](size_t i) { return m_bot[i]; }
    // Get Constant Standard Element at index @p i.
    const value_type& operator[](size_t i) const { return m_bot[i]; }

    // Get Element at scale @p scale at index @p i.
    value_type&       operator()(size_t scale, size_t i) { return m_datas[scale][i]; }
    // Get Constant Element at scale @p scale at index @p i.
    const value_type& operator()(size_t scale, size_t i) const { return m_datas[scale][i]; }

    void resize(size_t n) {
        bool ch = (n != size());
        if (ch) { delete_pyramid(); }
        m_bot.resize(n);
        if (ch) { construct_pyramid(); }
    }

    void push_back(const _Tp & a) {
        delete_pyramid();
        m_bot.push_back(a);
        construct_pyramid();
    }
    void pop_back() {
        if (size()) { delete_pyramid(); }
        m_bot.pop_back();
        if (size()) { construct_pyramid(); }
    }
    void clear() {
        if (size()) { delete_pyramid(); }
        m_bot.clear();
    }

    /*! Print @p v to @p os. */
    friend std::ostream & operator << (std::ostream & os,
                                       const mr_vector & v)
    {
        for (size_t s = 0; s < v.scale_count(); s++) { // for each cached scale
            os << "scale:" << s << ' ';
            print_each(os, v.m_datas[s], v.m_datas[s]+v.m_sizes[s]);
            os << std::endl;
        }
        return os;
    }

protected:
    size_t scale_count(size_t sz) const { return pnw::binlog(sz)+1; } // one extra for bottom
    size_t scale_count() const { return scale_count(size()); }

    /// Construct Pyramid Bottom-Up.
    void construct_pyramid() {
        if (not m_datas) {      // if no multi-scale pyramid yet
            const size_t snum = scale_count();
            if (snum >= 1) {
                m_datas = new value_type* [snum]; // allocate data pointers
                m_sizes = new size_type [snum];   // allocate lengths

                // first level is just copy
                m_datas[0] = m_bot.data();
                m_sizes[0] = m_bot.size();
            }
            for (size_t s = 1; s < snum; s++) { // for each cached scale
                auto sq = m_sizes[s-1] / 2;     // quotient
                auto sr = m_sizes[s-1] % 2;     // rest
                auto sn = m_sizes[s] = sq+sr;
                m_datas[s] = m_alloc.allocate(sn); // allocate() takes an element count, not bytes
                for (size_t i = 0; i < sq; i++) { // for each dyadic reduction
                    m_datas[s][i] = pnw::arithmetic_mean(m_datas[s-1][2*i+0],
                                                         m_datas[s-1][2*i+1]);
                }
                if (sr) {       // if rest
                    m_datas[s][sq] = m_datas[s-1][2*sq+0] / 2; // extrapolate with zeros
                }
            }
        }
    }

    /// Delete Pyramid.
    void delete_pyramid() {
        if (m_datas) {        // if a pyramid has been built
            const size_t snum = scale_count();
            for (size_t s = 1; s < snum; s++) { // for each scale above the bottom
                m_alloc.deallocate(m_datas[s], m_sizes[s]); // free level, passing its element count
            }
            delete[] m_datas; m_datas = nullptr; // deallocate scale pointers
            delete[] m_sizes; m_sizes = nullptr; // deallocate scale lengths
        }
    }

    /// Reconstruct Pyramid.
    void reconstruct_pyramid(size_t scale = 0) {
        delete_pyramid();
        construct_pyramid();
    }

private:
    std::vector<value_type, _Alloc> m_bot; ///< Bottom Resolutions.
    mutable value_type** m_datas; ///< Pyramid Resolutions Datas (Cache). Slaves under @c m_bot.
    mutable size_type* m_sizes; ///< Pyramid Resolution Lengths. Slaves under @c m_bot.
    _Alloc m_alloc;
};

}

The code for the custom allocator AlignmentAllocator, in allocators.hpp:

/*!
 * @file: allocators.hpp
 * @brief: Custom Allocators.
 * @author: Copyright (C) 2009 Per Nordlöw (per.nordlow@gmail.com)
 * @date: 2009-01-12 16:42
 * @see http://ompf.org/forum/viewtopic.php?f=11&t=686
 * On Windows use @c _aligned_malloc() and @c _aligned_free().
 */

#pragma once

#include <cstdlib>              // @c size_t, @c free()
#include <new>                  // placement @c new
#if defined (__WIN32__) && ! defined (_POSIX_VERSION) // Windows
#  include <malloc.h>           // @c _aligned_malloc()
#elif defined (__GNUC__)        // GNU
#  include <malloc.h>           // @c memalign()
#else                           // Rest
#  include <xmmintrin.h>        // @c _mm_malloc(), @c _mm_free()
#endif

/*!
 * Allocator with Specific @em Alignment.
 */
template <typename _Tp, std::size_t N = 16>
class AlignmentAllocator
{
public:
    typedef _Tp value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;

    typedef _Tp * pointer;
    typedef const _Tp * const_pointer;

    typedef _Tp & reference;
    typedef const _Tp & const_reference;

public:
    inline AlignmentAllocator () throw () { }

    template <typename T2>
    inline AlignmentAllocator (const AlignmentAllocator<T2, N> &) throw () { }

    inline ~AlignmentAllocator () throw () { }

    inline pointer address (reference r) { return &r; }

    inline const_pointer address (const_reference r) const { return &r; }

    inline pointer allocate (size_type n)
    {
#if defined (__WIN32__) && ! defined (_POSIX_VERSION) // Windows
        return (pointer)_aligned_malloc(n*sizeof(value_type), N); // no memalign() on Windows
#elif defined (__GNUC__)        // GNU
        return (pointer)memalign(N, n*sizeof(value_type));
#else  // Rest
        return (pointer)_mm_malloc (n*sizeof(value_type), N);
#endif
    }

    inline void deallocate (pointer p, size_type)
    {
#if defined (__WIN32__) && ! defined (_POSIX_VERSION) // Windows
        _aligned_free(p);       // must pair with _aligned_malloc()
#elif defined (__GNUC__)        // GNU
        free(p);
#else  // Rest
        _mm_free (p);
#endif
    }

    inline void construct (pointer p, const value_type & wert) { new (p) value_type (wert); }

    inline void destroy (pointer p) { p->~value_type (); }

    inline size_type max_size () const throw () { return size_type (-1) / sizeof (value_type);     }

    template <typename T2>
    struct rebind { typedef AlignmentAllocator<T2, N> other; };
};

- Nordlöw

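My understanding is that std::vector<DataType> allocates its space with operator new, and that the space operator new returns is suitably aligned for the given DataType. I'll leave it to the language experts to correct me. - Thomas Matthews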

@Thomas: A vector uses its allocator to allocate memory. The default allocator does indeed do as you say, but you can specify a different one. - jalf

3 Answers

1

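Might C++11's scoped_allocator be your answer?

It lets you pass a stateful allocator down to the elements as well as to the vector itself. Use the same custom allocator for m_bot, m_datas, m_sizes, and value_type.

Or maybe I'm crazy and value_type doesn't need an allocator.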
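To make the suggestion concrete, here is a rough sketch of how `std::scoped_allocator_adaptor` could carry one alignment setting from an outer container down to each level. Everything named here is an assumption for illustration: `RuntimeAligned` is a hypothetical stateful allocator whose alignment is a run-time value, the pyramid is modelled as a vector of vectors instead of the raw `m_datas` array, `float` stands in for `value_type`, and `posix_memalign` limits it to POSIX systems.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <scoped_allocator>
#include <vector>

// Hypothetical stateful allocator: the alignment is stored in the object.
template <typename T>
struct RuntimeAligned {
    typedef T value_type;
    std::size_t alignment;

    RuntimeAligned() : alignment(16) {}
    explicit RuntimeAligned(std::size_t a) : alignment(a) {
        // posix_memalign wants a power of two that is a multiple of sizeof(void*).
        assert(a % sizeof(void*) == 0 && (a & (a - 1)) == 0);
    }
    // Rebinding copy: keeps the alignment when the element type changes.
    template <typename U>
    RuntimeAligned(const RuntimeAligned<U>& o) : alignment(o.alignment) {}

    T* allocate(std::size_t n) {
        void* p = nullptr;
        if (posix_memalign(&p, alignment, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};

template <typename T, typename U>
bool operator==(const RuntimeAligned<T>& a, const RuntimeAligned<U>& b)
{ return a.alignment == b.alignment; }
template <typename T, typename U>
bool operator!=(const RuntimeAligned<T>& a, const RuntimeAligned<U>& b)
{ return !(a == b); }

template <typename T>
using Scoped = std::scoped_allocator_adaptor<RuntimeAligned<T> >;

typedef std::vector<float, Scoped<float> > Level;
typedef std::vector<Level, Scoped<Level> > Pyramid;

// Build zero-filled levels of length n, n/2, n/4, ..., 1.
inline Pyramid make_levels(std::size_t n, std::size_t alignment)
{
    RuntimeAligned<Level> base(alignment);
    Scoped<Level> alloc(base);
    Pyramid levels(alloc);
    for (std::size_t len = n; len > 0; len /= 2)
        levels.emplace_back(len, 0.0f); // the adaptor appends the allocator argument
    return levels;
}
```

With a stateless allocator such as the compile-time `AlignmentAllocator` in the question, the adaptor adds nothing, because the element type already fixes the allocator; it only pays off when the allocator carries state, as `RuntimeAligned` does here.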

- emsr

1

Maybe you should define your own allocator to replace the default one, so that you have full control over your memory layout.
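For reference, a C++11-style allocator needs far fewer members than the pre-C++11 interface. The sketch below is one possible minimal version, not the asker's `AlignmentAllocator`: the name `AlignedAllocator` is hypothetical, and it assumes a POSIX system for `posix_memalign`. Note that `rebind` still has to be written by hand, because `std::allocator_traits` cannot deduce it for a template with a non-type parameter such as `N`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical minimal allocator that returns storage aligned to N bytes.
template <typename T, std::size_t N = 16>
struct AlignedAllocator {
    static_assert(N % sizeof(void*) == 0 && (N & (N - 1)) == 0,
                  "N must be a power of two and a multiple of sizeof(void*)");

    typedef T value_type;

    // Required explicitly because of the non-type template parameter N.
    template <typename U>
    struct rebind { typedef AlignedAllocator<U, N> other; };

    AlignedAllocator() noexcept {}
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, N>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
        void* p = nullptr;
        if (posix_memalign(&p, N, n * sizeof(T)) != 0) throw std::bad_alloc();
        assert(reinterpret_cast<std::uintptr_t>(p) % N == 0); // sanity check
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// Stateless, so any two instances are interchangeable.
template <typename T, typename U, std::size_t N>
bool operator==(const AlignedAllocator<T, N>&, const AlignedAllocator<U, N>&) noexcept
{ return true; }
template <typename T, typename U, std::size_t N>
bool operator!=(const AlignedAllocator<T, N>&, const AlignedAllocator<U, N>&) noexcept
{ return false; }
```

A container such as `std::vector<float, AlignedAllocator<float, 32> >` then gets 32-byte-aligned storage on every reallocation. The compiler still cannot see that guarantee through `data()`, so the alignment has to be asserted separately at the point where the loop runs.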

- yangrenyong


- sqykly · Accepted Answer

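Since you're vectorizing, I assume this is an optimization and that these are large arrays. In that case, why not use VirtualAlloc and get arrays in multiples of 64K that are guaranteed to be aligned on a 64K boundary? For example: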

#include <windows.h>   // VirtualAlloc, VirtualFree

template<class T> T* getBigAlignedArray(unsigned count) {
    return ((T*) VirtualAlloc(NULL, sizeof(T)*count, (MEM_RESERVE | MEM_COMMIT), PAGE_READWRITE));
}
template<class T> void freeBigAlignedArray(T* pThing) {
    VirtualFree((LPVOID) pThing, 0, MEM_RELEASE);
}

That seems more transparent to me.