Boost无锁单生产者单消费者队列和缓存内存访问

Question

Boost无锁单生产者单消费者队列和缓存内存访问

memoryboostcpu-architecturelock-freecpu-cache

4

我目前的多线程项目需要非常关注速度/延迟问题。

缓存访问是我正在尝试更好地理解的内容。我不清楚无锁队列（例如boost :: lockfree :: spsc_queue）在缓存级别上如何访问/使用内存。

我已经看到过使用队列，其中将需要由消费者核心操作的大型对象的指针推入队列。

如果消费者核心从队列弹出一个元素，我认为这意味着该元素（在这种情况下是指针）已经加载到消费者核心的L2和L1高速缓存中。但要访问该元素，它不需要通过查找并从L3高速缓存或跨互连（如果其他线程位于不同的CPU插槽上）加载元素本身的指针来访问吗？如果是这样，可能直接发送可以被消费者处理的对象的副本会更好吗？

谢谢。

- PaulD

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- sehe · Accepted Answer

C++主要是一个“按需付费”的生态系统。任何常规队列都可以让您选择存储语义（按值或按引用）。然而，这次您订购了一种特殊的无锁队列。为了无锁，它必须能够将所有可观察的修改操作执行为原子操作。这自然限制了可以直接在这些操作中使用的类型。您可能会怀疑是否可能有一个超过系统本机寄存器大小（例如int64_t）的值类型。好问题。

进入环形缓冲区。事实上，任何基于节点的容器只需要对所有修改操作进行指针交换，这在所有现代架构上都是微不足道的原子操作。但是，如果涉及复制多个不同的内存区域，并且顺序非原子性，则真的会带来无法解决的问题吗？不。想象一下POD数据项的平坦数组。现在，如果您将数组视为循环缓冲区，则只需要以原子方式维护缓冲区前端和后端位置的索引。容器可以在内部更新“脏前索引”，同时在外部前面进行复制。（复制可以使用放松的内存排序）。只有在整个复制已知已完成时，才会更新外部前索引。此更新需要在acq_rel / cst内存顺序中进行。只要容器能够保护不变式，即前端永远不会完全绕过并到达后端，这是一笔甜蜜的交易。我认为这个想法在Disruptor Library（LMAX的著名库）中得到了推广。您可以从以下方面获得机械共振：1.在读/写时进行线性内存访问模式2.如果您可以使记录大小与物理缓存行对齐，则更好3.除非POD包含记录外的原始引用，否则所有数据都是本地的。

那么Boost的spsc_queue实际上是如何做到这一点的呢？

Yes, spqc_queue stores the raw element values in a contiguous aligned block of memory: (e.g. from compile_time_sized_ringbuffer which underlies spsc_queue with statically supplied maximum capacity:)

typedef typename boost::aligned_storage<max_size * sizeof(T),
                                        boost::alignment_of<T>::value
                                       >::type storage_type;

storage_type storage_;

T * data()
{
    return static_cast<T*>(storage_.address());
}

(The element type T need not even be POD, but it needs to be both default-constructible and copyable).

Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):

static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;

In fact, as you can see, there are only the "internal" index on either read or write side. This is possible because there's only one writing thread and also only one reading thread, which means that there could only be more space at the end of write operation than anticipated.
Several other optimizations are present:
- branch prediction hints for platforms that support it (unlikely())
- it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
- use of std::unitialized_copy where possible
- The calling of trivial constructors/destructors will be optimized out at instantiation time
- the unitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)

总的来说，我们看到了一个最佳实践，可以用于环形缓冲区。

使用什么：

Boost 给你提供了所有选择。你可以选择将元素类型作为指向消息类型的指针。然而，正如你在问题中已经提出的那样，这种间接级别会降低引用的局部性，并且可能不是最优的。

另一方面，如果复制代价昂贵，则将完整的消息类型存储在元素类型中可能变得昂贵。至少要尝试使元素类型适合缓存行（通常在英特尔上为 64 字节）中。

因此，在实践中，您可能需要考虑将频繁使用的数据存储在值中，并使用指针引用较少使用的数据（除非遍历指针的成本很低）。

如果您需要“附着模型”，请考虑使用自定义分配器来引用数据，以便您也可以在那里实现内存访问模式。

让您的分析器指导您。

注释：

1. 我认为对于 spsc acq_rel 应该可以工作，但我有点生疏细节。原则上，我不写无锁代码。我建议其他人也效仿我的做法 :)