C++高性能FIFO的容器

Question

C++高性能FIFO的容器

3

我需要优化一些遗留代码，但对C++还不是很熟悉。

该代码使用两个线程进行网络数据包处理，一个线程将数据包推送到FIFO队列[topupBuffer]，另一个线程从队列中读取数据并通过IP套接字发送出去[writeToIPOutput]。遗留代码使用std::deque实现FIFO队列。

然而，运行程序会使用大量CPU资源，高达50%（应该是5%左右）。运行gprof似乎揭示了std::deque的问题。（我不确定自己是否正确地解释了分析结果，所以需要帮助）

分析输出的摘录： topupBuffer层次结构：

index % time    self  children    called     name
                0.65    2.51       1/1           DvIPFilePlayback::topupBufferThreadMethod(void*) [2]
[1]     60.5    0.65    2.51       1         DvIPFilePlayback::topupBuffer() [1]
                0.27    1.15 4025575/4025575     DvIPPlaybackBC::bufferizeTsPackets(TPlaybackBuffer&, int&, int&) [5]
                0.03    0.56 4026668/4026668     std::deque<TTsPacket, std::allocator<TTsPacket> >::push_back(TTsPacket const&) [6]
                0.03    0.15 4046539/5749754     std::deque<TPlaybackBuffer, std::allocator<TPlaybackBuffer> >::size() const [17]

并且

[5]     27.2    0.27    1.15 4025575         DvIPPlaybackBC::bufferizeTsPackets(TPlaybackBuffer&, int&, int&) [5]
                0.04    0.30 4031674/4031674     std::deque<TTsPacket, std::allocator<TTsPacket> >::pop_front() [11]
                0.03    0.30 8058004/8058004     std::deque<TTsPacket, std::allocator<TTsPacket> >::size() const [12]
                0.01    0.19  576183/576183      DvPlaybackBC::insertToPlaybackBuffer(TPlaybackBuffer const&) [22]
                0.04    0.11 4029401/4029401     std::deque<TTsPacket, std::allocator<TTsPacket> >::front() [25]

写入IPOutput层次结构

[3]     36.8    0.92    1.00       1         DvIPPlaybackBC::writeToIPOutput() [3]
                0.31    0.00 1129444/1129444     TPlaybackBuffer::operator=(TPlaybackBuffer const&) [13]
                0.01    0.18  579235/1155128     std::deque<TPlaybackBuffer, std::allocator<TPlaybackBuffer> >::push_back(TPlaybackBuffer const&) [8]
                0.03    0.10 1135318/1135318     std::deque<TPlaybackBuffer, std::allocator<TPlaybackBuffer> >::pop_front() [27]

我猜测writeToIPOutput花费了太多的时间在分配上。我可以处理这个问题。但是topupBuffer的时间是在std::deque中度过的。

这是否是对性能分析输出的正确解释？

如果是的话，那么使用不同的容器会更高效，如果是的话，使用哪种容器？

谢谢

编辑I：调用树结尾的解释说明如下：

% time  This is the percentage of the `total' time that was spent
        in this function and its children.  Note that due to
        different viewpoints, functions excluded by options, etc,
        these numbers will NOT add up to 100%.

self    This is the total amount of time spent in this function.

children    This is the total amount of time propagated into this
        function by its children.

所以看看bufferizeTsPackets，其中1.15在它的子项中消耗，其中0.30 + 0.30 + 0.11 = 0.71 在不同的队列方法(push_back, size等)中消耗。对吧？所以0.71超过了子项中总时间(1.15)的一半（??）

- Danny

你尝试过使用我们心爱的朋友std::vector吗？对于FIFO，虽然平凡，但你可能想考虑将一个向量作为底层容器的std::queue。 - WhiZTiM

我想知道 boost::lockfree::queue 是否更为适用。 - NathanOliver

尝试让deque中的每个项目都成为工作项的短向量，这样您就不必对共享数据结构进行太多操作。 - Ben Voigt

2

我不确定你的理解是否正确：std::deque 的 self+children 数字并不在顶部，大部分时间都花费在分配器上。你应该能够通过优化你的代码部分来获得更多的改进，直到 std::deque 开始排在列表的顶部。 - Sergey Kalinichenko

1

看起来你的函数是自己忙碌的，deque与你的问题无关；因为deque命令的ms/call数量很低，但是你的函数却很高。 - UKMonkey

显示剩余2条评论

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Thomas Matthews · Accepted Answer

更高效的结构是使用数组实现循环队列（环形缓冲区）。由于数组大小固定，你要么使数组足够大以避免数据溢出，要么只存储最后N个值，其中N是缓冲区的容量。许多嵌入式系统使用数组来减少动态内存分配引起的内存碎片问题。如果你的数组足够小，则可以将其放入处理器的数据缓存中，从而加速计算。