Why doesn't the memory allocator proactively return freed memory to the OS?


Yes, you may be seeing this code for the third time, because I've already asked two questions related to it (this and this). The code is quite simple:

#include <vector>
int main() {
    std::vector<int> v;
}

Then I build it on Linux and run it under Valgrind:

g++ test.cc && valgrind ./a.out
==8511== Memcheck, a memory error detector
...
==8511== HEAP SUMMARY:
==8511==     in use at exit: 72,704 bytes in 1 blocks
==8511==   total heap usage: 1 allocs, 0 frees, 72,704 bytes allocated
==8511==
==8511== LEAK SUMMARY:
==8511==    definitely lost: 0 bytes in 0 blocks
==8511==    indirectly lost: 0 bytes in 0 blocks
==8511==      possibly lost: 0 bytes in 0 blocks
==8511==    still reachable: 72,704 bytes in 1 blocks
==8511==         suppressed: 0 bytes in 0 blocks
...
==8511== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Here Valgrind reports no memory leak, even though there is 1 alloc and 0 frees.

The answer here points out that the allocator used by the C++ standard library doesn't necessarily return memory to the OS - it may keep it in an internal cache.

My questions are:

1) Why keep it in an internal cache? If it's for speed, how is it faster? Yes, the OS needs to maintain a data structure to track memory allocations, but whoever maintains this cache also needs to do that.

2) How is this implemented? My program a.out has already terminated, and no other process is maintaining this memory cache - or is there?

Edit: regarding question (2) - some answers I've seen mention the "C++ runtime"; what does that mean? If the "C++ runtime" is the C++ library, that library is just a bunch of machine code sitting in a static library (.a), not a running process - that machine code is either linked into my a.out or loaded at run time (as a shared object, .so) and called by a.out.


I think you mean to ask why the allocator doesn't return memory to the OS, but your example doesn't actually do that. Valgrind works at the standard library/runtime level, so it reports things you haven't "returned to the runtime", and as described in my answer below, with an unused vector you can't even test this the way you want. A better test is to make a large allocation (using a vector, or just new char[1234567]), observe that the memory use reported by the OS goes up by a similar amount, then free it and observe that it probably doesn't go back down (or maybe it does, depending!). - BeeOnRope
1 answer


Clarification

First, a point of clarification. You asked: ...my program a.out has already terminated, and no other process is maintaining this memory cache - or is there?

Everything we're discussing happens within the lifetime of a single process: when the process exits, all allocated memory is always returned. No cache survives beyond the life of the process1. Even without any help from the runtime allocator, memory is "reclaimed" by the OS when the process terminates. So a terminated application that used normal allocations cannot cause a system-wide leak.

Now, what Valgrind reports is the memory that was still in use just before the process terminated, but before the OS cleaned everything up. It works at the runtime-library level, not the OS level. So it's saying: "hey, when the program finished, 72,000-odd bytes had not been returned to the runtime", with the unstated implication that "these allocations will shortly be cleaned up by the OS anyway".

The underlying question

The code and the Valgrind output aren't really related to the title question, so let's separate them. First, we'll try to answer the questions about allocators: why they exist, and why they usually don't immediately return freed memory to the OS, ignoring the example.

You asked:

1) Why keep it in an internal cache? If it's for speed, how is it faster? Yes, the OS needs to maintain data structures to track memory allocations, but whoever maintains this cache also needs to do that.

This is really two questions: one is why a user-space runtime allocator should exist at all, and the other is (perhaps?) why these allocators don't immediately return memory to the OS when it is freed. They are related, but let's take them one at a time.

Why a runtime allocator exists

Why not just rely on the OS's memory allocation routines?

  • Many operating systems, including most Linux and other Unix-like systems, simply don't have an OS system call to allocate and free arbitrary blocks of memory. Unix-likes offer brk, which only grows or shrinks one contiguous chunk of memory - you can't "free" an arbitrary earlier allocation. They also offer mmap, which lets you allocate and free chunks of memory independently, but these are allocated at PAGE_SIZE granularity, which is 4096 bytes on Linux. So if you want 32 bytes and don't have your own allocator, you waste 4096 - 32 == 4064 bytes. On these operating systems you practically need a separate memory-allocation runtime that turns these coarse-grained facilities into something that can efficiently allocate small blocks.

    Windows is a bit different. It has the HeapAlloc call, which is part of the "OS" and offers malloc-like capabilities for allocating and freeing arbitrarily sized blocks of memory. With some compilers, malloc is then just implemented as a thin wrapper around HeapAlloc (the performance of this call has improved greatly in recent Windows versions, making this implementation feasible). Still, although HeapAlloc is part of the OS, it isn't implemented in the kernel - it, too, is mostly implemented in a user-mode library, managing lists of free and used blocks and occasionally getting chunks of memory from the kernel. So it is essentially another form of malloc, and any memory it holds on to is likewise unavailable to any other process.

  • Performance! Even if there were suitable kernel-level calls to allocate arbitrary blocks of memory, the simple round-trip overhead of a kernel call is usually hundreds of nanoseconds or more. A well-tuned malloc allocation or free, on the other hand, is often only a dozen instructions and may complete in 10 ns or less. On top of that, system calls can't "trust their inputs", so they must carefully validate parameters passed from user space. In the case of free, that would mean checking that the user passed a valid pointer! Most runtime free implementations simply crash or silently corrupt memory instead, because there is no responsibility to protect the process from itself.
  • A closer link to the rest of the language runtime. The functions you use to allocate memory in C++, namely new, malloc and friends, are part of the language definition. It is then entirely natural to implement them as part of the runtime that implements the rest of the language, rather than in the largely language-agnostic OS. For example, the language may have specific alignment requirements for various objects, which a language-aware allocator can best handle. Changes to the language or compiler may also require changes to the allocation routines, and hoping for a kernel update to accommodate your language features would be a tough sell!

Why memory isn't returned to the OS

Although your example doesn't show it, if you wrote a different test you would likely find that after allocating and then freeing a lot of memory, the resident set size and/or virtual size of the process as reported by the OS may not shrink after the frees. That is, the process appears to hold on to the memory even though you have freed it. This is in fact how many malloc implementations behave. First, note that this is not a leak - the unreturned memory is still available to the process that allocated it, even though it is unavailable to other processes.

Why do they do that? Here are some reasons:

  1. The kernel API makes it hard. For the old-school brk and sbrk system calls, it simply isn't feasible to return freed memory unless it happens to be at the end of very last block allocated from brk or sbrk. That's because the abstraction offered by these calls is a single large contiguous region that you can only extend from one end. You can't hand back memory from the middle of it. Rather than trying to support the unusual case where all the freed memory happens to be at the end of brk region, most allocators don't even bother.

    The mmap call is more flexible (and this discussion generally applies also to Windows where VirtualAlloc is the mmap equivalent), allowing you to at least return memory at a page granularity - but even that is hard! You can't return a page until all allocations that are part of that page are freed. Depending on the size and allocation/free pattern of the application that may be common or uncommon. A case where it works well is for large allocations - greater than a page. Here you're guaranteed to be able to free most of the allocation if it was done via mmap and indeed some modern allocators satisfy large allocations directly from mmap and free them back to the OS with munmap. For glibc (and by extension the C++ allocation operators), you can even control this threshold:

    M_MMAP_THRESHOLD
      For allocations greater than or equal to the limit specified
      (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from
      the free list, the memory-allocation functions employ mmap(2)
      instead of increasing the program break using sbrk(2).
    
      Allocating memory using mmap(2) has the significant advantage
      that the allocated memory blocks can always be independently
      released back to the system.  (By contrast, the heap can be
      trimmed only if memory is freed at the top end.)  On the other
      hand, there are some disadvantages to the use of mmap(2):
      deallocated space is not placed on the free list for reuse by
      later allocations; memory may be wasted because mmap(2)
      allocations must be page-aligned; and the kernel must perform
      the expensive task of zeroing out memory allocated via
      mmap(2).  Balancing these factors leads to a default setting
      of 128*1024 for the M_MMAP_THRESHOLD parameter.
    

    So by default allocations of 128K or more will be allocated by the runtime directly from the OS and freed back to the OS on free. So sometimes you will see the behavior you might have expected is always the case.

  2. Performance! Every kernel call is expensive, as described in the other list above. Memory that is freed by a process will be needed shortly later to satisfy another allocation. Rather than trying to return it to the OS, a relatively heavyweight operation, why not just keep it around on a free list to satisfy future allocations? As pointed out in the man page entry, this also avoids the overhead of zeroing out all the memory returned by the kernel. It also gives the best chance of good cache behavior since the process is continually re-using the same region of the address space. Finally, it avoids TLB flushes which would be imposed by munmap (and possibly by shrinking via brk).
  3. The "problem" of not returning memory is the worst for long-lived processes that allocate a bunch of memory at some point, free it and then never allocate that much again. I.e., processes whose allocation high-water mark is larger than their long term typical allocation amount. Most processes just don't follow that pattern, however. Processes often free a lot of memory, but allocate at a rate such that their overall memory use is constant or perhaps increasing. Applications that do have the "big then small" live size pattern could perhaps force the issue with malloc_trim.
  4. Virtual memory helps mitigate the issue. So far I've been throwing around terms like "allocated memory" without really defining what it means. If a program allocates and then frees 2 GB of memory and then sits around doing nothing, is it wasting 2 GB of actual DRAM plugged into your motherboard somewhere? Probably not. It is using 2 GB of virtual address space in your process, sure, but virtual address space is per-process, so that doesn't directly take anything away from other processes. If the process actually wrote to the memory at some point, it would be allocated physical memory (yes, DRAM) - after freeing it, you are - by definition - no longer using it. At this point the OS may reclaim those physical pages and use them for someone else.

    Now this still requires you have swap to absorb the dirty not-used pages, but some allocators are smart: they can issue a madvise(..., MADV_DONTNEED) call which tells the OS "this range doesn't have anything useful, you don't have to preserve its contents in swap". It still leaves the virtual address space mapped in the process and usable later (zero filled), so it's more efficient than munmap and a subsequent mmap, while still avoiding pointlessly writing freed memory regions out to swap.2

Code for a demonstration

As this answer points out, your test with vector<int> doesn't really test anything, because an empty, unused std::vector<int> v won't even be created as long as you use some minimal level of optimization. Even without optimization, no allocation is likely to occur, because most vector implementations allocate on first insertion rather than in the constructor. Finally, even if you were using some unusual compiler or library that did allocate, it would be a few bytes, not the ~72,000 bytes Valgrind reports.

To actually see the effect of a vector allocation, you'd do something like this:

#include <vector>

volatile std::vector<int> *sink; // prevents the allocation from being optimized away

int main() {
    std::vector<int> v(12345678);
    sink = &v;
}

This causes an actual allocation and deallocation. It won't change the Valgrind output, however, since the vector allocation is properly freed before the program exits, so as far as Valgrind is concerned there is no problem.

At a high level, Valgrind basically categorizes things as "definitely leaked" versus "not freed at exit". The former happens when the program no longer has any reference to a pointer to memory it allocated. It cannot free such memory, so it has leaked it. Memory that hasn't been freed at exit may be a "leak" - i.e., it should have been freed - but it may also just be memory the developer knows will live for the entire lifetime of the program and so doesn't need to be freed explicitly (because of ordering issues with the construction of globals, especially when shared libraries are involved, it can be very hard to reliably free memory associated with global or static objects even if you want to).


1 In some cases certain special allocations can outlive the process, such as shared memory and memory-mapped files, but that has nothing to do with plain C++ allocations and you can ignore it for this discussion.

2 Recent Linux kernels also have the Linux-specific MADV_FREE, which seems to have semantics similar to MADV_DONTNEED.


I know this is a somewhat old post, but it was very helpful for a question I had: https://dev59.com/yrzpa4cB1Zd3GeqPRcex. Is there a way, e.g. a Linux tool/command, to extract information about memory that a process has freed but that the OS hasn't yet reclaimed from it? I'd like to check whether part of a process's "rss" value in `ps` is available for reclamation. Apologies if this is a silly question, but I wanted to ask it here. Thanks. - Francis
@Francis - I answered your question over there. - BeeOnRope
A question many people have, which needed a thorough and clear answer. - SRobertJames

Page content provided by Stack Overflow.