NUMA-aware cache-aligned memory allocation

Question

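On Linux, posix_memalign() can be used to allocate cache-aligned memory and so prevent false sharing. To allocate on a specific NUMA node, we can use the libnuma library. What I want is something that needs both. I bind certain threads to certain processors, and I want to allocate each thread's local data structures from the corresponding NUMA node in order to reduce the latency of that thread's memory operations. How can I do this?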

- Mustafa Zengin

2 Answers

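If you only want alignment on top of a NUMA allocator, you can easily build your own.

The idea is to call the unaligned malloc() with a little extra space, then return the first aligned address inside it. To be able to free the block later, you need to store the base address at a known location.

Here is an example. Just replace the names with whatever is appropriate: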

pint         //  An unsigned integer type large enough to hold a pointer (e.g. uintptr_t).
numa_malloc  //  The NUMA malloc function
numa_free    //  The NUMA free function

void* my_NUMA_malloc(size_t bytes,size_t align, /* NUMA parameters */ ){

    //  The NUMA malloc function
    void *ptr = numa_malloc(
        (size_t)(bytes + align + sizeof(pint)),
        /* NUMA parameters */
    );

    if (ptr == NULL)
        return NULL;

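    //  Get aligned return address (align must be a power of two).
    //  Skip sizeof(pint) bytes, round down to a multiple of align, then add
    //  align: the result is aligned and leaves room for one pint below it.
    pint *ret = (pint*)((((pint)ptr + sizeof(pint)) & ~(pint)(align - 1)) + align);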

    //  Save the free pointer
    ret[-1] = (pint)ptr;

    return ret;
}

void my_NUMA_free(void *ptr){
    if (ptr == NULL)
        return;

    //  Get the free pointer
    ptr = (void*)(((pint*)ptr)[-1]);

    //  The NUMA free function
    numa_free(ptr); 
}

When you use this, you must call my_NUMA_free on anything that was allocated with my_NUMA_malloc.
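One detail the template leaves open: libnuma's own numa_free() takes the allocation size as a second argument, so a wrapper built directly on numa_alloc_onnode() has to remember the total size as well as the base address. Below is a minimal sketch of that variant, assuming a power-of-two alignment. The names aligned_node_alloc, aligned_node_free, raw_alloc and raw_free are made up for illustration, and raw_alloc/raw_free are backed by malloc()/free() here only so the sketch builds without libnuma.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins so the sketch builds without libnuma.
   With libnuma, the bodies would call numa_alloc_onnode(size, node)
   and numa_free(ptr, size) instead. */
static void *raw_alloc(size_t size, int node) { (void)node; return malloc(size); }
static void raw_free(void *ptr, size_t size) { (void)size; free(ptr); }

/* Bookkeeping stored immediately below the aligned address. */
struct aligned_hdr {
    void  *base;   /* pointer returned by raw_alloc() */
    size_t total;  /* size that was passed to raw_alloc() */
};

void *aligned_node_alloc(size_t bytes, size_t align, int node)
{
    /* align must be a nonzero power of two */
    assert(align != 0 && (align & (align - 1)) == 0);

    /* reject requests whose padded size would overflow size_t */
    if (bytes > SIZE_MAX - align - sizeof(struct aligned_hdr))
        return NULL;

    struct aligned_hdr hdr;
    hdr.total = bytes + align + sizeof hdr;
    hdr.base = raw_alloc(hdr.total, node);
    if (hdr.base == NULL)
        return NULL;

    /* skip room for the header, then round up to a multiple of align */
    uintptr_t addr = (uintptr_t)hdr.base + sizeof hdr;
    addr = (addr + align - 1) & ~(uintptr_t)(align - 1);

    /* memcpy avoids a misaligned store when align is smaller than the header's own alignment */
    memcpy((char *)addr - sizeof hdr, &hdr, sizeof hdr);
    return (void *)addr;
}

void aligned_node_free(void *ptr)
{
    if (ptr == NULL)
        return;

    /* recover the base pointer and size saved by aligned_node_alloc() */
    struct aligned_hdr hdr;
    memcpy(&hdr, (char *)ptr - sizeof hdr, sizeof hdr);
    raw_free(hdr.base, hdr.total);
}
```

As with the template, anything obtained from aligned_node_alloc() must be released with aligned_node_free(), never with free() or numa_free() directly.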

- Mysticial

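This answers my implicit question, which was whether I could simply add alignment around a NUMA malloc. Thanks. - Mustafa Zengin

This is a good answer to a problem that doesn't exist. The numa_alloc*() functions already return memory aligned at the page level, which is usually (always?) a multiple of the cache line size. - Rob_before_edits

That only applies to the numa-alloc functions in numa.h, which operate at the OS/page level. In some cases malloc-like wrappers/libraries are built around the numa-alloc functions to make them more efficient. In that case the alignment is no longer page-sized and depends on the heap dynamics. - Mysticial

In my_NUMA_free, I think ptr = ((void **)ptr)[-1] would look nicer. - CplusPuzzle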

- Rob_before_edits · Accepted Answer

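The numa_alloc_*() functions in libnuma allocate whole pages of memory, typically 4096 bytes. Cache lines are typically 64 bytes. Since 4096 is a multiple of 64, anything that comes back from numa_alloc_*() is already aligned at the cache-line level.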
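The arithmetic can be checked on a given machine with a short sketch. It assumes a 64-byte cache line (CACHE_LINE_SIZE is an assumed constant, not queried from the hardware), and a plain anonymous mmap() stands in for numa_alloc_*(), on the assumption that both hand back page-aligned memory. The helper names are made up for illustration.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHE_LINE_SIZE 64  /* assumption: typical line size, not detected */

/* Returns 1 if addr is a multiple of align, 0 otherwise. */
static int is_aligned(const void *addr, size_t align)
{
    assert(align != 0);
    return ((uintptr_t)addr % align) == 0;
}

/* Maps one anonymous page and reports whether its start address is
   aligned to both the page size and the cache line size.
   Returns 1 or 0, or -1 if the page size or the mapping is unavailable. */
int page_start_is_line_aligned(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int ok = is_aligned(p, (size_t)page) && is_aligned(p, CACHE_LINE_SIZE);
    munmap(p, (size_t)page);
    return ok;
}
```

The property relies only on the page size being a multiple of the line size, so it carries over to larger pages as well.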

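Beware the numa_alloc_*() functions, however. The man page says they are slower than the corresponding malloc(), which I'm sure is true, but the much bigger problem I've found is that simultaneous numa_alloc_*() allocations running on many cores at once suffer massive contention. In my case, replacing malloc() with numa_alloc_onnode() was a wash (everything I gained from using local memory was offset by the increased allocation/free time); tcmalloc was faster than either. I was performing thousands of 12-16 kB mallocs on 32 threads/cores at once. Timing experiments showed that it was not the single-threaded speed of numa_alloc_onnode() that made my process spend so much time allocating, which leaves lock/contention issues as the likely cause.

The solution I've adopted is to numa_alloc_onnode() large chunks of memory once, and then hand pieces out to the threads running on each node as needed. I use gcc atomic builtins to let each thread (I pin threads to CPUs) grab from the memory allocated on its node. You can align the pieces to the cache line size as they are handed out if you want: I do. This approach beats even tcmalloc (which is thread-aware but not NUMA-aware, at least the Debian Squeeze version doesn't seem to be). The downside is that you can't free the individual pieces (not without a lot more work, anyway); you can only free the whole underlying on-node allocation. However, if this is temporary on-node scratch space for a function call, or you can otherwise say exactly when the memory is no longer needed, this approach works very well. It obviously helps if you can predict how much memory you need to allocate on each node.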

#include <stdio.h>

#define CACHE_LINE_SIZE 64   // assumed value; the answer does not show its definition

void *my_malloc(struct node_memory *nm,int node,long size)
{
  long off,obytes;

  // round up size to the next multiple of the cache line size
  // (optional, though some rounding is essential to avoid misalignment problems)

  if ((obytes = (size % CACHE_LINE_SIZE)) > 0)
    size += CACHE_LINE_SIZE - obytes;

  // atomically advance the offset for the requested node by size;
  // __sync_fetch_and_add() returns the offset as it was before the add.
  // Note that a failed request still advances the offset.

  if (((off = __sync_fetch_and_add(&(nm->off[node]),size)) + size) > nm->bytes) {
    fprintf(stderr,"Out of allocated memory on node %d\n",node);
    return(NULL);
  }
  else
    return((void *) (nm->ptr[node] + off));

}

where struct node_memory is

struct node_memory {
  long bytes;         // the number of bytes of memory allocated on each node
  char **ptr;         // ptr array of ptrs to the base of the memory on each node
  long *off;          // array of offsets from those bases (in bytes)
  int nptrs;          // the size of the ptr[] and off[] arrays
};

and nm->ptr[node] is set up using the libnuma function numa_alloc_onnode().
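The answer does not show that setup step, so the following is only a sketch of what it might look like. node_memory_init and node_memory_destroy are hypothetical helper names, and the sketch assumes the nodes are numbered 0..nnodes-1. When USE_LIBNUMA (also a made-up macro) is defined it calls numa_alloc_onnode() and numa_free(); otherwise it falls back to malloc()/free() so that it builds without libnuma, in which case the memory is of course not node-local.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef USE_LIBNUMA
#include <numa.h>   /* link with -lnuma; check numa_available() != -1 first */
#define NODE_ALLOC(size, node) numa_alloc_onnode((size), (node))
#define NODE_FREE(ptr, size)   numa_free((ptr), (size))
#else
/* Fallback so the sketch builds without libnuma; NOT node-local. */
#define NODE_ALLOC(size, node) ((void)(node), malloc(size))
#define NODE_FREE(ptr, size)   ((void)(size), free(ptr))
#endif

struct node_memory {
  long bytes;         /* the number of bytes of memory allocated on each node */
  char **ptr;         /* array of ptrs to the base of the memory on each node */
  long *off;          /* array of offsets from those bases (in bytes) */
  int nptrs;          /* the size of the ptr[] and off[] arrays */
};

/* Release every per-node block and the bookkeeping arrays. */
void node_memory_destroy(struct node_memory *nm)
{
  if (nm == NULL)
    return;
  if (nm->ptr != NULL)
    for (int node = 0; node < nm->nptrs; node++)
      NODE_FREE(nm->ptr[node], (size_t)nm->bytes);
  free(nm->ptr);
  free(nm->off);
  nm->ptr = NULL;
  nm->off = NULL;
  nm->nptrs = 0;
}

/* Reserve `bytes` on each of nodes 0..nnodes-1.
   Returns 0 on success, -1 on failure (nothing stays allocated). */
int node_memory_init(struct node_memory *nm, int nnodes, long bytes)
{
  if (nm == NULL || nnodes <= 0 || bytes <= 0)
    return -1;

  nm->bytes = bytes;
  nm->nptrs = 0;      /* counts only the blocks allocated so far */
  nm->ptr = calloc((size_t)nnodes, sizeof *nm->ptr);
  nm->off = calloc((size_t)nnodes, sizeof *nm->off);   /* offsets start at 0 */
  if (nm->ptr == NULL || nm->off == NULL)
    goto fail;

  for (int node = 0; node < nnodes; node++) {
    nm->ptr[node] = NODE_ALLOC((size_t)bytes, node);
    if (nm->ptr[node] == NULL)
      goto fail;
    nm->nptrs = node + 1;
  }
  assert(nm->nptrs == nnodes);
  return 0;

fail:
  node_memory_destroy(nm);
  return -1;
}
```

Freeing the whole structure at once matches the trade-off described in the answer: individual pieces cannot be returned, only the per-node blocks.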

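I usually store the allowable node information in the structure too, so my_malloc() can check that a node request is sensible without making a function call. I also check that nm exists and that size is sensible. The function __sync_fetch_and_add() is a gcc builtin atomic function; if you are not compiling with gcc you will need something else. I use atomics because, in my limited experience, they are much faster than mutexes under high thread/core counts (as on 4P NUMA machines).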
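For compilers without the gcc builtins, C11's <stdatomic.h> offers atomic_fetch_add() as one possible "something else". The sketch below is an assumption-laden variant rather than the author's code: struct node_arena and arena_alloc are hypothetical names, CACHE_LINE_SIZE is assumed to be 64, and the argument checks are one guess at what "nm exists" and "size is sensible" might mean.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* assumption: typical line size */

/* Hypothetical C11 counterpart of struct node_memory. */
struct node_arena {
  long bytes;         /* capacity of each per-node block */
  char **ptr;         /* base address of each per-node block */
  atomic_long *off;   /* next free offset within each block */
  int nptrs;          /* number of entries in ptr[] and off[] */
};

/* Hand out `size` bytes from the block belonging to `node`,
   rounded up to a whole number of cache lines.
   Returns NULL on a bad argument or when the block is exhausted. */
void *arena_alloc(struct node_arena *na, int node, long size)
{
  /* sanity checks, done without any function call */
  if (na == NULL || node < 0 || node >= na->nptrs)
    return NULL;
  if (size <= 0 || size > na->bytes)
    return NULL;

  /* round up to the next multiple of the cache line size */
  long rem = size % CACHE_LINE_SIZE;
  if (rem > 0)
    size += CACHE_LINE_SIZE - rem;

  /* atomic_fetch_add() returns the offset as it was before the add;
     as in the gcc version, a failed request still advances it */
  long off = atomic_fetch_add(&na->off[node], size);
  if (off + size > na->bytes)
    return NULL;

  return na->ptr[node] + off;
}
```

The returned addresses are cache-line aligned only if each base pointer is, which holds for page-aligned blocks such as those from numa_alloc_onnode().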