在着色器中遍历包围盒层次结构

Question

在着色器中遍历包围盒层次结构

glslcompiler-optimizationvulkanraytracingbounding-box

9

我正在使用Vulkan计算着色器开发路径跟踪器。我实现了一棵代表边界体层次结构的树。 BVH的思想是将光线相交测试需要执行的对象数量最小化。

#1天真的实现

我的第一个实现非常快，它遍历整个树直到BVH树的一个叶子节点。但是，射线可能会相交多个叶子节点。该代码会导致某些三角形未被渲染（尽管它们应该被渲染）。

int box_index = -1;

for (int i = 0; i < boxes_count; i++) {
    // the first box has no parent, boxes[0].parent is set to -1
    if (boxes[i].parent == box_index) {
        if (intersect_box(boxes[i], ray)) {
            box_index = i;
        }
    }
}

if (box_index > -1) {
    uint a = boxes[box_index].ids_offset;
    uint b = a + boxes[box_index].ids_count;

    for (uint j = a; j < b; j++) {
        uint triangle_id = triangle_references[j];
        // triangle intersection code ...
    }
}

#2 多叶片实现

我的第二个实现考虑到了可能有多个叶片相交的情况。然而，这个实现比实现 #1 慢36倍（好吧，在 #1 中我错过了一些相交测试，但仍然……）。

bool[boxes.length()] hits;
hits[0] = intersect_box(boxes[0], ray);

for (int i = 1; i < boxes_count; i++) {
    if (hits[boxes[i].parent]) {
        hits[i] = intersect_box(boxes[i], ray);
    } else {
        hits[i] = false;
    }
}

for (int i = 0; i < boxes_count; i++) {
    if (!hits[i]) {
        continue;
    }

    // only leaves have ids_offset and ids_count defined (not set to -1)
    if (boxes[i].ids_offset < 0) {
        continue;
    }

    uint a = boxes[i].ids_offset;
    uint b = a + boxes[i].ids_count;

    for (uint j = a; j < b; j++) {
        uint triangle_id = triangle_references[j];
        // triangle intersection code ...
    }
}

这种性能差异让我很疯狂。似乎只有像if(dynamically_modified_array[some_index])这样的单个语句对性能有巨大影响。我怀疑SPIR-V或GPU编译器不再能够发挥其优化魔力？因此，这是我的问题：

这确实是一个优化问题吗？
如果是，我可以将实现#2转换为更好的可优化版本吗？我是否可以给出优化提示？
在着色器中有一种标准方法来实现BVH树查询吗？

- jns

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- jns · Accepted Answer

经过一番探索，我找到了解决方案。重要的是要理解BVH树不排除需要评估所有叶子节点的可能性。

下面的方案＃3使用命中和错过链接。需要按照最坏情况进行排序，以正确顺序查询所有框（因此只需要单个循环）。然而，链接用于跳过不需要评估的节点。当当前节点是叶子节点时，执行实际的三角形相交。

命中链接~在命中的情况下要跳转到哪个节点（下面的绿色）
错过链接~在错过的情况下要跳转到哪个节点（下面的红色）

图片来源于这里。相关论文和源代码也在Toshiya Hachisuka教授的页面上。同样的概念也在幻灯片引用的这篇论文中描述。

#3 BVH树命中和错过链接

我必须将数据扩展到着色器中推送的链接。还需要一些离线操作来正确存储树。起初，我尝试使用while循环（循环直到box_index_next为-1），但再次导致了疯狂的减速。无论如何，以下工作速度相当快：

int box_index_next = 0;

for (int box_index = 0; box_index < boxes_count; box_index++) {
    if (box_index != box_index_next) {
        continue;
    }

    bool hit = intersect_box(boxes[box_index], ray);
    bool leaf = boxes[box_index].ids_count > 0;

    if (hit) {
        box_index_next = boxes[box_index].links.x; // hit link
    } else {
        box_index_next = boxes[box_index].links.y; // miss link
    }

    if (hit && leaf) {
        uint a = boxes[box_index].ids_offset;
        uint b = a + boxes[box_index].ids_count;

        for (uint j = a; j < b; j++) {
            uint triangle_id = triangle_references[j];
            // triangle intersection code ...
        }
    }
}

这段代码比快速但有缺陷的实现 #1 慢了大约 3 倍。这在一定程度上是可以预料的，因为现在速度取决于实际树的情况，而不是 GPU 优化。例如，考虑一个三角形沿着一个轴对齐的退化情况：朝着同一方向的光线可能与所有三角形相交，那么就需要评估所有树叶。

教授 Toshiya Hachisuka 在他的幻灯片中为这种情况提出了进一步优化（第 36 页及以后）：可以存储多个沿着 x、-x、y、-y、z 和 -z 空间排序的 BVH 树版本。在遍历时，需要基于光线选择正确的版本。然后可以在遇到来自叶子节点的三角形时立即停止遍历，因为所有剩余待访问的节点都将在此节点之后（从光线的角度看）。

一旦构建了 BVH 树，查找链接就非常简单（以下是一些 Python 代码）：

class NodeAABB(object):

    def __init__(self, obj_bounds, obj_ids):
        self.children = [None, None]
        self.obj_bounds = obj_bounds
        self.obj_ids = obj_ids

    def split(self):
        # split recursively and create children here
        raise NotImplementedError()

    def is_leaf(self):
        return set(self.children) == {None}

    def build_links(self, next_right_node=None):
        if not self.is_leaf():
            child1, child2 = self.children

            self.hit_node = child1
            self.miss_node = next_right_node

            child1.build_links(next_right_node=child2)
            child2.build_links(next_right_node=next_right_node)

        else:
            self.hit_node = next_right_node
            self.miss_node = self.hit_node

    def collect(self):
        # retrieve in depth first fashion for correct order
        yield self
        if not self.is_leaf():
            child1, child2 = self.children
            yield from child1.collect()
            yield from child2.collect()

在将所有AABB存储在数组中后（该数组将被发送到GPU），您可以使用hit_node和miss_node查找链接的索引并存储它们。