I am working on a path tracer using vulkan compute shaders. I implemented a tree representing a bounding volume hierachy. The idea of the BVH is to minimize the amount of objects a ray intersection test needs to be performed on.
#1 Naive Implementation
My first implementation is very fast, it traverses the tree down to a single leaf of the BVH tree. However, the ray might intersect multiple leaves. This code then leads to some triangles not being rendered (although they should).
int box_index = -1;
for (int i = 0; i < boxes_count; i++) {
    // the first box has no parent, boxes[0].parent is set to -1
    if (boxes[i].parent == box_index) {
        if (intersect_box(boxes[i], ray)) {
            box_index = i;
        }
    }
}
if (box_index > -1) {
    uint a = boxes[box_index].ids_offset;
    uint b = a + boxes[box_index].ids_count;
    for (uint j = a; j < b; j++) {
        uint triangle_id = triangle_references[j];
        // triangle intersection code ...
    }
}
#2 Multi-Leaf Implementation
My second implementation accounts for the fact that multiple leaves might be intersected. However, this implementation is 36x slower than implementation #1 (okay, I miss some intersection tests in #1, but still...).
bool[boxes.length()] hits;
hits[0] = intersect_box(boxes[0], ray);
for (int i = 1; i < boxes_count; i++) {
    if (hits[boxes[i].parent]) {
        hits[i] = intersect_box(boxes[i], ray);
    } else {
        hits[i] = false;
    }
}
for (int i = 0; i < boxes_count; i++) {
    if (!hits[i]) {
        continue;
    }
    // only leaves have ids_offset and ids_count defined (not set to -1)
    if (boxes[i].ids_offset < 0) {
        continue;
    }
    uint a = boxes[i].ids_offset;
    uint b = a + boxes[i].ids_count;
    for (uint j = a; j < b; j++) {
        uint triangle_id = triangle_references[j];
        // triangle intersection code ...
    }
}
This performance difference drives me crazy. It seems only having a single statement like if(dynamically_modified_array[some_index]) has a huge impact on performance. I suspect that the SPIR-V or GPU compiler is no longer able to do its optimization magic? So here are my questions:
Is this indeed an optimization problem?
If yes, can I transform implementation #2 to be better optimizable? Can I somehow give optimization hints?
Is there a standard way to implement BVH tree queries in shaders?
A bounding volume hierarchy (BVH) is a tree structure on a set of geometric objects. All geometric objects, that form the leaf nodes of the tree, are wrapped in bounding volumes. These nodes are then grouped as small sets and enclosed within larger bounding volumes.
Bounding Volume Hierarchy (BVH) is a popular ray tracing acceleration technique that uses a tree-based “acceleration structure” that contains multiple hierarchically-arranged bounding boxes (bounding volumes) that encompass or surround different amounts of scene geometry or primitives.
After some digging, I found a solution. Important to understand is that the BVH tree does not exclude the possibility that one needs to evaluate all leaves.
Implementation #3 below, uses hit and miss links. The boxes need to be sorted in a way that in the worst case all of them are queried in the correct order (so a single loop is enough). However, links are used to skip nodes which don't need to be evaluated. When the current node is a leaf node, the actual triangle intersections are performed.

Image taken from here. The associated paper and source code is also on Prof. Toshiya Hachisuka's page. The same concept is also described in this paper referenced in the slides.
#3 BVH Tree with Hit and Miss Links
I had to extend the data which is pushed to the shader with the links. Also some offline fiddling was required to store the tree correctly. At first I tried using a while loop (loop until box_index_next is -1) which resulted in a crazy slowdown again. Anyway, the following works reasonably fast:
int box_index_next = 0;
for (int box_index = 0; box_index < boxes_count; box_index++) {
    if (box_index != box_index_next) {
        continue;
    }
    bool hit = intersect_box(boxes[box_index], ray);
    bool leaf = boxes[box_index].ids_count > 0;
    if (hit) {
        box_index_next = boxes[box_index].links.x; // hit link
    } else {
        box_index_next = boxes[box_index].links.y; // miss link
    }
    if (hit && leaf) {
        uint a = boxes[box_index].ids_offset;
        uint b = a + boxes[box_index].ids_count;
        for (uint j = a; j < b; j++) {
            uint triangle_id = triangle_references[j];
            // triangle intersection code ...
        }
    }
}
This code is about 3x slower than the fast, but flawed implementation #1. This is somewhat expected, now the speed depends on the actual tree, not on the gpu optimization. Consider, for example, a degenerate case where triangles are aligned along an axis: a ray in the same direction might intersect with all triangles, then all tree leaves need to be evaluated.
Prof. Toshiya Hachisuka proposes a further optimization for such cases in his sildes (page 36 and onward): One stores multiple versions of the BVH tree, spatially sorted along x, -x, y, -y, z and -z. For traversal the correct version needs to be selected based on the ray. Then one can stop the traversal as soon as a triangle from a leaf is intersected, since all remaining nodes to be visited will be spatially behind this node (from the ray point of view).
Once the BVH tree is built, finding the links is quite straightforward (some python code below):
class NodeAABB(object):
    def __init__(self, obj_bounds, obj_ids):
        self.children = [None, None]
        self.obj_bounds = obj_bounds
        self.obj_ids = obj_ids
    def split(self):
        # split recursively and create children here
        raise NotImplementedError()
    def is_leaf(self):
        return set(self.children) == {None}
    def build_links(self, next_right_node=None):
        if not self.is_leaf():
            child1, child2 = self.children
            self.hit_node = child1
            self.miss_node = next_right_node
            child1.build_links(next_right_node=child2)
            child2.build_links(next_right_node=next_right_node)
        else:
            self.hit_node = next_right_node
            self.miss_node = self.hit_node
    def collect(self):
        # retrieve in depth first fashion for correct order
        yield self
        if not self.is_leaf():
            child1, child2 = self.children
            yield from child1.collect()
            yield from child2.collect()
After you store all AABBs in an array (which will be sent to the GPU) you can use hit_node and miss_node to look up the indices for the links and store them as well.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With