The heap is a classical data structure that puts a complete binary (or d-ary for the generalized version) tree into a contiguous array, storing the elements in breadth-first traversal order. In this way, all elements from the same level of the tree are stored contiguous one after the other.
I'm implementing a data structure which, under the hood, is a complete balanced tree of fixed degree d, and I want to store the tree in a contiguous form to free the space of node pointers. So I thought of putting the nodes into the breadth-first order used in heaps, but then I'm worried about cache performance of a typical search from the root down to a leaf, since at each level l, I jump over a lot of elements.
Is there a way to obtain compact contiguous representation of a d-ary complete tree, based on depth-first order instead?
This way, the nodes touched during a search of a leaf seem to me more likely to be found closer to each other. The problem then is how to retrieve the index of the parent and children of a node, but also I'm wondering which operations on the tree are in general efficient in this setting.
I'm implementing this thing in C++, in case it matters at all.
A Heap can be a complete binary tree or an Almost complete binary tree.
The DFS needs less memory as it only has to keep track of the nodes in a chain from the top to the bottom, while the BFS has to keep track of all the nodes on the same level. For example, in a (balanced) tree with 1023 nodes the DFS has to keep track of 10 nodes, while the BFS has to keep track of 512 nodes.
A heap is a tree-based data structure in which all the nodes of the tree are in a specific order. For example, if is the parent node of , then the value of follows a specific order with respect to the value of and the same order will be followed across the tree.
Level Order Traversal As BFS suggests, the breadth of the tree takes priority first and then move to depth.
For simplicity, I'm going to limit my discussion to binary trees, but what I say holds true for n-ary trees, too.
The reason heaps (and trees in general) are stored in arrays breadth-first is because it's much easier to add and remove items that way: to grow and shrink the tree. If you're storing depth-first, then either the tree has to be allocated at its maximum expected size, or you have to do a lot of moving items around when you add levels.
But if you know that you're going to have a complete, balanced, n-ary tree, then the choice of BFS or DFS representation is largely a matter of style. There isn't any particular benefit to one over the other in terms of memory performance. In one representation (DFS) you take the cache misses up front, and in the other case (BFS) you take the cache misses at the end.
Consider a binary tree with 20 levels (i.e. 2^20 - 1 items) that contains the numbers from 0 to (2^20 - 1). Each node occupies four bytes (the size of an integer).
With BFS, you incur a cache miss when you get the first block of the tree. But then you have the first four levels of the tree in cache. So your next three queries are guaranteed to be in cache. After that, you're guaranteed to have a cache miss when the node index is greater than 15, because the left child is at x*2 + 1 which will be at least 16 positions (64 bytes) away from the parent.
With DFS, you incur a cache miss when you read the first block of the tree. As long as the number you're searching for is in the left subtree of the current node, you're guaranteed not to get a cache miss for the first 15 levels (i.e. you continually go left). But any branch that goes right will incur a cache miss until you get down to three levels above the leaves. At that point, the entire subtree will fit into the cache and your remaining queries incur no cache misses.
With BFS, the number of cache misses is directly proportional to the number of levels you have to search. With DFS, the number of cache misses is proportional to the path taken through the tree and the number of levels you have to search. But on average, the number of cache misses you incur when searching for an item will be the same for DFS as for BFS.
And the math for computing the node positions is easier for BFS than it is for DFS, especially when you want to find the parent of a particular node.
It would appear that an is_leaf indicator is needed.
Because most everything is related by the level, we need a quick way of finding it which seems to depend on knowing if the node is a leaf or not.
The snippets below also assume the the position of the node relative to the parent is known... it's not pretty and pretty much useless since the whole point is to save space.
int parent_index(int index, int pos){
  if (pos == LEFT){
    return i-1;
  } else {
    return i - pow(2,num_levels - level(i));
  }
}
int left_child_index(int index){
  return i+1;
}
int right_child_index(int index){
  return i+pow(2,num_levels - level(index))
}
To get the level of a node, you could walk the left children until you get to a leaf.
The differences between tree indecies appears to resemble something similar to Pascal's triangle - so that may be helpful too.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With