Binary Tree Serialization: Preserving Structure with Minimal Overhead
In-memory binary trees are cool. Many algorithms run efficiently on them, and working with them opens up good performance improvements. But how do you save a binary tree? How do we serialize one so that we can bring it back into memory later and keep working?
Tree serialization techniques matter when working on database engines, since databases rely heavily on B-Trees and binary trees.
The Problem
Let's look at a simple integer binary tree.
In memory we have a structure to represent the node of a binary tree. Something like this.
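The struct itself appeared as an image in the original post; a minimal sketch of such a node might look like this (the name is assumed):

```cpp
#include <cstdint>

// A node of an integer binary tree: the stored value plus
// raw pointers to the left and right children (null when absent).
struct binary_tree_node {
    std::int32_t value = 0;
    binary_tree_node* left  = nullptr;
    binary_tree_node* right = nullptr;
};
```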
So a single node stores a value plus two pointers to its left and right children. We can make this more generic for all types.
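The generic version, as a sketch, just templates the node over the value type:

```cpp
#include <utility>

// The same node, generalized so it can hold any value type.
template <typename Type>
struct binary_tree_node {
    Type value;
    binary_tree_node* left  = nullptr;
    binary_tree_node* right = nullptr;

    explicit binary_tree_node(Type v) : value(std::move(v)) {}
};
```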
Much better now. There might also be an enclosing class that holds the root node of this tree. Something like this
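A sketch of that enclosing class — it does little more than own the root pointer:

```cpp
// Node definition repeated so the sketch is self-contained.
template <typename Type>
struct binary_tree_node {
    Type value;
    binary_tree_node* left  = nullptr;
    binary_tree_node* right = nullptr;
};

// The enclosing class: holds the root of the tree.
template <typename Type>
class binary_tree {
public:
    binary_tree_node<Type>* root = nullptr;
};
```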
Now how do we go on to store this structure? And also with minimal overhead.
Probable solutions to this
There are many solutions and ideas for serializing and storing a binary tree. Two of the simplest are: storing two traversals of the tree and reconstructing it from them, or storing a single traversal where each node carries a positional code.
Imagine there are thousands of nodes in the tree. With the first solution, not only do we use double the amount of storage for the tree, but regenerating the tree from the traversals is also an expensive operation, for a number of reasons.
By using Huffman encoding for the nodes, we store each node's code before its data. On reading, we read the code first, which tells us where in the tree the node resides. But we still need to store the nodes in some traversal order so that recreating the tree has minimal overhead. One such traversal is the Level Order Traversal.
Level Order Traversal (LOT)
We write the nodes of the binary tree level by level, left to right. If we write the LOT of the above tree, we get -
1 2 3 4 5 6 7 8
Seems very simple, and it is quite simple. We can write one very easily with the help of a queue.
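A sketch of the traversal with a FIFO queue (function and type names assumed): pop a node, visit it, enqueue its children left-to-right; the queue guarantees each level is fully visited before the next begins.

```cpp
#include <queue>
#include <vector>

template <typename Type>
struct binary_tree_node {
    Type value;
    binary_tree_node* left  = nullptr;
    binary_tree_node* right = nullptr;
};

// Level Order Traversal: feed every node, level by level and
// left to right, to the given consumer callback.
template <typename Type, typename Consumer>
void level_order(binary_tree_node<Type>* root, Consumer&& consume) {
    std::queue<binary_tree_node<Type>*> pending;
    if (root) pending.push(root);
    while (!pending.empty()) {
        auto* node = pending.front();
        pending.pop();
        consume(node);
        if (node->left)  pending.push(node->left);
        if (node->right) pending.push(node->right);
    }
}
```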
Our generic template based solution
For our node to be serializable, let's add an interface, plus some SFINAE logic to binary_tree_node so that it can only be instantiated with types that implement the serialization interface.
Our interface has a function, that should return an "iovec" that will have the serialized byte data.
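A sketch of what that interface and the SFINAE guard could look like, assuming `std::enable_if_t` / `std::is_base_of_v` as the mechanism (the `demo_payload` type is hypothetical, just to show the constraint in use):

```cpp
#include <sys/uio.h>     // iovec (POSIX)
#include <type_traits>

// The serialization interface: a type is serializable if it can
// hand back an iovec pointing at its serialized bytes.
struct serializable {
    virtual iovec to_bytes() const = 0;
    virtual ~serializable() = default;
};

// SFINAE guard: binary_tree_node<Type> only instantiates when
// Type derives from (i.e. implements) serializable.
template <typename Type,
          typename = std::enable_if_t<std::is_base_of_v<serializable, Type>>>
struct binary_tree_node {
    Type value;
    binary_tree_node* left  = nullptr;
    binary_tree_node* right = nullptr;
};

// Tiny conforming payload (hypothetical) to exercise the constraint.
struct demo_payload : serializable {
    int v = 0;
    iovec to_bytes() const override {
        return iovec{ const_cast<int*>(&v), sizeof v };
    }
};
```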
Hmm, so now if we want to create a binary tree for some type Type, that class needs to implement this serializable interface and return an iovec.
For now, let's create an integer type that satisfies this condition so that we can build our tree on it.
So our int_node implements a "to_bytes()" to serialize, and "from_bytes()" to deserialize.
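A sketch of such an int_node under the assumptions so far: to_bytes() packs the integer into an internal buffer and returns an iovec over it, and from_bytes() reads it back (the original class also derives from the serializable interface, omitted here to keep the sketch standalone):

```cpp
#include <sys/uio.h>
#include <cstdint>
#include <cstring>

// Integer payload: 4-byte value, serialized via an internal buffer.
struct int_node {
    std::int32_t value = 0;
    mutable std::uint8_t buffer[sizeof(std::int32_t)] = {};

    iovec to_bytes() const {
        std::memcpy(buffer, &value, sizeof value);
        return iovec{ buffer, sizeof buffer };
    }

    void from_bytes(const iovec& bytes) {
        std::memcpy(&value, bytes.iov_base, sizeof value);
    }
};
```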
Huffman Encoding for binary tree nodes
So assume a node "P" in the tree has a Huffman code (HC) of "101", and we add a new node "N" as P's left child. The new node inherits the parent's code with a 0 appended as the new LSB, so N's HC becomes "1010". Had we added it to the right, it would be "1011".
This is the same as multiplying the parent's HC by 2 for the left child, and multiplying by 2 and adding 1 for the right. So the rule is as follows -
If Parent Node's HC = C, then left node's HC = 2C, right node's HC = 2C+1
Assembling the code till now
So the final binary_tree_node class looks something like this (I'm skipping the obvious constructors and destructors)
Very nice. Now if we take a look at the to_bytes() function for binary_tree_node, it handles the data as well as the HC of the node. How the data itself is packed into the byte buffer is left to the user. We write the HC first, then the size of the data, then the data itself. So it becomes something like this
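The per-node wire layout just described can be sketched as a small packing helper (the function name is hypothetical; the original does this inside binary_tree_node::to_bytes()):

```cpp
#include <sys/uio.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Wire layout per node: [ 8-byte HC ][ 8-byte data length ][ data bytes ].
// The payload's own to_bytes() supplies the data part; this helper
// just assembles the record.
std::vector<std::uint8_t> pack_node(std::uint64_t hc, const iovec& data) {
    const std::uint64_t len = data.iov_len;
    std::vector<std::uint8_t> out(sizeof hc + sizeof len + len);
    std::uint8_t* p = out.data();
    std::memcpy(p, &hc, sizeof hc);   p += sizeof hc;
    std::memcpy(p, &len, sizeof len); p += sizeof len;
    std::memcpy(p, data.iov_base, len);
    return out;
}
```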
Serialization in action
To run this, we build our tree with int_node values as we like; I'm creating the one shown in the image above. Remember, I've only handled the happy path here, just to show how it can be done. Negative and corner cases can be handled by you, as I know you can do that :)
Now, to run the serialization, I've written this neat, handy serialize() function inside the binary_tree class. It calls the level_order() function with a serializer consumer that serializes the nodes one by one.
Since we dynamically allocated the buffer for the binary_tree node, we need to deallocate it to avoid leaks.
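Putting the pieces together, here is a sketch of the whole pipeline under the assumptions above: a node carries its payload plus its HC, and serialize() walks the tree in level order, writing [HC | data length | data] for every node. (The original packs each node into a dynamically allocated buffer and frees it after writing; this sketch writes straight to the stream instead, so there is nothing to free.)

```cpp
#include <sys/uio.h>
#include <cstdint>
#include <cstring>
#include <ostream>
#include <queue>
#include <sstream>

// Minimal payload for the sketch: a 4-byte integer.
struct int_payload {
    std::int32_t value = 0;
    mutable std::uint8_t buffer[sizeof(std::int32_t)] = {};
    iovec to_bytes() const {
        std::memcpy(buffer, &value, sizeof value);
        return iovec{ buffer, sizeof buffer };
    }
};

template <typename Type>
struct node {
    Type data;
    std::uint64_t hc = 1;      // root's HC; children get 2C / 2C+1
    node* left  = nullptr;
    node* right = nullptr;
};

// Level-order walk feeding each node to a consumer.
template <typename Type, typename Consumer>
void level_order(node<Type>* root, Consumer&& consume) {
    std::queue<node<Type>*> pending;
    if (root) pending.push(root);
    while (!pending.empty()) {
        auto* n = pending.front();
        pending.pop();
        consume(n);
        if (n->left)  pending.push(n->left);
        if (n->right) pending.push(n->right);
    }
}

// serialize(): write [HC | data length | data] per node, level by level.
template <typename Type>
void serialize(node<Type>* root, std::ostream& os) {
    level_order(root, [&os](node<Type>* n) {
        const iovec io = n->data.to_bytes();
        const std::uint64_t len = io.iov_len;
        os.write(reinterpret_cast<const char*>(&n->hc), sizeof n->hc);
        os.write(reinterpret_cast<const char*>(&len), sizeof len);
        os.write(static_cast<const char*>(io.iov_base),
                 static_cast<std::streamsize>(len));
    });
}
```

Any std::ostream works here: an std::ofstream opened in binary mode produces the on-disk file discussed below, while an std::ostringstream is handy for checking the output in memory.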
Okay, to call the tree's serialize() function, create any std::ostream instance and call serialize on it. Let's do that.
We didn't print anything on stdout, so nothing shows there, but we do see a file created in the current directory. If we hexdump it, we can see the binary contents of the file we just wrote. Let's see that
So it is something, but we still don't know whether it's right or wrong. We can only tell after we deserialize it back into a tree and print the tree.
But look at the size of the file: 160 bytes. We had 8 nodes. Each node has an 8-byte HC plus its data. The int_node data is 8 bytes for the data length plus 4 bytes for the integer, so the data is 12 bytes and each node is 8 (HC) + 12 (data) = 20 bytes. With 8 nodes, the total amount of data written to the file should be 8 * 20 = 160 bytes, and that is exactly what we see in the ls output. So there's some validation for our code. But it's still not enough.
Let's take the deserialization onto the next part.
Godbolt link - https://godbolt.org/z/fnPo9K77v
Let me know what you guys think of this.
C ya!