Binary Tree Serialization: Preserving Structure with Minimal Overhead
Credit - Myself

Binary Tree Serialization: Preserving Structure with Minimal Overhead

In memory binary trees are cool. Working with them are efficient in algorithms and you can make good performance improvements working with them. But how can you save a binary tree? How to serialize a binary tree so that we can bring them back in memory to work?

When working on database engines, we need tree serialization techniques as databases use B-Tree, binary tree heavily.

The Problem

Lets see a simple integer binary tree

Article content

In memory we have a structure to represent the node of a binary tree. Something like this.

Article content

So for a single node, we have a value that is stored for that node, and two pointers for the corresponding left and right children of this node. We can make this more generic for all types.

Article content
Added template parameter for generic binary tree node for storing any value

Much better now. Now there might be a enclosing class, that holds the root node of this tree. Something like this

Article content
A parent class holding the root node of this tree.

Now how do we go on to store this structure? And also with minimal overhead.

Probable solutions to this

There are many solutions and ideas by which we can serialize and store a binary tree. But some of the simplest are -

  • By storing two traversals - inorder, preorder or postorder.
  • By storing the Huffman Code for the node along with the node.

Imagine there are thousands of nodes in the tree. By the first solution not only we use double the amount of storage for the tree, but also to re-generate the tree using traversals is a expensive operation for a number of reasons.

By using Huffman encoding for the nodes, we store the code before the node data. By doing this we first read the code, then we know that where in the tree our node resides. But for this also we need to store the tree in some traversal type so that recreating the tree uses minimal overhead. One such traversal is the Level Order Traversal.

Level Order Traversal (LOT)

It says that we write the nodes of the binary tree, level by level, left to right. If we write the LOT of the above tree, we get -

1 2 3 4 5 6 7 8

Seems very simple, and it is quite simple. We can write one very easily with the help of queues.

Article content
LOT algorithm for binary tree

Our generic template based solution

For our node to be serializable, lets add an interface, and add some SFINAE logic to our "binary_tree_node" so that it can be only created with types that implement the serialization interface.

Our interface has a function, that should return an "iovec" that will have the serialized byte data.

Article content

Hmm, so now if we want to create a binary tree for type "Type", so that class needs to implement this serializable interface that returns an iovec.

For now lets create an integer type that will satisfy this condition so that we can create our tree on that.

Article content

So our int_node implements a "to_bytes()" to serialize, and "from_bytes()" to deserialize.

Huffman Encoding for binary tree nodes

So assuming we have some node in the tree "P" having a huffman code (HC) of "101", and we are trying to add a new node "N" to the left of "P". When we add to the left, the new node inherits the parent's code and adding 0 to LSB. So HC of "N" becomes "1010". If we would have added to the right, it would be "1011".

This is the same as if we multiply the parent's HC by 2 for left and multiply by 2 + 1 for right. So the algorithm is as follows-

If Parent Node's HC = C, then left node's HC = 2C, right node's HC = 2C+1

Assembling the code till now

So the final binary_tree_node class looks something like this (I'm skipping the obvious constructors and destructors)

Article content
Article content

Very nice. Now we if take a look at the "to_bytes()" function for the binary_tree_node, we handle the data as well as the HC of the node. Now the data is left to the user to decide how the data should be packed in the byte buffer. We write the HC first, then the size of the data then the data. So it becomes something like this

Article content

Serialization in action

To run this, we create our tree with int_node as we like. I'm just creating it as it was mentioned above in the image. Remember I've just handled the good case scenario just to depict how it can be done. Negative and corner cases can be handled by you as I know you can do that :)

Article content

Now to run the serialization I've written this neat handy serialize() function inside the binary_tree class that will call the level_order() function with a serializer consumer that will serialize the nodes one by one.

Article content

As we have dynamically allocated the buffer for the binary_tree node, we need to deallocate it to avoid leaks.

Okay to call this serialize() function for the tree, create any std::ostream type instance and call the serialize on that. Let's do that.

Article content

We didn't print anything on stdout so we do not see anything, but we see a file created in the current directory. So if we hexdump it, we can see the binary contents of the file we just wrote. Let's see that

Article content

So it is something, we still do not know if its right or wrong. We can tell that after we deserialize this to make the tree, and print the tree.

But looking at the size of the file 160 bytes. We had 8 nodes. Each node has a HC of 8 bytes + data. Size of int_node data is 8 bytes for data length + 4 bytes for integer data. So size of data == 12 bytes. So size of node = 8 (HC) + 12 (Data) == 20 bytes. And we have 8 nodes. So total amount of data written onto the file should be 8 * 20 = 160 bytes. And that is what we see on the ls output. So there's some validation for our code. But it's still not enough.

Let's take the deserialization onto the next part.

Godbolt link - https://godbolt.org/z/fnPo9K77v

Let me know what you guys think of this.

C ya!



To view or add a comment, sign in

More articles by Shantanu Banerjee

Others also viewed

Explore content categories