Encoding and Schema Evolution: Adapting to Change

"No man ever steps in the same river twice, for it's not the same river and he's not the same man.”- Heraclitus

The Need for Evolution

As applications evolve, so do their data formats. Traditional relational databases enforce a single schema, requiring migrations for changes. In contrast, schemaless databases allow mixed data formats over time. When schemas change, application code must adapt—though updates can’t always happen instantly, especially for client-side apps where users control updates.

To keep a system running smoothly, it must support backward compatibility (new code reading old data) and forward compatibility (old code handling data written by new code). Backward compatibility is usually straightforward, since the authors of new code know the old format; forward compatibility is trickier, because old code must ignore parts of the data it does not understand.

Encoding: Bridging Memory and Storage

Data exists in two forms:

  1. In-memory: Efficient structures like lists, trees, and hash tables.
  2. Serialized format: Encoded as byte sequences for storage or transfer.

Serialization (encoding) converts in-memory structures into portable byte sequences, while deserialization (decoding) reverses the process. Many languages offer built-in mechanisms (e.g., Java’s Serializable, Python’s pickle), but these tie you to a single language; standardized formats ensure interoperability.
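
As a quick illustration, here is a minimal sketch using Python’s built-in pickle, one of the language-specific mechanisms just mentioned:

import pickle

# In-memory structure: a dict holding a list, as the application sees it.
record = {"userName": "Martin", "interests": ["daydreaming", "hacking"]}

encoded = pickle.dumps(record)   # serialization: object -> byte sequence
decoded = pickle.loads(encoded)  # deserialization: bytes -> object

assert decoded == record         # round-trips, but only Python can read it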

Text-Based Formats: JSON, XML, and CSV

Widely used and human-readable, these formats come with trade-offs:

  • JSON: Simple and browser-friendly, but loose about numbers: it does not distinguish integers from floats, and large integers lose precision when parsed as doubles (see the sketch after this list).
  • XML: Too verbose, often complex.
  • CSV: No built-in schema, requiring external definitions.
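
A minimal sketch of JSON’s numeric looseness (the "id" field and its value are made up for illustration): Python keeps the integer exact, but any consumer that parses JSON numbers as IEEE 754 doubles, as JavaScript does, silently corrupts it.

import json

big_id = 2**53 + 1                  # e.g. a 64-bit identifier: 9007199254740993

encoded = json.dumps({"id": big_id})
as_double = float(big_id)           # what a double-based parser would see

print(encoded)                      # {"id": 9007199254740993}
print(int(as_double))               # 9007199254740992 -- off by one, silently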

The Shift to Binary Encoding

Text formats are inefficient for large-scale data. Binary alternatives offer compactness and speed:

  • MessagePack, BSON, and Smile are binary encodings of JSON’s data model.
  • Thrift and Protocol Buffers (by Facebook and Google, respectively) introduce schema-based binary encoding, reducing storage size significantly.

MessagePack

Let’s look at an example record encoded with MessagePack, a binary encoding for JSON:

{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}


[Figure: byte-by-byte MessagePack encoding of the example record]

The binary encoding is 66 bytes long, only a little less than the 81 bytes taken by the textual JSON encoding (with whitespace removed).
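
As a quick check, here is a sketch using the msgpack package for Python (a third-party implementation, installed with pip install msgpack; any MessagePack library would do) that reproduces both byte counts:

import json
import msgpack  # pip install msgpack

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json))     # 81 bytes
print(len(as_msgpack))  # 66 bytes
assert msgpack.unpackb(as_msgpack) == record  # round-trips losslessly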

Thrift and Protocol Buffers

Protocol Buffers was originally developed at Google, Thrift at Facebook, and both were open sourced in 2007–08. We will see how they do much better, encoding the same record in as few as 33 bytes.

Both Thrift and Protocol Buffers require a schema for any data that is encoded. To encode the data in Thrift, you would describe the schema in the Thrift interface definition language (IDL) like this:

 

struct Person {
    1: required string userName,
    2: optional i64 favoriteNumber,
    3: optional list<string> interests
}

The equivalent schema definition for Protocol Buffers looks very similar:


message Person {
    required string user_name = 1;
    optional int64 favorite_number = 2;
    repeated string interests = 3;
}

What does data encoded with these schemas look like?

Thrift has two different binary encoding formats, called BinaryProtocol and CompactProtocol. Thrift and Protocol Buffers each come with a code generation tool that takes a schema definition like the ones shown above and produces classes that implement the schema in various programming languages. Your application code can call this generated code to encode or decode records of the schema.
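
For Protocol Buffers, that workflow looks roughly like this. A minimal sketch, assuming the schema above is saved as person.proto (the filename and generated module name are assumptions):

# Generate Python classes from the schema first:
#   protoc --python_out=. person.proto
from person_pb2 import Person  # module name protoc derives from person.proto

person = Person()
person.user_name = "Martin"
person.favorite_number = 1337
person.interests.extend(["daydreaming", "hacking"])

data = person.SerializeToString()  # encode to the compact binary format
decoded = Person.FromString(data)  # decode it back
print(decoded.user_name)           # Martin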

Starting with BinaryProtocol: this format stores field tags (the numbers 1, 2, and 3 from the schema) instead of field names, resulting in a total size of 59 bytes.


[Figure: Thrift BinaryProtocol encoding of the example record, 59 bytes]

The Thrift CompactProtocol encoding is semantically equivalent to BinaryProtocol, but as you can see below, it packs the same information into only 34 bytes.

It does this by packing the field type and tag number into a single byte, and by using variable-length integers. Rather than using a full eight bytes for the number 1337, it is encoded in two bytes, with the top bit of each byte used to indicate whether there are still more bytes to come. This means numbers between –64 and 63 are encoded in one byte, numbers between –8192 and 8191 are encoded in two bytes, etc. Bigger numbers use more bytes.


[Figure: Thrift CompactProtocol encoding of the example record, 34 bytes]
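
Here is that variable-length scheme sketched in Python (a toy re-implementation for illustration, not Thrift’s actual library code): a signed number is first ZigZag-mapped to a small unsigned one, then emitted seven bits at a time with the top bit flagging whether more bytes follow.

def encode_varint(n: int) -> bytes:
    z = (n << 1) ^ (n >> 63)  # ZigZag: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    out = bytearray()
    while True:
        byte = z & 0x7F              # low seven bits of what remains
        z >>= 7
        if z:
            out.append(byte | 0x80)  # top bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

print(len(encode_varint(63)))    # 1 byte  (covers -64..63)
print(len(encode_varint(8191)))  # 2 bytes (covers -8192..8191)
print(len(encode_varint(1337)))  # 2 bytes, as in the example record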

Protocol Buffers

Protocol Buffers, which uses a single binary encoding format, encodes the same data as shown below. While its bit packing differs slightly, it remains quite similar to Thrift’s CompactProtocol, fitting the same record into 33 bytes.

Notably, in the previously shown schemas, fields were marked as either required or optional, but this distinction does not affect how they are encoded—there is no indication in the binary data whether a field was required. Instead, the required designation enforces a runtime check, triggering an error if the field is missing, which can help catch bugs.


[Figure: Protocol Buffers encoding of the example record, 33 bytes]
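
A short sketch of that runtime check, reusing the generated class from earlier (this assumes a proto2 schema, where required is still permitted): serialization fails if the required field was never set.

from google.protobuf.message import EncodeError
from person_pb2 import Person  # hypothetical generated module from earlier

person = Person()
person.favorite_number = 1337  # user_name (required) deliberately left unset

try:
    person.SerializeToString()
except EncodeError as err:
    print("runtime check caught it:", err)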

Thrift’s CompactProtocol and Protocol Buffers thus achieve a 33-34 byte encoding of the same record that textual JSON stores in 81 bytes.

Schema Evolution in Thrift and Protocol Buffers

Schemas inevitably change over time—a process known as schema evolution. Thrift and Protocol Buffers are designed to handle these changes while maintaining both backward and forward compatibility.

  • Adding Fields: New fields can be introduced as long as they have unique tag numbers. Older versions of the code, unaware of these new fields, simply ignore them: the datatype annotation tells the parser how many bytes to skip, ensuring forward compatibility (see the sketch after this list).
  • Removing Fields: Fields can only be removed if they were originally optional. The tag number of a removed field should never be reused, as doing so would lead to incompatibility issues.
  • Changing Data Types: Modifying a field’s datatype carries risks. Expanding a 32-bit integer to 64 bits preserves backward compatibility, since new code can read old data (the parser fills in the missing bits with zeros). Forward compatibility suffers, however: old code reading new data into a 32-bit variable may truncate large values.
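
The skipping behavior from the first bullet can be illustrated with a toy tag/type/value decoder (a deliberately simplified stand-in for the real Thrift and Protocol Buffers wire formats): the type byte tells even old code how many bytes an unknown field occupies.

import struct

# Toy wire format: each field is (tag: 1 byte, type: 1 byte, payload).
# Type 1 = fixed 8-byte integer, type 2 = length-prefixed string.
KNOWN_TAGS = {1: "userName"}  # this "old" reader only knows tag 1

def decode(buf: bytes) -> dict:
    record, pos = {}, 0
    while pos < len(buf):
        tag, ftype = buf[pos], buf[pos + 1]
        pos += 2
        if ftype == 1:
            value, size = struct.unpack_from(">q", buf, pos)[0], 8
        else:
            size = buf[pos] + 1
            value = buf[pos + 1 : pos + size].decode()
        pos += size
        if tag in KNOWN_TAGS:               # known field: keep it
            record[KNOWN_TAGS[tag]] = value
        # unknown tag: its type told us the size, so it was safely skipped
    return record

msg = bytes([1, 2, 6]) + b"Martin" + bytes([2, 1]) + struct.pack(">q", 1337)
print(decode(msg))  # {'userName': 'Martin'} -- the newer field was skipped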

By following these principles, Thrift and Protocol Buffers enable seamless schema evolution while minimizing disruptions.

What’s Next?

In upcoming newsletters, we’ll explore Avro, Parquet, and real-world encoding use cases. Stay tuned for deeper insights into the evolution of data formats!


 
