Encoding and Schema Evolution: Adapting to Change

"No man ever steps in the same river twice, for it's not the same river and he's not the same man.”- Heraclitus

The Need for Evolution

As applications evolve, so do their data formats. Traditional relational databases enforce a single schema, requiring migrations for changes. In contrast, schemaless databases allow mixed data formats over time. When schemas change, application code must adapt—though updates can’t always happen instantly, especially for client-side apps where users control updates.

To keep a system running smoothly, it must support backward compatibility (new code reading old data) and forward compatibility (old code handling data written by new code). Backward compatibility is usually straightforward, since the authors of new code know the old format; forward compatibility is trickier, because old code must ignore parts of the data it does not understand.

Encoding: Bridging Memory and Storage

Data exists in two forms:

  1. In-memory: Efficient structures like lists, trees, and hash tables.
  2. Serialized format: Encoded as byte sequences for storage or transfer.

Serialization (encoding) converts in-memory structures into portable byte sequences, while deserialization (decoding) reverses the process. Many languages offer built-in mechanisms (e.g., Java’s Serializable, Python’s pickle), but these tie you to a single language; standardized formats ensure interoperability.
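
As a quick illustration, here is a minimal sketch using Python’s built-in pickle, one of the language-specific mechanisms just mentioned:

import pickle

# In-memory structure: a dict holding a list, as the application sees it.
record = {"userName": "Martin", "interests": ["daydreaming", "hacking"]}

encoded = pickle.dumps(record)   # serialization: object -> byte sequence
decoded = pickle.loads(encoded)  # deserialization: bytes -> object

assert decoded == record         # round-trips, but only Python can read it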

Text-Based Formats: JSON, XML, and CSV

Widely used and human-readable, these formats come with trade-offs:

  • JSON: Simple and browser-friendly, but loose about numbers: it does not distinguish integers from floats, and large integers lose precision when parsed as doubles (see the sketch after this list).
  • XML: Too verbose, often complex.
  • CSV: No built-in schema, requiring external definitions.
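
A minimal sketch of JSON’s numeric looseness (the "id" field and its value are made up for illustration): Python keeps the integer exact, but any consumer that parses JSON numbers as IEEE 754 doubles, as JavaScript does, silently corrupts it.

import json

big_id = 2**53 + 1                  # e.g. a 64-bit identifier: 9007199254740993

encoded = json.dumps({"id": big_id})
as_double = float(big_id)           # what a double-based parser would see

print(encoded)                      # {"id": 9007199254740993}
print(int(as_double))               # 9007199254740992 -- off by one, silently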

The Shift to Binary Encoding

Text formats are inefficient for large-scale data. Binary alternatives offer compactness and speed:

  • MessagePack, BSON, and Smile are binary encodings of JSON’s data model.
  • Thrift and Protocol Buffers (by Facebook and Google, respectively) introduce schema-based binary encoding, reducing storage size significantly.

MessagePack

Let’s look at an example record encoded with MessagePack, a binary encoding for JSON:

{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}


[Figure: byte-by-byte MessagePack encoding of the example record]

The binary encoding is 66 bytes long, only a little less than the 81 bytes taken by the textual JSON encoding (with whitespace removed).
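
As a quick check, here is a sketch using the msgpack package for Python (a third-party implementation, installed with pip install msgpack; any MessagePack library would do) that reproduces both byte counts:

import json
import msgpack  # pip install msgpack

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json))     # 81 bytes
print(len(as_msgpack))  # 66 bytes
assert msgpack.unpackb(as_msgpack) == record  # round-trips losslessly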

Thrift and Protocol Buffers

Protocol Buffers was originally developed at Google, Thrift at Facebook, and both were open sourced in 2007–08. We will see how they do much better, encoding the same record in as few as 33 bytes.

Both Thrift and Protocol Buffers require a schema for any data that is encoded. To encode the data in Thrift, you would describe the schema in the Thrift interface definition language (IDL) like this:

 

struct Person {
    1: required string userName,
    2: optional i64 favoriteNumber,
    3: optional list<string> interests
}

The equivalent schema definition for Protocol Buffers looks very similar:


message Person {
    required string user_name = 1;
    optional int64 favorite_number = 2;
    repeated string interests = 3;
}

What does data encoded with these schemas look like?

Thrift has two different binary encoding formats, called BinaryProtocol and CompactProtocol. Thrift and Protocol Buffers each come with a code generation tool that takes a schema definition like the ones shown above and produces classes that implement the schema in various programming languages. Your application code can call this generated code to encode or decode records of the schema.
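
For Protocol Buffers, that workflow looks roughly like this. A minimal sketch, assuming the schema above is saved as person.proto (the filename and generated module name are assumptions):

# Generate Python classes from the schema first:
#   protoc --python_out=. person.proto
from person_pb2 import Person  # module name protoc derives from person.proto

person = Person()
person.user_name = "Martin"
person.favorite_number = 1337
person.interests.extend(["daydreaming", "hacking"])

data = person.SerializeToString()  # encode to the compact binary format
decoded = Person.FromString(data)  # decode it back
print(decoded.user_name)           # Martin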

Starting with BinaryProtocol: this format stores field tags (the numbers 1, 2, and 3 from the schema) instead of field names, resulting in a total size of 59 bytes.


[Figure: Thrift BinaryProtocol encoding of the example record, 59 bytes]

The Thrift CompactProtocol encoding is semantically equivalent to BinaryProtocol, but as you can see below, it packs the same information into only 34 bytes.

It does this by packing the field type and tag number into a single byte, and by using variable-length integers. Rather than using a full eight bytes for the number 1337, it is encoded in two bytes, with the top bit of each byte used to indicate whether there are still more bytes to come. This means numbers between –64 and 63 are encoded in one byte, numbers between –8192 and 8191 are encoded in two bytes, etc. Bigger numbers use more bytes.


[Figure: Thrift CompactProtocol encoding of the example record, 34 bytes]
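
Here is that variable-length scheme sketched in Python (a toy re-implementation for illustration, not Thrift’s actual library code): a signed number is first ZigZag-mapped to a small unsigned one, then emitted seven bits at a time with the top bit flagging whether more bytes follow.

def encode_varint(n: int) -> bytes:
    z = (n << 1) ^ (n >> 63)  # ZigZag: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    out = bytearray()
    while True:
        byte = z & 0x7F              # low seven bits of what remains
        z >>= 7
        if z:
            out.append(byte | 0x80)  # top bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

print(len(encode_varint(63)))    # 1 byte  (covers -64..63)
print(len(encode_varint(8191)))  # 2 bytes (covers -8192..8191)
print(len(encode_varint(1337)))  # 2 bytes, as in the example record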

Protocol Buffers

Protocol Buffers, which uses a single binary encoding format, encodes the same data as shown below. While its bit packing differs slightly, it remains quite similar to Thrift’s CompactProtocol, fitting the same record into 33 bytes.

Notably, in the previously shown schemas, fields were marked as either required or optional, but this distinction does not affect how they are encoded—there is no indication in the binary data whether a field was required. Instead, the required designation enforces a runtime check, triggering an error if the field is missing, which can help catch bugs.


[Figure: Protocol Buffers encoding of the example record, 33 bytes]
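
A short sketch of that runtime check, reusing the generated class from earlier (this assumes a proto2 schema, where required is still permitted): serialization fails if the required field was never set.

from google.protobuf.message import EncodeError
from person_pb2 import Person  # hypothetical generated module from earlier

person = Person()
person.favorite_number = 1337  # user_name (required) deliberately left unset

try:
    person.SerializeToString()
except EncodeError as err:
    print("runtime check caught it:", err)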

Thrift’s CompactProtocol and Protocol Buffers thus achieve a 33-34 byte encoding of the same record that textual JSON stores in 81 bytes.

Schema Evolution in Thrift and Protocol Buffers

Schemas inevitably change over time—a process known as schema evolution. Thrift and Protocol Buffers are designed to handle these changes while maintaining both backward and forward compatibility.

  • Adding Fields: New fields can be introduced as long as they have unique tag numbers. Older versions of the code, unaware of these new fields, simply ignore them: the datatype annotation tells the parser how many bytes to skip, ensuring forward compatibility (see the sketch after this list).
  • Removing Fields: Fields can only be removed if they were originally optional. The tag number of a removed field should never be reused, as doing so would lead to incompatibility issues.
  • Changing Data Types: Modifying a field’s datatype carries risks. Expanding a 32-bit integer to 64 bits preserves backward compatibility, since new code can read old data (the parser fills in the missing bits with zeros). Forward compatibility suffers, however: old code reading new data into a 32-bit variable may truncate large values.
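
The skipping behavior from the first bullet can be illustrated with a toy tag/type/value decoder (a deliberately simplified stand-in for the real Thrift and Protocol Buffers wire formats): the type byte tells even old code how many bytes an unknown field occupies.

import struct

# Toy wire format: each field is (tag: 1 byte, type: 1 byte, payload).
# Type 1 = fixed 8-byte integer, type 2 = length-prefixed string.
KNOWN_TAGS = {1: "userName"}  # this "old" reader only knows tag 1

def decode(buf: bytes) -> dict:
    record, pos = {}, 0
    while pos < len(buf):
        tag, ftype = buf[pos], buf[pos + 1]
        pos += 2
        if ftype == 1:
            value, size = struct.unpack_from(">q", buf, pos)[0], 8
        else:
            size = buf[pos] + 1
            value = buf[pos + 1 : pos + size].decode()
        pos += size
        if tag in KNOWN_TAGS:               # known field: keep it
            record[KNOWN_TAGS[tag]] = value
        # unknown tag: its type told us the size, so it was safely skipped
    return record

msg = bytes([1, 2, 6]) + b"Martin" + bytes([2, 1]) + struct.pack(">q", 1337)
print(decode(msg))  # {'userName': 'Martin'} -- the newer field was skipped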

By following these principles, Thrift and Protocol Buffers enable seamless schema evolution while minimizing disruptions.

What’s Next?

In upcoming newsletters, we’ll explore Avro, Parquet, and real-world encoding use cases. Stay tuned for deeper insights into the evolution of data formats!


 
