Encoding and Storage Formats
Avro
Avro is another binary encoding format, one that is interestingly different from Protocol Buffers and Thrift, which we covered in the last newsletter. It was started in 2009 as a subproject of Hadoop, because Thrift was not a good fit for Hadoop's use cases.
Avro also uses a schema to specify the structure of the data being encoded. It has two schema languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily machine-readable.
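For example, a small person record might be described in Avro IDL like this (the record and field names here are illustrative, in the style of the Avro documentation):

```
record Person {
  string               userName;
  union { null, long } favoriteNumber = null;
  array<string>        interests;
}
```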
The equivalent JSON representation of that schema is as follows:
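```json
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}
```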
In the encoded byte sequence there are no field tags and no datatype markers, which makes it the most compact of the three encodings: the example record above comes to just 32 bytes.
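A quick way to see this compactness is a round trip through the fastavro library. A minimal sketch, assuming the illustrative Person schema above and made-up record values:

```python
import io

import fastavro

schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# schemaless_writer emits only the values: strings are length-prefixed,
# integers are variable-length zigzag varints, and there are no field tags.
buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, record)
print(len(buf.getvalue()))  # 32
```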
To parse the binary data, you read the fields sequentially, using the schema to determine each field's datatype. Correct decoding therefore requires the reading code to use the exact schema the data was written with; any mismatch leads to incorrect decoding. So how does Avro support schema evolution?
The writer’s schema and the reader’s schema
With Avro, when an application wants to encode some data (to write it to a file or database, to send it over the network, etc.), it encodes the data using whatever version of the schema it knows about (for example, that schema may be compiled into the application). This is known as the writer's schema.

When an application wants to decode some data (read it from a file or database, receive it from the network, etc.), it is expecting the data to be in some schema, which is known as the reader's schema. That is the schema the application code is relying on.
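fastavro exposes this resolution directly: the reader is given both schemas and translates the data from the writer's schema into the reader's schema. A minimal sketch, assuming a hypothetical newer reader schema that drops one field and adds another with a default:

```python
import io

import fastavro

writer_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}

# Hypothetical newer version: "interests" was removed, and "city" was
# added with a default value.
reader_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "city", "type": "string", "default": "unknown"},
    ],
}

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
})
buf.seek(0)

# Schema resolution: fields unknown to the reader are skipped; fields missing
# from the data are filled in from the reader schema's defaults.
decoded = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
print(decoded)  # {'userName': 'Martin', 'favoriteNumber': 1337, 'city': 'unknown'}
```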
Key points:
Avro supports schema evolution by resolving the writer's and reader's schemas, so different producer and consumer versions remain compatible. The remaining question is how the reader learns the writer's schema. In a database, records may be written under different schemas over time, so each record carries a schema version number and a list of schema versions is maintained alongside the data. For network communication (REST or RPC calls), the two processes can negotiate the schema version during connection setup and use it for the lifetime of the session.
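As a sketch of the database variant, here is a hypothetical in-memory registry that prefixes every encoded record with a 4-byte schema version number (the registry, versions, and schemas are all made up for illustration):

```python
import io
import struct

import fastavro

# Hypothetical schema registry: version number -> schema.
SCHEMAS = {
    1: {"type": "record", "name": "Person", "fields": [
        {"name": "userName", "type": "string"}]},
    2: {"type": "record", "name": "Person", "fields": [
        {"name": "userName", "type": "string"},
        {"name": "city", "type": "string", "default": "unknown"}]},
}
LATEST = 2

def encode(record: dict, version: int) -> bytes:
    buf = io.BytesIO()
    buf.write(struct.pack(">I", version))  # 4-byte schema version prefix
    fastavro.schemaless_writer(buf, SCHEMAS[version], record)
    return buf.getvalue()

def decode(payload: bytes) -> dict:
    (version,) = struct.unpack(">I", payload[:4])
    buf = io.BytesIO(payload[4:])
    # Look up the writer's schema by version, then resolve to the latest.
    return fastavro.schemaless_reader(buf, SCHEMAS[version], SCHEMAS[LATEST])

blob = encode({"userName": "Martin"}, version=1)
print(decode(blob))  # {'userName': 'Martin', 'city': 'unknown'}
```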
Beyond Encoding: Data Storage in the Big Data World
As big data evolves, Apache Avro remains a key tool for data serialization and management. Its ability to handle complex data structures and dynamic schemas ensures its role as a cornerstone in distributed systems and streaming platforms like Apache Kafka and Apache Pulsar.
Similarly, Apache Parquet, with its columnar storage format, excels in data compression and retrieval, making it the preferred choice for both batch and stream processing.
Parquet
Parquet is a column-oriented storage format. The idea behind column-oriented storage is simple: don't store all the values from one row together; store all the values from each column together instead.
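A minimal sketch of the row-versus-column layout with pyarrow (the table contents and file name are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny table: three rows, three columns.
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["IN", "US", "IN"],
    "amount": [10.5, 3.2, 8.0],
})

# On disk, Parquet lays this out column by column, not row by row.
pq.write_table(table, "events.parquet")

# Column orientation pays off on reads: fetch only the columns you need.
amounts = pq.read_table("events.parquet", columns=["amount"])
print(amounts)
```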
Parquet Under the hood
Parquet stores data in RowGroups, which are divided into Column Chunks, each containing pages that store the actual data, dictionary entries, and index information. The columns are compressed and encoded for efficient storage and retrieval. Metadata is stored at both the file and RowGroup levels, including schema and compression details. Definition and repetition levels handle nullable and nested fields, optimizing for complex data structures.
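pyarrow exposes this structure through its metadata API. A small sketch, assuming the hypothetical events.parquet file from the previous snippet:

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("events.parquet").metadata
print(meta.num_rows, meta.num_row_groups)      # file-level metadata

rg = meta.row_group(0)                         # first RowGroup
col = rg.column(0)                             # first column chunk in it
print(col.path_in_schema, col.compression, col.encodings)

# Per-chunk min/max statistics act as the index information mentioned above.
if col.statistics is not None:
    print(col.statistics.min, col.statistics.max)
```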
The actual data in the pages can be encoded in several different ways.
Bitmap Encoding:
Parquet utilizes bitmap encoding, a technique particularly effective in data warehouses: data is represented as bit arrays, enabling efficient filtering with bitwise AND and OR operations that can drastically speed up queries.
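A toy illustration of the idea, with numpy boolean arrays standing in for bitmaps (the column values are made up):

```python
import numpy as np

# A low-cardinality column, as data warehouses often have.
country = np.array(["IN", "US", "IN", "DE", "US", "IN"])
amount = np.array([10.5, 3.2, 8.0, 1.0, 7.5, 2.0])

# One bitmap per distinct value: bit i is set if row i holds that value.
bitmaps = {value: (country == value) for value in np.unique(country)}

# WHERE country IN ('IN', 'DE')  ->  bitwise OR of two bitmaps
mask = bitmaps["IN"] | bitmaps["DE"]

# ... AND amount > 5  ->  bitwise AND with another predicate's bitmap
mask &= amount > 5

print(np.nonzero(mask)[0])  # row ids satisfying both predicates: [0 2]
```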
Extending the Parquet Encoding Techniques
Parquet combines these with further techniques: dictionary encoding replaces repeated values with small integer ids, run-length encoding collapses runs of identical ids, and bit-packing stores those ids in the minimum number of bits. Here is a toy example which combines these techniques:
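A minimal sketch in plain Python (the column values are made up; a real Parquet writer applies these encodings at the page level):

```python
from itertools import groupby

# A column with many repeats: ideal for dictionary + run-length encoding.
column = ["IN", "IN", "IN", "US", "US", "IN", "IN", "IN", "IN", "DE"]

# Dictionary encoding: replace each value with a small integer id.
dictionary = {v: i for i, v in enumerate(dict.fromkeys(column))}
ids = [dictionary[v] for v in column]  # [0, 0, 0, 1, 1, 0, 0, 0, 0, 2]

# Run-length encoding: collapse runs of identical ids into (id, count) pairs.
runs = [(key, len(list(group))) for key, group in groupby(ids)]
print(runs)  # [(0, 3), (1, 2), (0, 4), (2, 1)]

# Bit-packing (conceptually): with only 3 distinct ids, each id needs just
# 2 bits, instead of the several bytes each original string occupied.
```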
Compression can additionally be enabled on all pages, using codecs such as Snappy, gzip, and LZO.
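With pyarrow, compression is a single argument to the writer, and a codec can even be chosen per column (the table and file names are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "country": ["IN", "US", "IN"]})

# One codec for the whole file...
pq.write_table(table, "events_snappy.parquet", compression="snappy")

# ...or a different codec per column.
pq.write_table(table, "events_mixed.parquet",
               compression={"user_id": "gzip", "country": "snappy"})
```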
Memory Bandwidth and Vectorized Processing
For data warehouse queries scanning millions of rows, disk-to-memory bandwidth is a key bottleneck, but efficient CPU usage is just as important. Columnar storage reduces data movement and optimizes CPU cache usage, enabling fast, tight-loop processing without costly function calls. Compressed columns fit more data into the L1 cache, allowing operations like bitwise AND and OR to run directly on them using vectorized processing, which leverages SIMD instructions for high performance. Libraries like Apache Arrow, Parquet, Polars, and TensorFlow NumPy utilize these techniques to accelerate data processing.
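A rough numpy illustration of the gap between a vectorized tight loop and per-element Python (array sizes are arbitrary; timings will vary by machine):

```python
import numpy as np

a = np.random.randint(0, 2**31, size=10_000_000, dtype=np.int32)
b = np.random.randint(0, 2**31, size=10_000_000, dtype=np.int32)

# Vectorized: numpy runs a tight C loop over contiguous column data,
# which the CPU can execute with SIMD instructions.
c = a & b

# The per-element Python equivalent pays interpreter and function-call
# overhead on every value, and is typically orders of magnitude slower:
# c = [x & y for x, y in zip(a, b)]
```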
Conclusion
Avro and Parquet are indispensable tools in the big data ecosystem. Avro excels in data serialization and schema evolution, while Parquet shines in efficient columnar storage for analytics. Understanding their strengths and use cases is crucial for building robust and scalable data solutions. Frameworks like Hudi leverage both, using Avro for efficient writes and asynchronously converting data to Parquet for optimized reads, ensuring a balance between performance and flexibility.
I hope you found this newsletter informative! Stay tuned for more data insights in my next edition.
Good read!