Encoding and Storage Formats
Avro
Avro is another binary encoding format, one that is interestingly different from Protocol Buffers and Thrift, which we covered in the last newsletter. It was started in 2009 as a subproject of Hadoop, because Thrift was not a good fit for Hadoop's use cases.
Avro also uses a schema to specify the structure of the data being encoded. It has two schema languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily machine-readable.
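For example, a small person record might be described in Avro IDL like this (the record and field names here are illustrative, in the style of the Avro documentation):

```
record Person {
  string               userName;
  union { null, long } favoriteNumber = null;
  array<string>        interests;
}
```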
The equivalent JSON representation of that schema is as follows:
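```json
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}
```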
In the encoded byte sequence there are no field tags and no datatype markers, which makes it the most compact of the three encodings: the example record above comes to just 32 bytes.
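A quick way to see this compactness is a round trip through the fastavro library. A minimal sketch, assuming the illustrative Person schema above and made-up record values:

```python
import io

import fastavro

schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# schemaless_writer emits only the values: strings are length-prefixed,
# integers are variable-length zigzag varints, and there are no field tags.
buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, record)
print(len(buf.getvalue()))  # 32
```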
To parse the binary data, you read the fields sequentially, using the schema to determine each field's datatype. Correct decoding therefore requires the reading code to use the exact schema the data was written with; any mismatch leads to incorrect decoding. So how does Avro support schema evolution?
The writer’s schema and the reader’s schema
With Avro, when an application wants to encode some data (to write it to a file or database, to send it over the network, etc.), it encodes the data using whatever version of the schema it knows about (for example, that schema may be compiled into the application). This is known as the writer's schema.

When an application wants to decode some data (read it from a file or database, receive it from the network, etc.), it is expecting the data to be in some schema, which is known as the reader's schema. That is the schema the application code is relying on.
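fastavro exposes this resolution directly: the reader is given both schemas and translates the data from the writer's schema into the reader's schema. A minimal sketch, assuming a hypothetical newer reader schema that drops one field and adds another with a default:

```python
import io

import fastavro

writer_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}

# Hypothetical newer version: "interests" was removed, and "city" was
# added with a default value.
reader_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "city", "type": "string", "default": "unknown"},
    ],
}

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
})
buf.seek(0)

# Schema resolution: fields unknown to the reader are skipped; fields missing
# from the data are filled in from the reader schema's defaults.
decoded = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
print(decoded)  # {'userName': 'Martin', 'favoriteNumber': 1337, 'city': 'unknown'}
```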
Key points:
Avro supports schema evolution by resolving the writer's and reader's schemas, so different producer and consumer versions remain compatible. The remaining question is how the reader learns the writer's schema. In a database, records may be written under different schemas over time, so each record carries a schema version number and a list of schema versions is maintained alongside the data. For network communication (REST or RPC calls), the two processes can negotiate the schema version during connection setup and use it for the lifetime of the session.
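As a sketch of the database variant, here is a hypothetical in-memory registry that prefixes every encoded record with a 4-byte schema version number (the registry, versions, and schemas are all made up for illustration):

```python
import io
import struct

import fastavro

# Hypothetical schema registry: version number -> schema.
SCHEMAS = {
    1: {"type": "record", "name": "Person", "fields": [
        {"name": "userName", "type": "string"}]},
    2: {"type": "record", "name": "Person", "fields": [
        {"name": "userName", "type": "string"},
        {"name": "city", "type": "string", "default": "unknown"}]},
}
LATEST = 2

def encode(record: dict, version: int) -> bytes:
    buf = io.BytesIO()
    buf.write(struct.pack(">I", version))  # 4-byte schema version prefix
    fastavro.schemaless_writer(buf, SCHEMAS[version], record)
    return buf.getvalue()

def decode(payload: bytes) -> dict:
    (version,) = struct.unpack(">I", payload[:4])
    buf = io.BytesIO(payload[4:])
    # Look up the writer's schema by version, then resolve to the latest.
    return fastavro.schemaless_reader(buf, SCHEMAS[version], SCHEMAS[LATEST])

blob = encode({"userName": "Martin"}, version=1)
print(decode(blob))  # {'userName': 'Martin', 'city': 'unknown'}
```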
Beyond Encoding: Data Storage in the Big Data World
As big data evolves, Apache Avro remains a key tool for data serialization and management. Its ability to handle complex data structures and dynamic schemas ensures its role as a cornerstone in distributed systems and streaming platforms like Apache Kafka and Apache Pulsar.
Similarly, Apache Parquet, with its columnar storage format, excels in data compression and retrieval, making it the preferred choice for both batch and stream processing.
Parquet
Parquet is a column-oriented storage format. The idea behind column-oriented storage is simple: don't store all the values from one row together; store all the values from each column together instead.
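A minimal sketch of the row-versus-column layout with pyarrow (the table contents and file name are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny table: three rows, three columns.
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["IN", "US", "IN"],
    "amount": [10.5, 3.2, 8.0],
})

# On disk, Parquet lays this out column by column, not row by row.
pq.write_table(table, "events.parquet")

# Column orientation pays off on reads: fetch only the columns you need.
amounts = pq.read_table("events.parquet", columns=["amount"])
print(amounts)
```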
Parquet Under the hood
Parquet stores data in RowGroups, which are divided into Column Chunks, each containing pages that store the actual data, dictionary entries, and index information. The columns are compressed and encoded for efficient storage and retrieval. Metadata is stored at both the file and RowGroup levels, including schema and compression details. Definition and repetition levels handle nullable and nested fields, optimizing for complex data structures.
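pyarrow exposes this structure through its metadata API. A small sketch, assuming the hypothetical events.parquet file from the previous snippet:

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("events.parquet").metadata
print(meta.num_rows, meta.num_row_groups)      # file-level metadata

rg = meta.row_group(0)                         # first RowGroup
col = rg.column(0)                             # first column chunk in it
print(col.path_in_schema, col.compression, col.encodings)

# Per-chunk min/max statistics act as the index information mentioned above.
if col.statistics is not None:
    print(col.statistics.min, col.statistics.max)
```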
The actual data in the pages can be encoded in several different ways.
Bitmap Encoding:
Parquet utilizes bitmap encoding, a technique particularly effective in data warehouses: data is represented as bit arrays, enabling efficient filtering with bitwise AND and OR operations that can drastically speed up queries.
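A toy illustration of the idea, with numpy boolean arrays standing in for bitmaps (the column values are made up):

```python
import numpy as np

# A low-cardinality column, as data warehouses often have.
country = np.array(["IN", "US", "IN", "DE", "US", "IN"])
amount = np.array([10.5, 3.2, 8.0, 1.0, 7.5, 2.0])

# One bitmap per distinct value: bit i is set if row i holds that value.
bitmaps = {value: (country == value) for value in np.unique(country)}

# WHERE country IN ('IN', 'DE')  ->  bitwise OR of two bitmaps
mask = bitmaps["IN"] | bitmaps["DE"]

# ... AND amount > 5  ->  bitwise AND with another predicate's bitmap
mask &= amount > 5

print(np.nonzero(mask)[0])  # row ids satisfying both predicates: [0 2]
```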
Extending the Parquet Encoding Techniques
Parquet combines these with further techniques: dictionary encoding replaces repeated values with small integer ids, run-length encoding collapses runs of identical ids, and bit-packing stores those ids in the minimum number of bits. Here is a toy example which combines these techniques:
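A minimal sketch in plain Python (the column values are made up; a real Parquet writer applies these encodings at the page level):

```python
from itertools import groupby

# A column with many repeats: ideal for dictionary + run-length encoding.
column = ["IN", "IN", "IN", "US", "US", "IN", "IN", "IN", "IN", "DE"]

# Dictionary encoding: replace each value with a small integer id.
dictionary = {v: i for i, v in enumerate(dict.fromkeys(column))}
ids = [dictionary[v] for v in column]  # [0, 0, 0, 1, 1, 0, 0, 0, 0, 2]

# Run-length encoding: collapse runs of identical ids into (id, count) pairs.
runs = [(key, len(list(group))) for key, group in groupby(ids)]
print(runs)  # [(0, 3), (1, 2), (0, 4), (2, 1)]

# Bit-packing (conceptually): with only 3 distinct ids, each id needs just
# 2 bits, instead of the several bytes each original string occupied.
```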
Compression can additionally be enabled on all pages, using codecs such as Snappy, gzip, and LZO.
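With pyarrow, compression is a single argument to the writer, and a codec can even be chosen per column (the table and file names are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "country": ["IN", "US", "IN"]})

# One codec for the whole file...
pq.write_table(table, "events_snappy.parquet", compression="snappy")

# ...or a different codec per column.
pq.write_table(table, "events_mixed.parquet",
               compression={"user_id": "gzip", "country": "snappy"})
```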
Memory Bandwidth and Vectorized Processing
For data warehouse queries scanning millions of rows, disk-to-memory bandwidth is a key bottleneck, but efficient CPU usage is just as important. Columnar storage reduces data movement and optimizes CPU cache usage, enabling fast, tight-loop processing without costly function calls. Compressed columns fit more data into the L1 cache, allowing operations like bitwise AND and OR to run directly on them using vectorized processing, which leverages SIMD instructions for high performance. Libraries like Apache Arrow, Parquet, Polars, and TensorFlow NumPy utilize these techniques to accelerate data processing.
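A rough numpy illustration of the gap between a vectorized tight loop and per-element Python (array sizes are arbitrary; timings will vary by machine):

```python
import numpy as np

a = np.random.randint(0, 2**31, size=10_000_000, dtype=np.int32)
b = np.random.randint(0, 2**31, size=10_000_000, dtype=np.int32)

# Vectorized: numpy runs a tight C loop over contiguous column data,
# which the CPU can execute with SIMD instructions.
c = a & b

# The per-element Python equivalent pays interpreter and function-call
# overhead on every value, and is typically orders of magnitude slower:
# c = [x & y for x, y in zip(a, b)]
```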
Conclusion
Avro and Parquet are indispensable tools in the big data ecosystem. Avro excels in data serialization and schema evolution, while Parquet shines in efficient columnar storage for analytics. Understanding their strengths and use cases is crucial for building robust and scalable data solutions. Frameworks like Hudi leverage both, using Avro for efficient writes and asynchronously converting data to Parquet for optimized reads, ensuring a balance between performance and flexibility.
I hope you found this newsletter informative! Stay tuned for more data insights in my next edition.
Good read!