Encoding and Evolution

When a data format or schema changes, we often need to change the application code as well. In large-scale applications, if the change is in the server-side code, we might do a rolling upgrade: deploy the change to a few nodes of the cluster, check that it works smoothly, and only then roll it out to the whole cluster. If the change is in the client-side application, we are at the mercy of the user, who may not install the new version for some time.

That means old and new data schemas can coexist with old and new application versions at the same time, so we need both of the following:

  • Backward compatibility: a newer version of the application can read data written by an older version.
  • Forward compatibility: an older version of the application can read data written by a newer version. This is trickier, as the old version must somehow ignore additions made by newer versions and stay functional.
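A minimal sketch of forward compatibility in Python (the field names here are made up): an old-version reader stays functional by picking out only the fields it knows about and ignoring the rest:

```python
import json

# Record written by a NEWER version of the app: it added "nickname".
new_record = json.dumps({"id": 1, "name": "Ahmed", "nickname": "A."})

def load_user_v1(raw: str) -> dict:
    """Old-version reader: picks only the fields it knows about and
    ignores any additions, which keeps it forward compatible."""
    data = json.loads(raw)
    return {"id": data["id"], "name": data["name"]}

user = load_user_v1(new_record)
assert user == {"id": 1, "name": "Ahmed"}  # unknown "nickname" is ignored
```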


Formats for Encoding Data

Programs work with data in two different representations:

  1. In-memory data kept in objects, structs, lists, arrays, hash tables, and trees. These structures are optimized for efficient access and manipulation by the CPU.
  2. When you want to write data to a file or send it over the network, you have to encode it as a self-contained sequence of bytes, for example as JSON.

The translation from the in-memory structures to the sequence of bytes representation is called encoding, serialization, or marshalling. The opposite translation is called decoding, deserialization, or unmarshalling.
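For example, in Python, JSON encoding and decoding is a simple round trip between the two representations:

```python
import json

profile = {"username": "ahmed123", "age": 27}  # in-memory structure

encoded = json.dumps(profile)   # encoding / serialization: to a byte-friendly string
decoded = json.loads(encoded)   # decoding / deserialization: back into memory

assert decoded == profile
```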

Languages' built-in encoding functions tie us to the same programming language for decoding. They also tend to perform badly; Java's built-in serialization is a notorious example. Worse, they pose security concerns, because arbitrary classes need to be instantiated on the decoding side: if attackers can get our application to decode byte sequences of their choosing, they can run arbitrary code in our system. So using a language's built-in functions for encoding/decoding is a bad idea.
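Python's pickle has the same weakness. This deliberately harmless sketch shows that merely decoding a byte sequence can invoke an attacker-chosen function (str.upper here stands in for something destructive):

```python
import pickle

class Malicious:
    # pickle calls __reduce__ to decide how to reconstruct the object;
    # an attacker can make it invoke ANY callable with chosen arguments.
    def __reduce__(self):
        return (str.upper, ("pwned",))  # harmless stand-in for real damage

payload = pickle.dumps(Malicious())
result = pickle.loads(payload)   # decoding runs attacker-chosen code
assert result == "PWNED"
```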


JSON and XML

These are standardized encodings supported by many programming languages, but they have downsides. XML makes no distinction between a string and a number. JSON does, but it still has number problems caused by the decoding language: in JavaScript, all numbers are IEEE-754 doubles, so integers larger than 2^53 are silently rounded, which can cause bugs. This is why Twitter also sends its tweet IDs as strings, to preserve the exact value.
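The underlying IEEE-754 limitation is easy to demonstrate directly in Python, since its float type is the same 64-bit double JavaScript uses for all numbers:

```python
import json

# Doubles have a 53-bit mantissa, so integers above 2**53
# cannot all be represented exactly.
big_id = 2**53 + 1  # 9007199254740993

assert float(big_id) == float(2**53)  # the +1 silently disappears

# Sending the ID as a string sidesteps the problem entirely:
assert json.loads(json.dumps({"id": str(big_id)}))["id"] == "9007199254740993"
```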

CSV has its own problems, for example: what happens when the value of a column contains a comma or a newline?
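In practice, CSV libraries work around this by quoting such values. A quick check with Python's csv module:

```python
import csv
import io

row = ["ahmed123", 'bio with a comma, and a "quote"', "line1\nline2"]

buf = io.StringIO()
csv.writer(buf).writerow(row)  # quotes fields containing , " or newlines

# Reading it back recovers the original three fields intact.
round_tripped = next(csv.reader(io.StringIO(buf.getvalue())))
assert round_tripped == row
```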

Despite all of these problems, JSON, XML, and CSV are good enough for many purposes, especially when the organizations exchanging data agree on the format, which avoids most interoperability bugs.


Binary Encoding

We can binary encode JSON to transmit it over the network using MessagePack, Thrift BinaryProtocol, Thrift CompactProtocol, or Protocol Buffers.

For example, this JSON is 81 bytes:

{
  "username": "ahmed123",
  "email": "ahmed@example.com",
  "age": 27,
  "is_active": true
}        

We can encode it in just 66 bytes using MessagePack, which drops the whitespace and replaces the textual syntax with compact binary type and length markers.
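The exact byte counts depend on how much whitespace the textual form uses, but the effect of compaction is easy to check with the standard library (MessagePack itself requires the third-party msgpack package, so this sketch only strips whitespace):

```python
import json

profile = {"username": "ahmed123", "email": "ahmed@example.com",
           "age": 27, "is_active": True}

pretty = json.dumps(profile, indent=2).encode()
compact = json.dumps(profile, separators=(",", ":")).encode()

# Dropping whitespace alone already saves bytes; MessagePack goes further
# by replacing the textual syntax with binary length prefixes and type tags.
assert len(compact) < len(pretty)
```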

In Thrift BinaryProtocol, Thrift CompactProtocol, and Protocol Buffers, we define a schema on both the sender and the receiver side. This lets us generate the encoding code on the sender side and the decoding code on the receiver side, regardless of the programming languages used.

  • The same JSON above can be encoded in just 59 bytes using BinaryProtocol by removing the keys from the encoded byte sequence and referring to them by the tags in the schema definitions.
  • It can be encoded in only 34 bytes using CompactProtocol, which, besides omitting the JSON keys, uses only as many bytes as each number actually needs (not necessarily the full 8 bytes).
  • The same JSON can be encoded in 33 bytes using Protocol Buffers, which is very close to CompactProtocol.
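As a sketch, a Protocol Buffers schema for the record above might look like this (the message and field names are hypothetical). The field tags on the right are what the encoded bytes refer to instead of the key names, which is why they must never be renumbered once in use:

```protobuf
syntax = "proto3";

// Hypothetical schema matching the JSON record above.
message User {
  string username  = 1;  // tags 1..4 appear in the encoded bytes
  string email     = 2;  // instead of the key names
  int32  age       = 3;
  bool   is_active = 4;
}
```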

In Thrift and Protocol Buffers, preserving the tag numbers and handling the required fields and field type changes correctly is crucial for preserving backward and forward compatibility.

For example, changing an optional (single-valued) field to repeated in Protocol Buffers preserves backward compatibility (new code reading old data sees a list with zero or one elements) and forward compatibility (at least the code will not break: old code reading new data sees only the last element of the list).


Avro

Avro is the most compact of these formats and is commonly used in big data systems. It can encode the same JSON above in only 32 bytes, slightly smaller than Protocol Buffers. Like CompactProtocol, it uses variable-length numbers, but it includes no field tags or type annotations in the encoded data at all: the encoding is just a sequence of values concatenated in schema order.

Avro resolves the differences between the reader's schema and the writer's schema by matching fields up by name, so parsing works correctly even if the order of the fields has changed.

The writer's schema can be embedded once at the beginning of a large file. In a database, a schema version number can be stored at the beginning of every record, with all schema versions kept in the database for easy retrieval when decoding. For data sent over a network connection, the two processes can negotiate the schema during connection setup.

Default values for added or removed fields also help preserve compatibility.
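For illustration, here is a hypothetical Avro schema for the record above. Avro schemas are themselves written in JSON, and giving a field a default value is what lets readers and writers with different schema versions stay compatible when that field is added or removed:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "username",  "type": "string"},
    {"name": "email",     "type": "string"},
    {"name": "age",       "type": "int"},
    {"name": "is_active", "type": "boolean", "default": true}
  ]
}
```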

Code generation is less important in dynamically typed languages, but it makes sense in statically typed languages because of compile-time type checks, IDE autocompletion, and efficient in-memory structures for holding the data.


In General, Binary Formats Have Nice Properties Over Textual Formats Like JSON:

  1. They are smaller in size because we remove spaces and property names.
  2. Schema acts as good documentation.
  3. Code generation for statically typed languages enables compile-time type checks.
  4. Keeping a database of schemas enables us to do better forward/backward compatibility.


Modes of Data Flow Between Different Processes

  1. Via databases
  2. Via service calls (REST, RPC)
  3. Asynchronous messages


Dataflow via Databases

During rolling upgrades, we need to take care of both forward and backward compatibility. For example, old data written by old code with the old schema shouldn't break new code using the new schema. Conversely, extra columns written by new code with the new schema should not be lost when a record is decoded and re-encoded by old code. This can be handled with default values and with care at the application level.
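A sketch of the re-encoding hazard in Python (the record and field names are made up): old code that round-trips the whole decoded document, instead of copying only the fields it knows into a fresh object, avoids dropping columns added by newer code:

```python
import json

# Row written by NEW code, which added "phone" to the schema.
stored = json.dumps({"id": 7, "name": "Ahmed", "phone": "+20100000000"})

def rename_user_old_code(raw: str, new_name: str) -> str:
    """Old code updating a record: it decodes the full document, changes
    only the field it knows, and re-encodes everything else untouched,
    so columns added by newer code are not silently dropped."""
    data = json.loads(raw)
    data["name"] = new_name
    return json.dumps(data)

updated = json.loads(rename_user_old_code(stored, "Mohamed"))
assert updated["phone"] == "+20100000000"  # the newer field survived
assert updated["name"] == "Mohamed"
```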


Dataflow Through Services: REST and RPC

Web Services

REST is a design philosophy that makes use of HTTP features like content negotiation, authentication, caching, and so on. We can use definition formats like OpenAPI (Swagger) to describe RESTful APIs.

SOAP is a protocol that runs over HTTP but deliberately avoids most HTTP features. It uses XML, and the web service is described using the Web Services Description Language (WSDL), which enables code generation so the client can access the remote service through local method calls. This is useful with statically typed languages.

In case of compatibility-breaking changes, the service provider ends up maintaining multiple versions of the service side-by-side (e.g., using API versioning).


RPC

RPC was supposed to make remote calls look identical to local calls, but this is impossible because of network realities such as latency, timeouts, and retries. RPC is still used for calls between services within the same organization, because binary RPC protocols are typically faster than JSON over REST. But REST is better for experimentation and debugging, as we can make calls from a browser or an HTTP client like Postman, without any generated code.


Message-Passing Dataflow

Benefits:

  • Messages are preserved in the broker (e.g., Kafka, RabbitMQ), so they aren't lost if the receiver is busy or down.
  • Communication is asynchronous: the publisher sends the message and forgets about it.
  • Request/response dataflow is still possible: the sender publishes a message to a topic, the receiver consumes it, and the receiver then publishes an acknowledgment to another topic that the sender is subscribed to.
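A minimal sketch of that request/response pattern, using two in-process Python queues as stand-ins for broker topics (the topic layout and correlation-id scheme are illustrative):

```python
import queue

# Two in-process queues stand in for two broker topics.
requests, replies = queue.Queue(), queue.Queue()

# Sender publishes a request tagged with a correlation id, then moves on.
requests.put({"correlation_id": 42, "body": "ping"})

# Receiver consumes the request and publishes an acknowledgment
# to the reply topic the sender is subscribed to.
msg = requests.get()
replies.put({"correlation_id": msg["correlation_id"], "body": "pong"})

ack = replies.get()
assert ack == {"correlation_id": 42, "body": "pong"}
```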

Distributed actor model frameworks (such as Akka, Orleans, and Erlang OTP) also build on this kind of message-passing dataflow.
