Encoding and Evolution
When the data format or schema changes in our applications, we usually need to change the application code as well. In large-scale apps, if the change is in the server-side application, we might do a rolling upgrade: deploy the new version to a few nodes of the cluster, check that it runs smoothly, and then roll it out to the remaining nodes. If the change is in a client-side application, we are at the mercy of the user, who may not install the new version for some time.
This means that old and new data schemas may coexist with old and new versions of the code at the same time. To keep everything working, we need two kinds of compatibility:
- Backward compatibility: newer code can read data that was written by older code.
- Forward compatibility: older code can read data that was written by newer code.
Formats for Encoding Data
Programs work with data in two different representations:
- In memory, as objects, structs, lists, hash tables, trees, and so on, optimized for efficient access and manipulation by the CPU.
- As a self-contained sequence of bytes (a JSON document, for example) when data is written to a file or sent over the network.
The translation from the in-memory structures to the sequence of bytes representation is called encoding, serialization, or marshalling. The opposite translation is called decoding, deserialization, or unmarshalling.
Built-in encoding functions tie us to the same programming language for decoding. They also tend to perform badly; Java's built-in serialization is notorious for this. Worse, they pose security concerns, because arbitrary classes need to be instantiated on the decoding side: if attackers can send arbitrary byte sequences to be decoded by our application, they can get arbitrary code to run in our system and do harmful things. So using a language's built-in functions for encoding/decoding is a bad idea.
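Python's built-in pickle module has exactly this problem. A minimal sketch of the attack (the payload class and echo command are purely illustrative):

    import os
    import pickle

    # pickle lets a class control its own deserialization via __reduce__,
    # so unpickling attacker-supplied bytes can invoke any callable.
    class Payload:
        def __reduce__(self):
            return (os.system, ("echo attacker code runs here",))

    malicious_bytes = pickle.dumps(Payload())

    # The receiver only calls pickle.loads, yet the shell command above
    # executes during decoding -- never unpickle untrusted input.
    pickle.loads(malicious_bytes)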
JSON and XML
These are standardized encodings that can be read and written in many programming languages, but they have downsides. In XML (and CSV), you cannot distinguish a number from a string that happens to consist of digits without an external schema. JSON does distinguish strings from numbers, but number problems can still arise from the decoding language: in JavaScript, all numbers are IEEE 754 double-precision floats, so integers larger than 2^53 are silently rounded, which may result in bugs. This is why Twitter's API returns each tweet ID twice, once as a JSON number and once as a decimal string, to preserve the actual value.
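We can reproduce the problem from Python (whose integers are arbitrary-precision) by forcing such an ID through a 64-bit float, just as JavaScript would:

    import json

    # An ID larger than 2**53 cannot be represented exactly
    # as an IEEE 754 double (JavaScript's only number type).
    tweet_id = 10765432100123456789

    # Python round-trips it exactly, since its ints are arbitrary precision...
    assert json.loads(json.dumps({"id": tweet_id}))["id"] == tweet_id

    # ...but a 64-bit float, as JavaScript uses, silently rounds it.
    print(int(float(tweet_id)))  # 10765432100123457536, not the original

    # Hence the workaround: also send the ID as a string.
    doc = {"id": tweet_id, "id_str": str(tweet_id)}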
CSV has its own problems: what happens if the value of a column contains a comma or a newline? Quoting rules exist, but not every parser implements them correctly.
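Python's csv module does implement the quoting rules; a quick sketch:

    import csv
    import io

    buf = io.StringIO()
    # The second field contains both a comma and a newline;
    # csv.writer wraps it in quotes so it survives as one field.
    csv.writer(buf).writerow(["ahmed123", "Cairo, Egypt\nline two", 27])

    rows = list(csv.reader(io.StringIO(buf.getvalue())))
    # The fields round-trip intact -- though note the number
    # comes back as a string, another weakness of CSV.
    assert rows[0] == ["ahmed123", "Cairo, Egypt\nline two", "27"]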
Despite all of these problems, JSON, XML, and CSV are good enough for many purposes, especially when different organizations must agree on the format of the transmitted data; in those cases, getting everyone to agree tends to matter far more than the efficiency or elegance of the format.
Binary Encoding
We can encode JSON-like data in binary for transmission over the network using formats such as MessagePack, Thrift BinaryProtocol, Thrift CompactProtocol, or Protocol Buffers.
For example, this JSON is 81 bytes:
    {
      "username": "ahmed123",
      "email": "ahmed@example.com",
      "age": 27,
      "is_active": true
    }
MessagePack can encode the same object in roughly 66 bytes: it drops the whitespace and textual punctuation entirely, replacing them with compact binary type-and-length markers.
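A sketch of the comparison using the third-party msgpack package (exact byte counts depend on the data):

    import json
    import msgpack  # pip install msgpack

    user = {
        "username": "ahmed123",
        "email": "ahmed@example.com",
        "age": 27,
        "is_active": True,
    }

    as_json = json.dumps(user, separators=(",", ":")).encode()  # minified JSON
    as_msgpack = msgpack.packb(user)

    # msgpack is smaller: binary type/length markers replace
    # the quotes, colons, commas, and braces of JSON.
    print(len(as_json), len(as_msgpack))
    assert msgpack.unpackb(as_msgpack) == user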
In Thrift BinaryProtocol, Thrift CompactProtocol, and Protocol Buffers, we define a schema that both the sender and the receiver know. From it we can generate encoding code on the sender side and decoding code on the receiver side, regardless of the programming languages used on either side.
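For example, a Protocol Buffers schema for the user record above might look like this (the field names and tag numbers here are illustrative); protoc then generates encoder/decoder classes for each side's language:

    syntax = "proto3";

    message User {
      // The tag numbers (1..4) are what appear in the encoded bytes;
      // the names exist only in the schema and the generated code.
      string username  = 1;
      string email     = 2;
      int32  age       = 3;
      bool   is_active = 4;
    }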
In Thrift and Protocol Buffers, the tag numbers are what identify fields in the encoded data, so preserving them, keeping new fields optional (or giving them default values), and handling field type changes carefully are crucial for preserving backward and forward compatibility.
For example, changing an optional (single-valued) field into a repeated field in Protocol Buffers preserves backward compatibility (new code reading old data sees a list with zero or one elements) and forward compatibility (old code reading new data will not break; it simply sees only the last item in the list).
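Continuing the hypothetical schema above, that evolution keeps the tag number and changes only the field's cardinality:

    // v1 had: string email = 2;
    // v2 makes it multi-valued while keeping tag number 2:
    message User {
      string username  = 1;
      repeated string email = 2;
      int32  age       = 3;
      bool   is_active = 4;
    }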
AVRO
AVRO is the most compact of these formats, and we usually use it with big data. It can encode the same JSON above in only about 33 bytes. Like Thrift's CompactProtocol, it uses variable-length numbers, but it puts no field tags or type annotations in the encoded data at all: an encoded record is nothing more than the field values concatenated in the order the schema defines.
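A sketch using the third-party fastavro package; the schema travels separately, and the encoded record carries neither field names nor type tags:

    import io
    import fastavro  # pip install fastavro

    schema = fastavro.parse_schema({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "username",  "type": "string"},
            {"name": "email",     "type": "string"},
            {"name": "age",       "type": "int"},
            {"name": "is_active", "type": "boolean"},
        ],
    })

    user = {"username": "ahmed123", "email": "ahmed@example.com",
            "age": 27, "is_active": True}

    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, schema, user)
    print(len(buf.getvalue()))  # far smaller than the 81-byte JSON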
AVRO resolves the differences between the reader's schema and the writer's schema by matching fields by name, so parsing works correctly even if the order of the fields has changed.
Where does the reader get the writer's schema? At the beginning of a large file, the writer's schema can be embedded once. In a database, every record can be prefixed with a schema version number, with all schema versions stored in the database so the reader can look up the right one. For data sent over a network connection, the two processes can negotiate the schema during connection setup.
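For the database case, a common pattern (sketched here with hypothetical names, not any particular library's API) is a small registry keyed by version:

    import struct

    SCHEMAS = {1: "...v1 schema...", 2: "...v2 schema..."}  # hypothetical registry

    def encode_record(version: int, body: bytes) -> bytes:
        # 4-byte big-endian schema version, then the Avro-encoded body.
        return struct.pack(">I", version) + body

    def decode_record(record: bytes):
        version = struct.unpack_from(">I", record)[0]
        writer_schema = SCHEMAS[version]  # fetch the writer's schema
        return writer_schema, record[4:]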
Using default values also helps preserve compatibility.
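Continuing the fastavro sketch above, the reader's schema can reorder the fields and add a new one with a default, and decoding still works:

    import io
    import fastavro

    writer_schema = fastavro.parse_schema({
        "type": "record", "name": "User",
        "fields": [
            {"name": "username", "type": "string"},
            {"name": "age",      "type": "int"},
        ],
    })

    # The reader lists the fields in a different order and
    # adds an "email" field with a default value.
    reader_schema = fastavro.parse_schema({
        "type": "record", "name": "User",
        "fields": [
            {"name": "age",      "type": "int"},
            {"name": "username", "type": "string"},
            {"name": "email",    "type": "string", "default": ""},
        ],
    })

    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, writer_schema,
                               {"username": "ahmed123", "age": 27})
    buf.seek(0)

    # Fields are matched by name; the missing "email" gets its default.
    decoded = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
    assert decoded == {"age": 27, "username": "ahmed123", "email": ""}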
Code generation matters less in dynamically typed languages, but it pays off in statically typed languages: you get compile-time type checks, IDE autocompletion, and efficient in-memory structures for the decoded data.
In General, Binary Schema-Based Formats Have Nice Properties Over Textual Formats Like JSON:
- They are much more compact, since field names can be omitted from the encoded data.
- The schema is a valuable, always-up-to-date form of documentation, because decoding requires it.
- Keeping a database of schemas allows checking the forward and backward compatibility of changes before anything is deployed.
- Code generated from the schema provides compile-time type checking in statically typed languages.
Modes of Data Flow Between Different Processes
Dataflow via Databases
During rolling upgrades (and in general, because data in a database outlives code), we need both kinds of compatibility. Old data was written by old code using the old schema, and new code using the new schema must still read it without breaking (backward compatibility). Conversely, extra schema columns written by new code must not be lost when a record is read, decoded, and re-encoded by old code (forward compatibility). This can be handled using default values and through care at the application level.
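The application-level pitfall: old code that reads a record into a model object knowing only the old fields and then writes it back will silently drop the newer columns. A minimal dict-based sketch of the careful version:

    def update_age(record: dict, new_age: int) -> dict:
        # Old code knows only "username" and "age", but the row may
        # carry newer fields (e.g. "email"). Copy everything through,
        # changing only what we understand.
        updated = dict(record)  # unknown fields are preserved
        updated["age"] = new_age
        return updated

    row = {"username": "ahmed123", "age": 27, "email": "ahmed@example.com"}
    assert update_age(row, 28)["email"] == "ahmed@example.com"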
Dataflow Through Services: REST and RPC
Web Services
REST is a design philosophy that makes use of HTTP features like content negotiation, authentication, caching, and so on. We can use definition formats like OpenAPI (Swagger) to describe RESTful APIs.
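For example, a minimal OpenAPI description of a single (hypothetical) endpoint might look like:

    openapi: 3.0.3
    info:
      title: User Service   # illustrative service, not from the text
      version: "1.0"
    paths:
      /users/{id}:
        get:
          parameters:
            - name: id
              in: path
              required: true
              schema:
                type: string
          responses:
            "200":
              description: The requested user, encoded as JSON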
SOAP is an XML-based protocol that is usually carried over HTTP but deliberately avoids using most HTTP features. A SOAP web service is described using the Web Services Description Language (WSDL), which enables code generation so that clients can access the remote service through local classes and method calls. This is most useful in statically typed languages.
In the case of compatibility-breaking changes, the service provider ends up maintaining multiple versions of the service side by side (e.g., through API versioning).
RPC
RPC was supposed to make remote calls look identical to local function calls, but this is impossible: a network request can time out, be retried, or fail in ways a local call never does. RPC with binary encodings is still widely used for calls between services within the same organization (typically inside the same datacenter) because it is faster than JSON over REST. But REST is better for experimentation and debugging, since we can make calls from a browser or an HTTP client like Postman, with no generated code required.
Message-Passing Dataflow
Sending messages through a message broker, rather than calling services directly, has several benefits:
- The broker acts as a buffer when the recipient is unavailable or overloaded, improving reliability.
- It can redeliver messages to a process that has crashed, preventing messages from being lost.
- The sender does not need to know the IP address and port of the recipient.
- One message can be delivered to several recipients.
- The sender is decoupled from the recipient: it just publishes messages and does not care who consumes them.
Distributed actor frameworks (such as Akka, Orleans, and Erlang OTP) essentially integrate a message broker with the actor programming model.
Outstanding 💪