Using Sagas as a Pattern for Distributed transactions
Definition - What does a Distributed Transaction mean?
A distributed transaction includes one or more instructions that, individually or as a group, transact on two or more distinct nodes in a network.
Two phase Commit(2PC) in a Distributed System
In a 2PC the co-ordinator sends commit/abort message to each of its resources and waits for the acknowledgement response from all of the resources. The co-ordinator then completes/undoes a transaction if it receives acknowledgment response from all of its contributing resources.
Why 2PC is not a viable solution?
- Co-ordinator acts as a single point of failure.
- During transactions locks are imposed on resources which brings down the overall throughput of the system.
- In the worst case scenario, the retry attempts may lead to O(n^2) complexity
What is a Saga ?
"Long lived transactions (LLTs) hold on to database resources for relatively long periods of time, significantly delaying the termination of shorter and more common transactions. To alleviate these problems we propose the notion of a saga. A LLT is a saga if it can be written as a sequence of transactions that can be interleaved with other transactions. The database management system guarantees that either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution." -- Hector Garcaa-Molrna and Kenneth Salem , 1987
A large unit of functionality (T) can be thought of as a saga, which can be considered as a collection of sub transactions T1, T2,T3....Tn each independently running in its own context. Sub-transaction should not have a dependency on one another and they may or may not be ACID(Atomic,Consistent,Isolated,Durable) compliant. In a micro-service based architecture, a aggregator service(T)which is depending on the responses of other services (T1,T2,T3,...Tn) can be thought of as a saga comprising of sub resources/services.
Each transaction/request (Ti) in the saga has an equivalent compensating transaction/request (Ci). In other words, C1 semantically compensates the effect of T1. Ci effectively is an abort or rollback transaction/request which undoes the effect of Ti.
So in a typical scenario, execution of T1,T2,T3,...Tn means that the saga was successfully executed without any compensating actions. Or, there were compensating actions/requests Ci for every completed sub transactions Ti, in the event of an abort saga request.
How to execute a Saga ?
Each transaction/request in a saga is maintained by a Saga Execution Component(SEC). SEC is responsible for executing each of the sub transaction/requests and in the event of failure of any of the sub transaction/request(Ti), it can execute the corresponding compensating transaction/request(Ci). Typically, SEC can be a thought of as a state machine or event store managing events of each sub transactions/requests and compensating transactions/requests. So ideally, SEC should be running as a highly available, scalable process within its own separate context.
Logging saga events
Its important to have durable log streams of saga events which typically looks somewhat like this
Begin Saga:
Begin Ti:
End Ti
Begin Ci:
End Ci
End Saga
Backward Recovery
In backward recovery compensating transactions/requests try to restore the sub transactions/requests by nullifying it. So, basically compensating transactions/requests rollback the effect of the sub transactions/requests.
When to trigger Compensating requests ?
- Whenever there is an abort/error saga request from any of the services to SEC.
- Whenever any of the request fails due to unavailability/error.
- SEC itself fails , triggering compensating requests on tear down.
What if Compensating requests fail ?
Since we are executing transactions over a distributed network, we should take into consideration that even compensating requests may fail. For that, it is critical that compensating requests are idempotent (an operation that has no additional effect if it is called more than once with the same input parameters), which implicitly mandates that compensating requests should be stateless. Compensating requests should be retried until we get a success state.
Forward Recovery
In case of pure forward recovery there are no compensating transactions rather the saga retries by restarting it from the beginning in the event of error/abort saga. So ideally in a pure forward recovery saga, each transaction/request(Ti) has to be idempotent.
Backward and Forward Recovery
Backward and Forward recovery in a saga can can be designed with proper save points.
Let's says a saga executes transactions/requests in the following order T1, T2, followed by a save-point command, and then transaction/request T3. Then during the execution of transaction/request T4 the system crashes. When the system recovers then it must must first perform a backward recovery to the save-point (aborting T4 and running C3). After ensuring that the code for running T3, T4, is available, the SEC can restart the saga in forward recovery mode for T1,T2.
Sub transaction are not depending on t1, t2 , t3....tn then in case of rollback or backwards recovery it could have stale data ?