Taming Asynchrony: How Duva Solved a Node-to-Node Race Condition

In the world of concurrent programming, the Actor Model is a powerful paradigm. At its core, it simplifies concurrency by treating everything as an "actor." These actors have private state, don't share memory, and communicate only by sending asynchronous messages to each other. Here in the Duva project, we leverage this model heavily.
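To make the idea concrete, here is a minimal sketch of an actor using only Rust's standard library: a thread that owns its state privately and is reachable only through a message channel. This is an illustration of the pattern, not Duva's actual implementation; the names `Msg`, `spawn_counter`, and `run_counter` are invented for the example.

```rust
use std::sync::mpsc;
use std::thread;

// A message the counter actor understands. Actors communicate only
// through messages like this; no memory is shared.
enum Msg {
    Increment,
    // Carries a reply channel so the actor can answer asynchronously.
    Get(mpsc::Sender<u64>),
}

// Spawn an actor: a thread that owns its state privately and
// processes its mailbox one message at a time.
fn spawn_counter() -> mpsc::Sender<Msg> {
    let (tx, rx) = mpsc::channel::<Msg>();
    thread::spawn(move || {
        let mut count: u64 = 0; // private state, unreachable from outside
        for msg in rx {
            match msg {
                Msg::Increment => count += 1,
                Msg::Get(reply) => {
                    let _ = reply.send(count);
                }
            }
        }
    });
    tx
}

fn run_counter() -> u64 {
    let counter = spawn_counter();
    counter.send(Msg::Increment).unwrap();
    counter.send(Msg::Increment).unwrap();
    let (reply_tx, reply_rx) = mpsc::channel();
    counter.send(Msg::Get(reply_tx)).unwrap();
    reply_rx.recv().unwrap()
}

fn main() {
    println!("count = {}", run_counter());
}
```

Because only the actor's own thread can touch `count`, data races on that state are impossible by construction, which is the property the paragraph above describes.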


The Upside: Why We Use the Actor Model

The benefits are significant. Actors allow for massive parallelism and eliminate a whole class of shared-memory bugs, since an actor's internal state can't be touched directly from the outside. This encapsulation is what makes systems like Duva easier to reason about, even with many things happening at once.


The Downside: The Challenge of Asynchronous Coordination

However, the asynchronous, "fire-and-forget" nature of message passing can be a double-edged sword. Once an actor sends a message, it has no built-in guarantee about when, or even whether, the work completes, which makes it hard to ensure that multi-step operations happen in a specific, reliable order. Orchestrating such operations across different actors, especially actors on different network nodes, is a big challenge.

You can't just send a message and assume the work is done. This brings us to the specific race condition we solved.


The Race Condition

When a new node joins the cluster, it uses a Gossip protocol for peer discovery, which is only eventually consistent. The problem arises when certain operations must kick in only after this discovery is fully complete across the cluster.

"Rebalancing on join cluster" is a perfect example. Say, Node A joins the cluster via a seed node, B.

Node B is aware of A immediately, but what about Nodes C, D, or E? Due to the nature of distributed systems, Node C might not learn about Node A for a short period.

If a rebalancing operation is triggered instantly, an actor on Node C might receive a request to migrate data to Node A, a node it doesn't even know exists yet. This request would inevitably fail. This requires synchronization between the "cluster join" event and the "trigger rebalancing" action.
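The failure mode is easy to see in miniature. In this hypothetical sketch (the peer table, node names, and `migrate_to` function are all invented for illustration, not Duva's API), Node C is asked to migrate data to a peer it hasn't discovered yet:

```rust
use std::collections::HashMap;

// Attempt to migrate data to a target peer. In a real system this
// would open a connection and stream keys; here we only check whether
// the target is known to this node's peer table.
fn migrate_to(peers: &HashMap<&str, &str>, target: &str) -> Result<(), String> {
    match peers.get(target) {
        Some(_addr) => Ok(()),
        None => Err(format!("unknown peer: {target}")),
    }
}

fn main() {
    // Node C's view of the cluster before gossip has propagated:
    // it knows about Node B, but Node A's join hasn't reached it yet.
    let mut peers = HashMap::new();
    peers.insert("node-b", "10.0.0.2:6000");

    // A rebalance triggered too early asks C to ship data to node-a.
    let result = migrate_to(&peers, "node-a");
    println!("{result:?}");
}
```

The request fails not because anything is broken, but because the rebalance ran before discovery converged, exactly the ordering problem the callback chain below is designed to prevent.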


The Solution: A Callback Chain for Predictable Order

To solve this, we adopted an explicit coordination pattern: a chain of callbacks. This ensures that crucial steps happen in a guaranteed order, taming the unpredictable nature of asynchronous messages.

  1. Connect (Callback 1): The new node's actor initiates a connection. Instead of just assuming it works, it waits for a "connected" signal before proceeding. This is handled asynchronously in the background to avoid blocking.
  2. Synchronize (Callback 2): Upon receiving the "connected" signal, the actor doesn't start rebalancing yet. Instead, it triggers the next step: synchronizing its state with the cluster. It then waits for a "sync complete" signal.
  3. Rebalance (The Final Action): Only after the connection is confirmed and synchronization is complete does the final callback fire, safely triggering the data rebalancing.
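The three steps above can be sketched with standard-library channels standing in for the signals. This is a simplified model of the pattern, not Duva's actual code; `connect`, `synchronize`, and `join_and_rebalance` are illustrative names, and each `recv` plays the role of waiting for the corresponding callback.

```rust
use std::sync::mpsc;
use std::thread;

// Step 1: initiate the connection in the background and emit a
// "connected" signal when the handshake finishes.
fn connect(done: mpsc::Sender<&'static str>) {
    thread::spawn(move || {
        // ... perform handshake with the seed node ...
        let _ = done.send("connected");
    });
}

// Step 2: synchronize cluster state and emit "sync complete".
fn synchronize(done: mpsc::Sender<&'static str>) {
    thread::spawn(move || {
        // ... pull cluster state until caught up ...
        let _ = done.send("synced");
    });
}

// The callback chain: each step starts only after the previous
// step's signal has arrived, so the order is guaranteed.
fn join_and_rebalance() -> Vec<&'static str> {
    let mut order = Vec::new();

    // Callback 1: wait for the "connected" signal before proceeding.
    let (tx, rx) = mpsc::channel();
    connect(tx);
    order.push(rx.recv().unwrap());

    // Callback 2: wait for "sync complete" before rebalancing.
    let (tx, rx) = mpsc::channel();
    synchronize(tx);
    order.push(rx.recv().unwrap());

    // Final action: only now is it safe to rebalance.
    order.push("rebalanced");
    order
}

fn main() {
    println!("{:?}", join_and_rebalance());
}
```

However the background threads are scheduled, `rebalanced` can never appear before `connected` and `synced`, which is the whole point of chaining the callbacks.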

By creating this explicit sequence, we ensure that rebalancing only happens when the entire system is ready for it. This small change makes Duva's scaling behavior more robust and predictable, turning a potential weakness of the Actor Model into a well-managed, reliable process.
