Taming Asynchrony: How Duva Solved a Node-to-Node Race Condition

In the world of concurrent programming, the Actor Model is a powerful paradigm. At its core, it simplifies concurrency by treating everything as an "actor." These actors have private state, don't share memory, and communicate only by sending asynchronous messages to each other. Here in the Duva project, we leverage this model heavily.
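To make the idea concrete, here is a minimal sketch of an actor using only Rust's standard library: a thread that owns its state privately and is reachable only through a message channel. This is an illustration of the pattern, not Duva's actual implementation; the names `Msg`, `spawn_counter`, and `run_counter` are invented for the example.

```rust
use std::sync::mpsc;
use std::thread;

// A message the counter actor understands. Actors communicate only
// through messages like this; no memory is shared.
enum Msg {
    Increment,
    // Carries a reply channel so the actor can answer asynchronously.
    Get(mpsc::Sender<u64>),
}

// Spawn an actor: a thread that owns its state privately and
// processes its mailbox one message at a time.
fn spawn_counter() -> mpsc::Sender<Msg> {
    let (tx, rx) = mpsc::channel::<Msg>();
    thread::spawn(move || {
        let mut count: u64 = 0; // private state, unreachable from outside
        for msg in rx {
            match msg {
                Msg::Increment => count += 1,
                Msg::Get(reply) => {
                    let _ = reply.send(count);
                }
            }
        }
    });
    tx
}

fn run_counter() -> u64 {
    let counter = spawn_counter();
    counter.send(Msg::Increment).unwrap();
    counter.send(Msg::Increment).unwrap();
    let (reply_tx, reply_rx) = mpsc::channel();
    counter.send(Msg::Get(reply_tx)).unwrap();
    reply_rx.recv().unwrap()
}

fn main() {
    println!("count = {}", run_counter());
}
```

Because only the actor's own thread can touch `count`, data races on that state are impossible by construction, which is the property the paragraph above describes.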


The Upside: Why We Use the Actor Model

The benefits are significant. Actors allow for massive parallelism and eliminate a whole class of shared-memory bugs, since an actor's internal state can't be touched directly from the outside. This encapsulation is what makes systems like Duva easier to reason about, even with many things happening at once.


The Downside: The Challenge of Asynchronous Coordination

However, the asynchronous, "fire-and-forget" nature of message passing can be a double-edged sword. Once an actor sends a message, it has no built-in guarantee about when, or even whether, the work completes, which makes it hard to ensure that multi-step operations happen in a specific, reliable order. Orchestrating such operations across different actors, especially actors on different network nodes, is a big challenge.

You can't just send a message and assume the work is done. This brings us to the specific race condition we solved.


The Race Condition

When a new node joins the cluster, it uses a Gossip protocol for peer discovery, which is only eventually consistent. The problem arises when certain operations must kick in only after this discovery is fully complete across the cluster.

"Rebalancing on join cluster" is a perfect example. Say, Node A joins the cluster via a seed node, B.

Node B is aware of A immediately, but what about Nodes C, D, or E? Due to the nature of distributed systems, Node C might not learn about Node A for a short period.

If a rebalancing operation is triggered instantly, an actor on Node C might receive a request to migrate data to Node A, a node it doesn't even know exists yet. This request would inevitably fail. This requires synchronization between the "cluster join" event and the "trigger rebalancing" action.
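The failure mode is easy to see in miniature. In this hypothetical sketch (the peer table, node names, and `migrate_to` function are all invented for illustration, not Duva's API), Node C is asked to migrate data to a peer it hasn't discovered yet:

```rust
use std::collections::HashMap;

// Attempt to migrate data to a target peer. In a real system this
// would open a connection and stream keys; here we only check whether
// the target is known to this node's peer table.
fn migrate_to(peers: &HashMap<&str, &str>, target: &str) -> Result<(), String> {
    match peers.get(target) {
        Some(_addr) => Ok(()),
        None => Err(format!("unknown peer: {target}")),
    }
}

fn main() {
    // Node C's view of the cluster before gossip has propagated:
    // it knows about Node B, but Node A's join hasn't reached it yet.
    let mut peers = HashMap::new();
    peers.insert("node-b", "10.0.0.2:6000");

    // A rebalance triggered too early asks C to ship data to node-a.
    let result = migrate_to(&peers, "node-a");
    println!("{result:?}");
}
```

The request fails not because anything is broken, but because the rebalance ran before discovery converged, exactly the ordering problem the callback chain below is designed to prevent.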


The Solution: A Callback Chain for Predictable Order

To solve this, we adopted an explicit coordination pattern: a chain of callbacks. This ensures that crucial steps happen in a guaranteed order, taming the unpredictable nature of asynchronous messages.

  1. Connect (Callback 1): The new node's actor initiates a connection. Instead of just assuming it works, it waits for a "connected" signal before proceeding. This is handled asynchronously in the background to avoid blocking.
  2. Synchronize (Callback 2): Upon receiving the "connected" signal, the actor doesn't start rebalancing yet. Instead, it triggers the next step: synchronizing its state with the cluster. It then waits for a "sync complete" signal.
  3. Rebalance (The Final Action): Only after the connection is confirmed and synchronization is complete does the final callback fire, safely triggering the data rebalancing.
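The three steps above can be sketched with standard-library channels standing in for the signals. This is a simplified model of the pattern, not Duva's actual code; `connect`, `synchronize`, and `join_and_rebalance` are illustrative names, and each `recv` plays the role of waiting for the corresponding callback.

```rust
use std::sync::mpsc;
use std::thread;

// Step 1: initiate the connection in the background and emit a
// "connected" signal when the handshake finishes.
fn connect(done: mpsc::Sender<&'static str>) {
    thread::spawn(move || {
        // ... perform handshake with the seed node ...
        let _ = done.send("connected");
    });
}

// Step 2: synchronize cluster state and emit "sync complete".
fn synchronize(done: mpsc::Sender<&'static str>) {
    thread::spawn(move || {
        // ... pull cluster state until caught up ...
        let _ = done.send("synced");
    });
}

// The callback chain: each step starts only after the previous
// step's signal has arrived, so the order is guaranteed.
fn join_and_rebalance() -> Vec<&'static str> {
    let mut order = Vec::new();

    // Callback 1: wait for the "connected" signal before proceeding.
    let (tx, rx) = mpsc::channel();
    connect(tx);
    order.push(rx.recv().unwrap());

    // Callback 2: wait for "sync complete" before rebalancing.
    let (tx, rx) = mpsc::channel();
    synchronize(tx);
    order.push(rx.recv().unwrap());

    // Final action: only now is it safe to rebalance.
    order.push("rebalanced");
    order
}

fn main() {
    println!("{:?}", join_and_rebalance());
}
```

However the background threads are scheduled, `rebalanced` can never appear before `connected` and `synced`, which is the whole point of chaining the callbacks.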

By creating this explicit sequence, we ensure that rebalancing only happens when the entire system is ready for it. This small change makes Duva's scaling behavior more robust and predictable, turning a potential weakness of the Actor Model into a well-managed, reliable process.
