Understanding Idempotency in Distributed Systems
Explore top LinkedIn content from expert professionals.
Summary
Understanding idempotency in distributed systems means ensuring that repeating an action—like retrying a payment or rerunning a data pipeline—won’t cause duplicate results or errors. Simply put, an operation is idempotent if doing it once or many times has exactly the same effect, which is crucial for reliable payments and data processing.
- Use unique keys: Always generate a unique transaction or request key so that the system can tell if an action has already been processed and avoid double charges or data duplication.
- Store outcomes safely: Save the result of each transaction or operation in a durable database, not just temporary memory, to make sure you can recover the correct outcome even if something fails or restarts.
- Pick the right pattern: Choose a writing method—like merge, partition overwrite, or atomic swap—that matches your system’s needs and prevents duplicate entries when jobs are retried or backfilled.
-
I’ve used this exact scenario to explain idempotency to more than 30 of my juniors in technical discussions. Every single time it clicks instantly, and the understanding never leaves them, which is a reward for me as a senior engineer. So let’s break it down.

You’ve filled your cart during the Flipkart/Amazon Great Indian Festival Sale. Laptop, headphones, gifts for the family, a solid ₹60,000 haul. You click "Place Order," enter your UPI PIN/CVV, and hit Pay. ...and then, thanks to classic Indian broadband, the internet disconnects. The page spins forever. No confirmation screen. Now you’re wondering: "Did the payment go through? Did I just lose ₹60,000? Should I try again? What if I get charged twice?!" But you check your bank SMS or UPI app. One deduction. Only one. You refresh the page later, and there’s your order, confirmed.

Here’s what happens behind the scenes when you click that button:

1. The unique "receipt" (idempotency key): The moment you click "Place Order," a unique key is generated, often a UUID (e.g., diwali_sale_<your_user_id>_<random_number>). This is your idempotency key. Think of it as a unique transaction receipt number. It is attached to every payment request sent to the backend.

2. The first payment request: Your request says, "Hey Flipkart backend, please charge me ₹60,000. Here’s my receipt number: diwali_sale_123_xyz." The backend processes it: charges your card/UPI, creates an order, and, most importantly, stores the result ("Success, Order ID: OD123") linked to that exact receipt number diwali_sale_123_xyz.

3. The retry: The network fails. Your app/browser doesn’t get a response. So what does it do? The logical thing: it retries. It sends the exact same request again, with the same receipt number diwali_sale_123_xyz.

4. The backend check: This is where the magic happens. The backend doesn’t blindly process the payment again. It checks its database: "Have I seen receipt number diwali_sale_123_xyz before?" If YES, it knows this is a duplicate request. It does NOT call the payment gateway again; it simply fetches the result of the first successful request ("Success, Order ID: OD123") and sends it back to you. If NO, it processes it as a new, unique request.

This simple check guarantees that one unique key = one financial transaction. Always. No matter how many times you retry.

➤ Why is this a big deal at scale? Think about the Diwali sale traffic:
– Millions of users hit "Place Order" in the same second.
– Unreliable mobile networks across India cause countless timeouts and retries.
– Payment gateways (like Razorpay, BillDesk, PayU) are under extreme load, responding slowly.
Without idempotency, this would be chaos: double charges, triple charges, angry customers, and a PR nightmare.
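A minimal sketch of that backend check in Python, with SQLite standing in for the order store; the endpoint shape, table, and gateway helper are illustrative, not Flipkart's actual implementation:

```python
# Hedged sketch: one unique idempotency key maps to at most one payment.
import json
import sqlite3

conn = sqlite3.connect("orders.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS payment_results (
        idempotency_key TEXT PRIMARY KEY,   -- the "receipt number"
        response        TEXT NOT NULL       -- stored result, replayed on retries
    )
""")

def place_order(idempotency_key: str, user_id: str, amount_inr: int) -> dict:
    # 1. Have we seen this receipt number before?
    row = conn.execute(
        "SELECT response FROM payment_results WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    if row:
        return json.loads(row[0])            # duplicate retry -> replay stored result

    # 2. New request: charge once, then store the outcome under the same key.
    order_id = charge_and_create_order(user_id, amount_inr)   # hypothetical gateway call
    response = {"status": "Success", "order_id": order_id}
    with conn:                                # commit the result atomically
        conn.execute(
            "INSERT INTO payment_results (idempotency_key, response) VALUES (?, ?)",
            (idempotency_key, json.dumps(response)),
        )
    return response

def charge_and_create_order(user_id: str, amount_inr: int) -> str:
    return "OD123"                            # stand-in for the real payment + order flow
```

Calling place_order("diwali_sale_123_xyz", ...) a second, third, or hundredth time replays the stored response; the gateway is only ever called on the first attempt.
-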
Most teams mess up idempotency because they treat it like a caching problem. It’s not. It’s a storage problem. Diego built a payments endpoint handling 5k charge requests/sec. Clients retry for minutes... sometimes hours. If he gets idempotency wrong, users get double-charged. Game over. Most engineers try something like: “Just throw the key in Redis with TTL.” Fast. Simple. And wrong. Because the second Redis evicts the key, or a TTL expires, or a node fails, you don’t just lose the key... you lose guarantees. Payments can’t live on “best effort.” Here’s the real play: Idempotency belongs in the source of truth. The database. Use a unique request_id. Do a conditional insert. If it’s new → process and save the result. If it conflicts → return the stored one. Atomically. Durably. Forever. No locks. No race conditions. No cache guessing. No “hope it’s still in memory.” It scales. It survives region failovers. It handles endless retries without breaking a sweat. Because when money moves, you don’t bet on a TTL. You bet on a constraint.
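A minimal sketch of that play, assuming a SQL store with a unique constraint on request_id (SQLite here for brevity; the gateway call is hypothetical):

```python
# Hedged sketch: the idempotency guarantee lives in a database constraint, not a cache.
import json
import sqlite3

conn = sqlite3.connect("charges.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS charges (
        request_id TEXT PRIMARY KEY,   -- uniqueness enforced by the source of truth
        result     TEXT                -- NULL while the first attempt is in flight
    )
""")

def charge(request_id: str, amount_cents: int) -> dict:
    with conn:
        # Conditional insert: exactly one attempt per request_id can claim this row.
        claimed = conn.execute(
            "INSERT INTO charges (request_id) VALUES (?) ON CONFLICT(request_id) DO NOTHING",
            (request_id,),
        ).rowcount == 1

    if not claimed:
        # Duplicate retry: return the stored outcome instead of charging again.
        row = conn.execute(
            "SELECT result FROM charges WHERE request_id = ?", (request_id,)
        ).fetchone()
        return json.loads(row[0]) if row[0] else {"status": "processing"}

    result = call_payment_gateway(amount_cents)   # hypothetical external call
    with conn:
        conn.execute("UPDATE charges SET result = ? WHERE request_id = ?",
                     (json.dumps(result), request_id))
    # (A production version would also expire or retry rows stuck at "processing".)
    return result

def call_payment_gateway(amount_cents: int) -> dict:
    return {"status": "succeeded", "amount_cents": amount_cents}
```

Because the claim is a single constrained insert in the source of truth, two racing retries cannot both charge, and there is no TTL to expire.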
-
My senior reviewed my first production pipeline in Databricks. He ran it twice on the same data. Row count doubled. He looked at me and said: "Your pipeline is not idempotent." I nodded like I knew what that meant. I had no idea. Here's what it means — and the 3 patterns I now use in every single pipeline.

What is idempotency? A pipeline is idempotent if running it twice produces the same result as running it once.

Why does this matter?
→ Databricks jobs get retried on failure
→ You backfill historical data
→ Someone reruns a job manually after a bug fix
→ ADF pipeline retries on transient errors
If your pipeline is not idempotent, every retry duplicates data. Your downstream dashboards show wrong numbers. Your Delta table grows with phantom rows. And nobody knows when it started.

The ticking time bomb:
df.write.format("delta").mode("append").save(path)
This is the most common pattern I see in junior pipelines. And it is a ticking time bomb. First run: 100 rows written. Correct. Pipeline fails halfway. Databricks retries. Second run: 100 rows written again. Now you have 200 rows. Duplicates everywhere.

Pattern 1 — Delta MERGE (upsert)
Best for: slowly changing data, fact tables with updates
MERGE INTO silver_customers AS target
USING new_data AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
Run it 10 times. Same result every time. Existing rows updated. New rows inserted. No duplicates. Ever.

Pattern 2 — Partition Overwrite
Best for: partitioned tables, daily batch loads
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(df.write.format("delta")
    .mode("overwrite")
    .option("partitionOverwriteMode", "dynamic")
    .partitionBy("load_date")
    .save(path))
Only the partitions present in df get overwritten. Rerun for 2024-01-15? It replaces that partition only. Other dates untouched. Idempotent by design.

Pattern 3 — High-Watermark Tracking
Best for: incremental loads from source systems
Store the last processed timestamp in a Delta control table. Each run reads only records newer than the watermark. After writing, update the watermark (see the sketch after this post).
from pyspark.sql.functions import col

watermark = spark.sql(
    "SELECT max(last_processed) FROM control.watermarks "
    "WHERE table_name = 'sales'").collect()[0][0]
new_data = source_df.filter(col("updated_at") > watermark)
Run it twice on the same schedule? Second run finds zero new records. Writes nothing. Idempotent.

The decision rule:
Data has a natural unique key? → MERGE
Partitioned by date, daily batch? → Partition Overwrite
Incremental from source? → High-Watermark
Never use plain append for production pipelines. Not without a dedup step before it.
Drop it below 👇
#Databricks #C2C #C2H #DeltaLake #DataEngineering #ETL #ApacheSpark #Azure #PySpark #DataQuality #DataPipelines #Idempotency
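One piece Pattern 3 leaves implicit is advancing the watermark after a successful write. A minimal sketch, reusing spark and new_data from the snippet above and assuming the control.watermarks table has the columns shown in the post:

```python
# Sketch of the watermark-update step for Pattern 3; run it only after the write
# to the target table has succeeded, so a failed run never advances the watermark.
from pyspark.sql import functions as F

new_max = new_data.agg(F.max("updated_at").alias("mx")).collect()[0]["mx"]

if new_max is not None:   # zero new records -> leave the watermark where it is
    spark.sql(
        "UPDATE control.watermarks "
        f"SET last_processed = '{new_max}' "
        "WHERE table_name = 'sales'"
    )
```

If the run wrote nothing, the watermark stays put and the next run simply picks up from the same point.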
-
Why At-Least-Once is a Trap
We spend hours designing for failure. But the most dangerous failure mode in a distributed system isn't a crash; it's a timeout. We know networks are unreliable. When a request times out, you face two possibilities:
- Did the request fail to reach the server?
- Did the server process it, but the response got lost?
If you assume #1 and retry, but it was actually #2, you execute the operation twice. By default, robust systems retry on failure. This guarantees the message is delivered at least once. But for non-idempotent operations (like POST /charge), at-least-once easily becomes charged twice.
The Solution: Idempotency
To fix this, we decouple the intent from the execution, a pattern standard in Stripe and Kafka.
- The client generates a random unique ID for the intent: Idempotency-Key: 550abc8400....
The server logic:
- Scope: The server combines (User_ID + Idempotency_Key) to ensure uniqueness.
- Check: "Have I seen this pair in my store?"
- If YES: Stop. Do not charge. Return the previous success response.
- If NO: Charge the card. Save the key + result atomically.
Now the client can retry safely. The card is charged exactly once.
Idempotency transforms "at-least-once" into "exactly-once," but you turn a stateless API into a stateful one. You must store every key and its response payload. If two requests with the same key arrive at the exact same millisecond, you need row locking or INSERT ON CONFLICT to prevent a race condition. Most systems expire idempotency keys after ~24 hours, so a retry at hour 25 risks a double charge. If you use UUID v7, the key itself encodes time, so you can reject obviously stale retries before hitting the database — an optimization, not a replacement.
Idempotency isn’t about performance. It’s about correctness in a world where timeouts lie.
#HardEngineering #DistributedSystems #Idempotency #SystemDesign #DDIA #API
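The client side of this contract matters just as much: the key is generated once per intent and reused verbatim on every retry. A minimal sketch (endpoint and payload are hypothetical):

```python
# Hedged client-side sketch: one key per *intent*, reused across retries so the
# server can deduplicate. The URL and payload are placeholders.
import uuid
import requests

idempotency_key = str(uuid.uuid4())      # generated once, before the first attempt

for attempt in range(5):
    try:
        resp = requests.post(
            "https://api.example.com/charge",            # hypothetical endpoint
            json={"amount_cents": 6_000_000},
            headers={"Idempotency-Key": idempotency_key},
            timeout=5,
        )
        if resp.ok:
            break                                        # charged exactly once
    except requests.exceptions.RequestException:
        pass                                             # timeout or network error -> retry with the SAME key
```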
-
𝗠𝗼𝘀𝘁 𝘁𝗲𝗮𝗺𝘀 𝘀𝗮𝘆 𝘁𝗵𝗲𝗶𝗿 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲𝘀 𝗮𝗿𝗲 𝗶𝗱𝗲𝗺𝗽𝗼𝘁𝗲𝗻𝘁. 𝗙𝗲𝘄 𝗰𝗮𝗻 𝗲𝘅𝗽𝗹𝗮𝗶𝗻 𝘄𝗵𝗶𝗰𝗵 𝘄𝗿𝗶𝘁𝗲 𝗽𝗮𝘁𝘁𝗲𝗿𝗻 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗺𝗮𝗸𝗲𝘀 𝘁𝗵𝗲𝗺 𝘀𝗮𝗳𝗲. "Make it idempotent" is easy to say. The real question is: which write pattern makes it safe? 𝗛𝗲𝗿𝗲 𝗮𝗿𝗲 𝟰 𝗽𝗮𝘁𝘁𝗲𝗿𝗻𝘀 - 𝗲𝗮𝗰𝗵 𝘄𝗶𝘁𝗵 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝘁 𝘁𝗿𝗮𝗱𝗲-𝗼𝗳𝗳𝘀: → 𝗗𝗘𝗟𝗘𝗧𝗘 + 𝗜𝗡𝗦𝗘𝗥𝗧 Drop the partition, reload it. Simple and deterministic. Works well for daily batch loads where the window is clear. Risk: brief gap between delete and insert. → 𝗠𝗘𝗥𝗚𝗘 / 𝗨𝗣𝗦𝗘𝗥𝗧 Match on a business key. Insert new, update existing. Handles late-arriving data well. Risk: requires a stable, unique key - and that key isn't always obvious. → 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗢𝘃𝗲𝗿𝘄𝗿𝗶𝘁𝗲 Replace an entire partition atomically. No gap, no partial state. Common in Spark and lakehouse architectures. Risk: coarse granularity - you rewrite everything in that window. → 𝗔𝘁𝗼𝗺𝗶𝗰 𝗦𝘄𝗮𝗽 (𝗦𝗵𝗮𝗱𝗼𝘄 + 𝗦𝘄𝗮𝗽) Write to a staging table, validate, then swap. Zero downtime. Risk: requires platform support and adds operational steps. In systems I've worked with, defaulting to one pattern everywhere usually backfires — what works for a dimension table often breaks on high-volume fact tables. 𝗧𝗵𝗲 𝗱𝗲𝗰𝗶𝘀𝗶𝗼𝗻 𝗳𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸: Batch with clear windows → DELETE + INSERT Late arrivals or CDC → MERGE Lakehouse with partitioned tables → Partition Overwrite High-stakes tables with zero-downtime needs → Atomic Swap Landing zones should absorb duplicates freely. Serving layers - dashboards, feature stores, exports - should not. Idempotency isn't one pattern. It's choosing the right pattern for your constraints - and drawing a clear boundary between where replay is safe and where it isn't. Which write pattern does your team default to? ♻️ Repost to help others ➕ Follow Arunkumar Palanisamy for data engineering and integration insights
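For the atomic swap, a rough sketch of the shadow-and-swap sequence, using SQLite only because its DDL is transactional; the table names and the validation rule are illustrative, and real warehouses expose the swap differently (the post's "requires platform support" caveat):

```python
# Hedged sketch of shadow + swap: load into a staging table, validate it, then
# swap names in a single transaction so readers never see a partial state.
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS serving (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE IF NOT EXISTS serving_staging (id INTEGER PRIMARY KEY, amount REAL);
""")

def publish_staging() -> None:
    # Validate the shadow table before exposing it (rule is illustrative).
    (count,) = conn.execute("SELECT COUNT(*) FROM serving_staging").fetchone()
    if count == 0:
        raise ValueError("staging table is empty; refusing to swap")

    with conn:  # the renames commit together or not at all
        conn.execute("ALTER TABLE serving RENAME TO serving_old")
        conn.execute("ALTER TABLE serving_staging RENAME TO serving")
        conn.execute("DROP TABLE serving_old")
```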
-
Everyone loves to say: “Kafka gives you at-least-once delivery. No data loss. Reliable by design.” That’s true. But here’s the catch 👉 “at-least-once” almost always means more than once. 🔂 What really happens in Kafka - Kafka delivers a payment event to the consumer - Consumer charges the user’s card - Before committing the offset, consumer crashes - Kafka resends the same event on restart - Consumer charges the card again 💥 Kafka kept its contract. Our consumer logic didn’t. 🧨 The Dangerous Assumption Most teams assume “Kafka = reliable”. But Kafka only promises delivery semantics, not business semantics. Without idempotency, Kafka becomes a duplicate amplifier. ✅ How to build safely on Kafka Idempotency keys → Store a unique event ID; skip if already processed Transactional outbox → Atomically write DB changes + Kafka offset Exactly-once processing (EOS) → Use Kafka’s enable.idempotence & transactions (but understand the tradeoffs) 🔑 The Real Lesson Kafka doesn’t guarantee your payments won’t double. Kafka guarantees your messages won’t disappear. Exactly-once delivery is extremely hard to achieve in a distributed system. Idempotency is the only reality.
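A minimal consumer-side sketch of the idempotency-key option, with Kafka plumbing omitted and table names illustrative: the event id and the resulting ledger row are written in one transaction, so a redelivered event becomes a no-op:

```python
# Hedged sketch of an idempotent consumer: claim the event id and record the
# debit in ONE transaction; a duplicate delivery finds the id and skips.
import sqlite3

conn = sqlite3.connect("payments.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE IF NOT EXISTS ledger (event_id TEXT, user_id TEXT, amount_cents INTEGER);
""")

def handle_payment_event(event_id: str, user_id: str, amount_cents: int) -> None:
    with conn:  # atomic: either both rows land, or neither
        claimed = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)", (event_id,)
        ).rowcount == 1
        if not claimed:
            return   # duplicate delivery -> no second debit
        conn.execute(
            "INSERT INTO ledger (event_id, user_id, amount_cents) VALUES (?, ?, ?)",
            (event_id, user_id, amount_cents),
        )
```

If the side effect is an external gateway call rather than a local write, pass the same event id to the gateway as its idempotency key; the local table alone cannot make the external call exactly-once.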
-
A common question I get is: "How to implement exactly-once execution?" The honest answer: it's impossible. However, with 𝐢𝐝𝐞𝐦𝐩𝐨𝐭𝐞𝐧𝐜𝐲, you can achieve it effectively. Idempotency is a powerful concept: many developers need it, but it is hard to get right. It means you can run an operation multiple times and get the same outcome as running it once. Combine 𝐢𝐝𝐞𝐦𝐩𝐨𝐭𝐞𝐧𝐜𝐲 with 𝐫𝐞𝐭𝐫𝐢𝐞𝐬, and you achieve effectively exactly once execution. 1️⃣ When working with external systems, you must rely on them to provide idempotency. Many APIs, like Stripe, help by providing idempotency guarantees. For example, attaching an 𝘐𝘥𝘦𝘮𝘱𝘰𝘵𝘦𝘯𝘤𝘺𝘒𝘦𝘺 to your API calls ensures Stripe can safely retry without repeating actions like charging a customer twice. 2️⃣ How do you build idempotency into your own API? It starts with a unique key to track whether an operation has already been completed. Here’s the magic: 𝐝𝐚𝐭𝐚𝐛𝐚𝐬𝐞𝐬 𝐦𝐚𝐤𝐞 𝐭𝐡𝐢𝐬 𝐬𝐢𝐦𝐩𝐥𝐞. Thanks to ACID properties, transactions are “all or nothing.” Either everything completes, or it rolls back. To implement this, use an idempotency key to check if the operation has already succeeded. During your transaction, store the key alongside the result. Next time, just check if the key exists. Since it’s all part of one atomic transaction, you’re guaranteed to avoid duplicates or partial states. This is how DBOS.transaction automatically adds idempotency to your transactional functions. It guarantees your functions execute exactly once—no matter how many retries or interruptions occur. Whether you’re handling inventory updates, reservations, or payments, DBOS ensures that your database operations generate side effects once and only once.
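A toy illustration of that transaction pattern (not DBOS's actual API): the key lookup, the operation's result, and the key insert share one ACID transaction, so retries replay the stored outcome:

```python
# Toy sketch: wrap a function so repeated calls with the same idempotency key
# return the stored result instead of re-running the operation.
import functools
import json
import sqlite3

conn = sqlite3.connect("idempotency.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS completed_ops (op_key TEXT PRIMARY KEY, result TEXT)"
)

def idempotent(fn):
    @functools.wraps(fn)
    def wrapper(op_key, *args, **kwargs):
        with conn:  # single transaction: either the result is stored, or nothing happened
            row = conn.execute(
                "SELECT result FROM completed_ops WHERE op_key = ?", (op_key,)
            ).fetchone()
            if row:
                return json.loads(row[0])          # already done -> replay stored outcome
            result = fn(*args, **kwargs)
            conn.execute(
                "INSERT INTO completed_ops (op_key, result) VALUES (?, ?)",
                (op_key, json.dumps(result)),
            )
            return result
    return wrapper

@idempotent
def reserve_inventory(item_id: str, qty: int) -> dict:   # hypothetical business operation
    return {"item_id": item_id, "reserved": qty}
```

A concurrent duplicate surfaces as a unique-constraint error, which the caller can resolve by re-reading the stored result.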
-
How do you process events exactly-once? Well, the problem is that it's impossible: if you send a message and don't get an answer, you have no way of knowing whether the receiver is offline or just slow, so eventually you have no choice but to send the message again if you want it processed. So if exactly-once is impossible, how close can you get in practice?
First, you need at-least-once delivery. If a message fails to deliver, the sender needs to retry it. That way, messages are eventually delivered as long as nothing fails permanently.
Second, you need idempotence. Your event processing logic must be safe to invoke multiple times on the same event, in case an event is delivered multiple times.
Third, you need to deal with timeouts. If your event processing code is long-running, you can't run it synchronously when you receive a message or the sender will time out on you. You need to acknowledge the message, then process the event in the background.
Fourth, you need durability. If you're processing an event asynchronously and something fails, you need to be able to recover it or the event will be lost forever (you already acknowledged the event, so it won't be re-delivered).
Put all four properties together, and you have a good approximation of exactly-once processing: you can properly handle failed deliveries, failed processing, duplicates, and timeouts to process each event exactly once. But it's not easy to do yourself!
Under the hood, this is exactly how DBOS event receivers (like for Kafka) work, automatically handling the complexity of exactly-once processing. They generate a unique key from an event (for example, from a Kafka topic + partition + offset) and use it as an idempotency key for an asynchronous event processing workflow. That way, even when duplicates arrive and failures happen, your events are handled correctly.
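A tiny sketch of that key derivation (the topic, partition, and offset values are made up):

```python
# Hedged sketch: a deterministic idempotency key from the message's coordinates,
# so every redelivery of the same record maps to the same key.
def kafka_event_key(topic: str, partition: int, offset: int) -> str:
    return f"{topic}:{partition}:{offset}"

key = kafka_event_key("payments", 3, 1_042_117)   # e.g. guards the async processing workflow
```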
-
Day 8 – Designing idempotent Oracle Cloud ERP integrations (so retries aren’t scary) Between Fusion business events, ERP Integrations callbacks, and OIC retries/resubmits, most Oracle Cloud ERP landscapes are effectively “at-least-once” delivery. If we don’t design for that, we end up with duplicate suppliers, double-posted journals, or orders pushed twice to downstream systems. The way I approach it is to treat idempotency as a first-class design concern: same message + same key = same outcome, no side effects. 1️⃣ Understand the platform behaviour • Fusion / ERP Integrations and ESS jobs can be retried or re-submitted. • OIC provides error views and resubmit, and many teams also add scheduled retries. • Business events are not guaranteed “exactly once” – duplicates are possible, especially under retries or failover. 2️⃣ Pick a clear idempotency key • Usually: business identifier + source system [+ version/timestamp] • Example: SupplierNumber + Source + ObjectVersionNumber • Every message carries that key end-to-end so we can make a simple “seen this before?” decision. 3️⃣ Keep a small state store • I prefer a dedicated table in ATP/ADB with: key, last processed version, status, timestamp. • On each event/API call: • If the version is newer, we process and update the record. • If it’s the same or older, we treat it as a duplicate and no-op. 4️⃣ Make downstream updates safe to repeat • Where we control the target, we design “upsert” style updates or merges. • Where we don’t, we rely on the idempotency check and simply never call the target for duplicates. The question I ask in design reviews now is very simple: “What happens if this message is delivered twice?” If the answer is anything other than “nothing bad”, we keep designing. How are you handling idempotency and duplicates around Oracle Cloud ERP and OIC today? — Aman Khurana Senior Solution Architect – Oracle Cloud ERP & Integration #OracleCloudERP #OracleIntegration #OIC #BusinessEvents #FBDI #RESTAPI #SolutionArchitecture #Idempotency
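A small sketch of steps 2–3, the "seen this before?" decision against a state store; the key shape follows the post (business id + source), while the table and helpers are illustrative rather than ATP/OIC-specific:

```python
# Hedged sketch: keep the last processed version per business object and no-op
# on duplicates or stale versions.
import sqlite3

conn = sqlite3.connect("integration_state.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS processed_objects (
        object_key   TEXT PRIMARY KEY,    -- e.g. "SUP-1001|LEGACY_ERP"
        last_version INTEGER NOT NULL
    )
""")

def is_new_version(object_key: str, version: int) -> bool:
    row = conn.execute(
        "SELECT last_version FROM processed_objects WHERE object_key = ?",
        (object_key,),
    ).fetchone()
    return row is None or version > row[0]

def mark_processed(object_key: str, version: int) -> None:
    with conn:   # call only after the downstream update succeeded
        conn.execute(
            "INSERT INTO processed_objects (object_key, last_version) VALUES (?, ?) "
            "ON CONFLICT(object_key) DO UPDATE SET last_version = excluded.last_version",
            (object_key, version),
        )
```

A handler then becomes: skip when is_new_version() returns False; otherwise call the target and, only after it succeeds, call mark_processed().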
-
A producer sends a message to Kafka. The broker writes it successfully. But the acknowledgement never reaches the producer because of a network issue. So the producer retries. Now the same message could be written twice. In distributed systems, retries are normal - but duplicate events can break downstream systems. So how does Apache Kafka support 𝗘𝘅𝗮𝗰𝘁𝗹𝘆-𝗢𝗻𝗰𝗲 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴? 🔹 𝗧𝗵𝗲 𝗣𝗿𝗼𝗯𝗹𝗲𝗺: 𝗗𝘂𝗽𝗹𝗶𝗰𝗮𝘁𝗲 𝗠𝗲𝘀𝘀𝗮𝗴𝗲𝘀 Imagine this scenario: • Producer sends a message • Broker writes it successfully • Acknowledgement is lost due to a network issue • Producer retries Now the same message could be written twice. Without safeguards, consumers would process duplicate events. 🔹 𝗜𝗱𝗲𝗺𝗽𝗼𝘁𝗲𝗻𝘁 𝗣𝗿𝗼𝗱𝘂𝗰𝗲𝗿𝘀 Kafka solves the first part of this problem using idempotent producers. Each producer gets a Producer ID (PID). Every message carries a sequence number. If a retry sends the same message again, Kafka detects the duplicate sequence number and ignores the duplicate write. This ensures messages are written only once per partition, even when retries happen. 🔹 𝗧𝗿𝗮𝗻𝘀𝗮𝗰𝘁𝗶𝗼𝗻𝘀 𝗶𝗻 𝗞𝗮𝗳𝗸𝗮 Kafka extends this idea using transactions. Transactions allow producers to write multiple messages across partitions atomically. That means: • Either all writes succeed • Or none of them are committed This is especially useful in stream processing pipelines. 🔹 𝗔𝘁𝗼𝗺𝗶𝗰 𝗢𝗳𝗳𝘀𝗲𝘁 𝗖𝗼𝗺𝗺𝗶𝘁𝘀 Exactly-once processing also requires that consumers don’t reprocess messages after failures. Kafka solves this by committing consumer offsets and produced messages within the same transaction. So if a failure occurs: • Either both commits succeed • Or neither does This ensures a message is never processed twice. What I find fascinating about Kafka is how it layers multiple mechanisms together: • Replication for durability • Idempotent producers for deduplication • Transactions for atomicity Strong guarantees in distributed systems rarely come from a single feature. They emerge from carefully designed layers working together. 💬 Curious: In production systems, do you prefer • 𝘼𝙩-𝙡𝙚𝙖𝙨𝙩-𝙤𝙣𝙘𝙚 𝙙𝙚𝙡𝙞𝙫𝙚𝙧𝙮 𝙬𝙞𝙩𝙝 𝙞𝙙𝙚𝙢𝙥𝙤𝙩𝙚𝙣𝙩 𝙘𝙤𝙣𝙨𝙪𝙢𝙚𝙧𝙨 or • 𝙀𝙭𝙖𝙘𝙩𝙡𝙮-𝙤𝙣𝙘𝙚 𝙜𝙪𝙖𝙧𝙖𝙣𝙩𝙚𝙚𝙨 𝙡𝙞𝙠𝙚 𝙆𝙖𝙛𝙠𝙖 𝙥𝙧𝙤𝙫𝙞𝙙𝙚𝙨? #DistributedSystems #SystemDesign #ApacheKafka #Kafka #BackendEngineering #Microservices #EventDrivenArchitecture #SoftwareEngineering
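A minimal producer-side sketch with the confluent-kafka Python client; the broker address, topics, and payloads are placeholders:

```python
# Hedged sketch: enable.idempotence deduplicates retries broker-side (PID +
# sequence number); transactional.id lets the two writes commit atomically.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "enable.idempotence": True,              # duplicate writes from retries are dropped
    "transactional.id": "orders-tx-1",       # required for transactions
    "acks": "all",
})

producer.init_transactions()
producer.begin_transaction()
producer.produce("orders", key="order-123", value=b'{"amount": 60000}')
producer.produce("order-events", key="order-123", value=b'{"status": "PLACED"}')
producer.commit_transaction()                # both records become visible, or neither
```

Consumers reading with isolation.level=read_committed see both records or neither; committing the consumer's offsets inside the same transaction (send_offsets_to_transaction) extends the guarantee to the full read-process-write loop.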