Workflow Idempotency

This post is a continuation of the workflow basics post.

Idempotency of a workflow

Idempotency is the property of certain operations such that they can be applied multiple times without changing the result beyond the initial application. 

For example, in HTTP, the methods GET, PUT, and DELETE should be implemented in an idempotent manner. If you GET a resource multiple times, the result should be the same, assuming no other changes occur in between. Similarly, updating a resource with PUT or deleting it with DELETE multiple times should have the same effect as doing it once.
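A minimal sketch can make these semantics concrete. The following toy in-memory store (all names here are hypothetical, chosen just for illustration) shows why repeating PUT or DELETE leaves the state unchanged, while repeating POST does not:

```python
class ResourceStore:
    """Toy in-memory store illustrating HTTP method idempotency."""

    def __init__(self):
        self.resources = {}
        self.next_id = 0

    def put(self, key, value):
        # PUT replaces the resource; repeating it leaves the same state.
        self.resources[key] = value

    def delete(self, key):
        # DELETE removes the resource; deleting again is a harmless no-op.
        self.resources.pop(key, None)

    def post(self, value):
        # POST creates a new resource on every call -- NOT idempotent.
        self.next_id += 1
        self.resources[f"item-{self.next_id}"] = value
        return f"item-{self.next_id}"

store = ResourceStore()
store.put("a", 1)
store.put("a", 1)      # repeated PUT: state unchanged
assert store.resources == {"a": 1}

store.delete("a")
store.delete("a")      # repeated DELETE: still gone, no error
assert store.resources == {}

store.post(9)
store.post(9)          # repeated POST: two distinct resources created
assert len(store.resources) == 2
```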

Idempotency is a useful property because it helps with recovery from failures in distributed systems. If an operation fails with a transient error, it can simply be retried. If it had actually completed (with either success or failure), retrying it will not change the result.

Messaging systems like SQS provide "at-least-once delivery" guarantees, which means that on rare occasions a message can be delivered multiple times. The Retry Pattern is recommended for HTTP clients to handle transient failures and improve the stability of an application. These are common scenarios where a client might trigger a workflow multiple times while expecting idempotent behavior.
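The Retry Pattern mentioned above can be sketched as a small helper with exponential backoff (the names `call_with_retries` and `TransientError` are hypothetical, not from any library):

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (e.g. a timeout or HTTP 503)."""

def call_with_retries(operation, max_attempts=3, base_delay=0.01):
    """Retry an operation on transient errors with exponential backoff.

    Retrying is only safe when the operation is idempotent: a retry
    after an ambiguous failure must not change the outcome.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# A flaky operation that succeeds on the third attempt.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("try again")
    return "ok"

assert call_with_retries(flaky) == "ok"
assert len(attempts) == 3
```

Note that the client cannot tell a lost request apart from a lost response, so retries can re-deliver an operation that already succeeded, which is exactly why the downstream workflow should be idempotent.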


Let’s get back to AWS Step Functions and look at how idempotency can be achieved.

Idempotency based on name

The API to start a new workflow execution is StartExecution. It takes the following parameters:

  • stateMachineArn - the identifier of the workflow
  • name - the identifier of this workflow execution
  • input - the input for the first state of the workflow, as a JSON string
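Assembling these parameters with boto3 might look like the sketch below (the ARN and order identifier are hypothetical; the actual client call is shown in a comment since it needs AWS credentials):

```python
import json

# Hypothetical identifiers for illustration only.
state_machine_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:OrderFlow"
order_id = "order-20240101-0042"

params = {
    "stateMachineArn": state_machine_arn,
    "name": order_id,  # reusing this name on retries gives idempotency
    "input": json.dumps({"orderId": order_id, "amount": 25.00}),
}

# With boto3 (not run here; requires AWS credentials):
# sfn = boto3.client("stepfunctions")
# response = sfn.start_execution(**params)

assert isinstance(params["input"], str)   # input must be a JSON *string*
assert len(params["name"]) <= 80          # execution names are limited in length
```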


StartExecution is built as an idempotent operation.

  • If StartExecution is called with the same name and input as a running execution, the call succeeds and returns the same executionArn and startDate.
  • If StartExecution is called with a name used within the 90-day retention period, and the input is different or that execution has already completed, an error signaling that the execution name exists is returned.

To build an idempotent operation based on workflows it is possible to use the name field. A parameter for the operation will need to be used as the name of the workflow execution. If the workflow execution throws an error with an existing name, it will be possible to lookup the previous execution via the DescribeExecution call and return the response. See an example of this pattern here.

Using the provided idempotency based on name is fairly simple. However, it is limited by the 90-day retention period, and it only provides idempotency at the scope of the entire workflow: it is not possible to retry individual steps of a workflow. The following pattern handles more complex scenarios.


Idempotency based on external state

For complex long running workflows, intermediate steps in a workflow can fail. It might be possible to restart a workflow that is able to resume from failed intermediate steps. Hence each state within the workflow should be idempotent in itself.


[Image: diagram of a workflow whose steps record execution metadata in an external datastore]

To make individual states in a workflow idempotent, extract the metadata about a workflow execution (or "computation", as in the diagram) to an external datastore like DynamoDB. Each state within an execution can then look up the current metadata and resume or skip its work accordingly.
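The look-up-then-resume-or-skip logic can be sketched as follows. A plain dict stands in for DynamoDB here; in a real implementation you would use conditional writes to guard against concurrent executions (the helper names are hypothetical):

```python
# Maps (execution_key, step_name) -> stored result; a stand-in for DynamoDB.
metadata_store = {}

def run_step_once(execution_key, step_name, work):
    """Run `work` only if this step has not already completed."""
    key = (execution_key, step_name)
    if key in metadata_store:
        return metadata_store[key]   # already done: skip, reuse prior result
    result = work()
    metadata_store[key] = result     # checkpoint before moving to next step
    return result

calls = []
def charge_card():
    calls.append(1)
    return "charge-receipt-7"        # hypothetical side-effecting step

# The first run performs the work; a retried run skips it.
assert run_step_once("order-42", "charge", charge_card) == "charge-receipt-7"
assert run_step_once("order-42", "charge", charge_card) == "charge-receipt-7"
assert len(calls) == 1
```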

The diagram above is from an AWS talk - Under the Covers of AWS: Its Core Distributed Systems. The talk covers various primitives for building distributed systems, including workflows.

With metadata stored in an external service, the name of the workflow execution no longer matters. Instead, unique identifiers derived from the input serve as lookup keys into the metadata store, and each step can record additional information there as it completes.
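One common way to derive such a key (a sketch, not from any particular library) is to hash a canonical serialization of the input, so that logically equal payloads always map to the same metadata record:

```python
import hashlib
import json

def idempotency_key(payload):
    """Derive a stable key from a request payload.

    Serializing with sorted keys makes the hash deterministic for
    logically-equal inputs, so retries hit the same metadata record.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = idempotency_key({"orderId": "42", "amount": 25})
b = idempotency_key({"amount": 25, "orderId": "42"})  # same data, other order
assert a == b
assert a != idempotency_key({"orderId": "43", "amount": 25})
```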

Using an external datastore adds complexity to the solution, but it allows for more control. It also removes the 90-day limit on workflow idempotency of the previous pattern.


Next

  • Service Integration patterns with async flows
