Orchestrating custom code with AWS Batch and Spot instances

In this post we take a look at running a batch process across a fleet of EC2 Spot instances using custom code in a container.

This does not "Spark" joy

I ran into a situation today where we needed to run a piece of custom code as a batch process, make it repeatable, and run it at the lowest possible cost. The custom code was already baked and tested, so there was no scope to refactor it or port it to something that would run happily on Amazon EMR, AWS Glue or anything Spark related.

So the proposal was to wrap the custom code library, build a container and use AWS Batch to run it on EC2 Spot instances. Below is a prototype of the code to demonstrate this working. The main components:

AWS Batch - orchestration service where we define our compute and job parameters

Amazon SQS - queue to batch up work for the custom code to run

Amazon S3 - this is used as the source/destination for reading raw data and then storing calculated results


I will be using the AWS CDK to define the infrastructure we need, which is deployed using CloudFormation.

In this example we have .NET code that does some work on a data set and stores the result. To simulate this I have taken an example from the AWS .NET SDK that queries some data from S3 and saves the results as JSON. We are not going to focus too much on the actual code here, as it is only being used to create an arbitrary in-memory process on the container that reads and writes a stream. It represents the real-world use case I have, which is to take a string and return some JSON... simple enough?

So I wrote a wrapper around this function that does the following:

  • Read an SQS message which has details on the source file to be read - this is batched by another function
  • Query S3 using S3 Select to gather a limited set of results
  • Pass the string into the custom code and return JSON
  • Upload the processed data to S3
  • Delete the message from SQS to indicate success

The source for this can be found here.
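The wrapper steps above can be sketched as follows. This is a minimal TypeScript sketch, not the actual .NET implementation: the queue and storage interfaces stand in for the real SQS and S3 SDK clients, and all names here are illustrative.

```typescript
// Minimal sketch of the wrapper loop. QueueClient and ObjectStore stand in
// for the real SQS/S3 SDK clients; all names here are illustrative.

interface QueueMessage { receiptHandle: string; body: string; }

interface QueueClient {
  receive(queueUrl: string): QueueMessage | undefined;
  remove(queueUrl: string, receiptHandle: string): void;
}

interface ObjectStore {
  select(bucket: string, key: string, expression: string): string;
  put(bucket: string, key: string, data: string): void;
}

// Build the S3 Select expression; limitClause comes from the S3_QUERY_LIMIT
// environment variable (e.g. " LIMIT 10000") and may be empty.
function buildSelectExpression(limitClause: string): string {
  return "SELECT * FROM s3object s" + limitClause;
}

function processOne(queueUrl: string, bucket: string, limitClause: string,
                    queue: QueueClient, store: ObjectStore,
                    customCode: (raw: string) => string): boolean {
  // 1. Read a message describing the source file to process.
  const msg = queue.receive(queueUrl);
  if (!msg) return false;                    // queue drained, nothing to do

  const { key } = JSON.parse(msg.body);      // message carries the S3 key

  // 2. Query a limited result set with S3 Select.
  const raw = store.select(bucket, key, buildSelectExpression(limitClause));

  // 3. Run the custom code: string in, JSON out.
  const result = customCode(raw);

  // 4. Upload the processed data.
  store.put(bucket, `results/${key}.json`, result);

  // 5. Delete the message only once the upload has succeeded.
  queue.remove(queueUrl, msg.receiptHandle);
  return true;
}
```

Deleting the message last is deliberate: if the container dies mid-run, the message simply reappears on the queue after the visibility timeout and another job retries it.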

The Batch User Guide has a great example of building a Docker image and running it in parallel across a number of jobs. I've used much of this to scaffold out the infrastructure I needed using the CDK. The guide covers creating the image, pushing it to ECR and the Job Definition you need to run jobs.

Using EC2 Spot instances is a great fit here: this isn't a mission-critical or time-sensitive job, so I can make significant savings by using Spot instances to run my containers. AWS Batch manages spinning the compute environment up and down when jobs need to run, so I am only paying while the jobs are running.

So a quick highlight of some of the CDK code we need:

const batchServiceRole = new iam.Role(this, 'batchServiceRole', {
  roleName: 'batchServiceRole',
  assumedBy: new ServicePrincipal('batch.amazonaws.com'),
  managedPolicies: [
    iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSBatchServiceRole')
  ]
});


const spotFleetRole = new iam.Role(this, 'spotFleetRole', {
  roleName: 'AmazonEC2SpotFleetRole',
  assumedBy: new ServicePrincipal('spotfleet.amazonaws.com'),
  managedPolicies: [
    iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonEC2SpotFleetTaggingRole')
  ]
});


const batchInstanceRole = new iam.Role(this, 'batchInstanceRole', {
  roleName: 'batchInstanceRole',
  assumedBy: new iam.CompositePrincipal(
      new ServicePrincipal('ec2.amazonaws.com'),
      new ServicePrincipal('ecs.amazonaws.com')),
  managedPolicies: [
    iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonS3FullAccess')
  ]
});


new iam.CfnInstanceProfile(this, 'batchInstanceProfile', {
  instanceProfileName: batchInstanceRole.roleName,
  roles: [
    batchInstanceRole.roleName
  ]
});

Note - You need to create an IAM instance profile for the container instances to run under; just creating a role in CDK is not enough.

const compEnv = new batch.CfnComputeEnvironment(this, 'batchCompute', {
  type: 'MANAGED',
  serviceRole: batchServiceRole.roleArn,
  computeResources: {
    type: 'SPOT',
    maxvCpus: 128,
    minvCpus: 0,
    desiredvCpus: 0,
    spotIamFleetRole: spotFleetRole.roleArn,
    instanceRole: batchInstanceRole.roleName,
    instanceTypes: [
      'optimal'
    ],
    subnets: [
      vpc.publicSubnets[0].subnetId,
      vpc.publicSubnets[1].subnetId,
      vpc.publicSubnets[2].subnetId
    ],
    securityGroupIds: [
      vpc.vpcDefaultSecurityGroup
    ]
  }
});

Important - "instanceRole" should be the IAM role name, NOT the IAM role ARN. If you get this wrong you will end up with an "INVALID" Batch compute environment and will need to fix the name and recreate the environment from scratch.

new batch.CfnJobDefinition(this, 'batchJobDef', {
  jobDefinitionName: "s3select-dotnet",
  type: "container",
  containerProperties: {
    image: this.accountId + ".dkr.ecr.us-east-1.amazonaws.com/mjsdemo-ecr:latest",
    vcpus: 1,
    memory: 128,
    environment: [
      {
        name: "SQS_QUEUE_URL",
        value: sqsQueue.queueUrl
      },
      {
        name: "S3_QUERY_LIMIT",
        value: " LIMIT 10000"
      } 
    ]
  }
});

The nice thing here is that I can inject my SQS details from earlier on in the stack and pass them into the container as environment variables.
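On the container side, picking these up is just a matter of reading the environment at startup. The real worker is .NET, but the idea can be sketched in TypeScript; in the container this would be called as `readConfig(process.env)`, and the variable names match the Job Definition above.

```typescript
// Read the configuration injected by the Job Definition. In the container
// this would be called as readConfig(process.env).
function readConfig(env: Record<string, string | undefined>) {
  const queueUrl = env.SQS_QUEUE_URL;
  if (!queueUrl) {
    throw new Error("SQS_QUEUE_URL must be set");
  }
  // Default to no LIMIT clause when no override is supplied.
  return { queueUrl, queryLimit: env.S3_QUERY_LIMIT ?? "" };
}
```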

Then I jump to the Console or CLI and submit my job, something like this example:

aws batch submit-job --job-name jobdemo \
  --job-queue batchJobSpot --job-definition s3select-dotnet \
  --array-properties size=100 \
  --container-overrides environment=[{name=S3_QUERY_LIMIT,value=2000}]
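With `--array-properties size=100`, Batch launches 100 child jobs from that single submission, and each child can read its own index from the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable. In this design every child simply polls the shared queue, but if you wanted to partition work deterministically by index instead, the idea is a one-liner (a hypothetical helper, not part of the actual prototype):

```typescript
// Hypothetical partitioning helper: given the full list of S3 keys, the
// array size, and this child's AWS_BATCH_JOB_ARRAY_INDEX, return the
// slice of keys this child should process.
function keysForChild(keys: string[], arraySize: number,
                      arrayIndex: number): string[] {
  return keys.filter((_, i) => i % arraySize === arrayIndex);
}
```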

There we have it


So taking a look at some CloudTrail logs, our Spot instance details are:

Start Time: 2019-06-11, 04:54:24 PM

End Time: 2019-06-11, 05:04:49 PM

Duration: 10 Min 25 Sec (625 Sec)

Instance Type: c4.4xlarge

Availability Zone: us-east-1a

On Demand Price: $0.7960 per hour

Spot Price: $0.2526 per hour

On Demand Cost: $0.1382

Spot Cost: $0.0439

Saving: $0.0943

Or to put it another way, a 68% reduction!
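The arithmetic behind those figures is easy to reproduce: EC2 prices are quoted per hour, so the cost of a run is price × seconds ÷ 3600. A quick check using the numbers from the run above:

```typescript
// Cost of a run given a duration in seconds and an hourly price.
function runCost(durationSeconds: number, hourlyPrice: number): number {
  return hourlyPrice * durationSeconds / 3600;
}

// Percentage saving of the Spot price against the On Demand price.
function savingPercent(onDemandPrice: number, spotPrice: number): number {
  return (1 - spotPrice / onDemandPrice) * 100;
}

// 625 seconds on a c4.4xlarge: $0.796/hr On Demand vs $0.2526/hr Spot.
const onDemandCost = runCost(625, 0.796);        // ≈ $0.1382
const spotCost = runCost(625, 0.2526);           // ≈ $0.0439
const saving = savingPercent(0.796, 0.2526);     // ≈ 68%
```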

This might seem like small change, but for 100s, 1,000s or 10,000s of jobs running over a month, it will add up!
