Running one pod per node
This is part of the series "Running Kubernetes at scale in AWS EKS". Links to all related articles are published there.
We have a very specialized ML application that was tuned for a specific EC2 instance type, and its CPU and memory profile matched our requirements very well, so we decided to go with one pod per node. You may wonder, "What can possibly go wrong with running one pod per node?"
A lot, if you combine it with where your Docker registry is located and how big your images are. (I use container/pod interchangeably, as we also run one container per pod.) When a new container is spun up on a node, the image is pulled from the registry; a second container on the same node will not need to pull the image again. However, if you run a single container/pod per instance, the image is pulled from the registry every time an autoscaling scale-out happens. The "imagePullPolicy: IfNotPresent" setting doesn't help here, because a newly autoscaled node has no cached image, so it has to be pulled anyway.
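For reference, here is a minimal sketch of where that setting lives in a pod spec; the pod name, container name, and image are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-inference                # hypothetical name
spec:
  containers:
    - name: model-server            # hypothetical name
      image: registry.example.com/ml/model-server:v1   # hypothetical image
      # IfNotPresent only skips the pull when the image is already cached
      # on the node. A freshly autoscaled node has an empty cache, so a
      # full pull happens regardless of this setting.
      imagePullPolicy: IfNotPresent
```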
Compound that with the Docker registry sitting in one AWS region and your cluster in another. Our image size was upwards of 3.5 GB (an NVIDIA GPU container), and with a daily average of 1,000 autoscaled servers we were looking at 3.5 TB per day, close to 100 TB a month, of data transfer cost just to move Docker images. Even the ECR registry comes out too expensive for that kind of network egress.
We also played with the "--max-concurrent-downloads" setting in dockerd, which controls how many image layers are downloaded in parallel. Raising it to 20 got us an additional 25% faster downloads, but beyond 20 concurrent downloads it showed diminishing returns, because our Dockerfile produced 20 layers, so there was nothing left to parallelize.
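A sketch of how this can be set persistently via the Docker daemon config at /etc/docker/daemon.json (the same flag can also be passed to dockerd on the command line); the daemon needs a restart afterwards:

```json
{
  "max-concurrent-downloads": 20
}
```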
In summary, what can go wrong:
- Higher image transfer cost
- More time taken to start pods
- Long pod startup times lead the cluster autoscaler on EKS to start more nodes than are really required
Solution
At the time of writing we were in a very tight crunch to come up with a solution and move quickly with the rollout. We didn't want to go back and spend time reducing the size of the image (which we could have) or localizing our Docker registry in each region, so we went with a very brute-force solution:
- Bake the Docker image into the AMI used by the cluster autoscaler/ASG to launch nodes (see the sketch below)
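A minimal sketch of the bake step, assuming a Packer-style shell provisioner running during the AMI build; the region, registry URL, and image tag are all hypothetical:

```bash
#!/usr/bin/env bash
# Runs during the AMI build. Pre-pulls the image so that freshly
# autoscaled nodes boot with it already in the local Docker cache.
set -euo pipefail

REGION="us-east-1"                                        # hypothetical
REGISTRY="123456789012.dkr.ecr.${REGION}.amazonaws.com"   # hypothetical
IMAGE="${REGISTRY}/ml/model-server:v1"                    # hypothetical

# Authenticate to ECR, then pull; the Docker image cache on disk is
# captured in the AMI snapshot along with everything else.
aws ecr get-login-password --region "${REGION}" \
  | docker login --username AWS --password-stdin "${REGISTRY}"
docker pull "${IMAGE}"
```

The trade-off is that every new image version requires rebuilding the AMI, which is why this is a brute-force fix rather than a long-term one.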
We reduced our pod startup time, measured from when the cluster autoscaler requested an instance in response to a scale-out event to when the readiness check passed, from 10-12 minutes to 5 minutes. Part of that 5 minutes is attaching the GPU to the instance; on average we are seeing AWS take 2.5 minutes to bring up a GPU instance.