Hanging Kubernetes master node
This blog tries to solve the following problems, which for me were different symptoms of the same underlying issue:
- a kops rolling update of the cluster stuck in an infinite loop
- kubectl drain of a master node hanging in the same loop
- kube-system pods on the master stuck in an Unknown state
Note: This will also mark the k8s master node Out-of-Service in the AWS Application Load Balancer.
In the Beginning
The issue started after making image changes and doing a rolling update of the cluster. However, I believe the commands and troubleshooting steps mentioned here should be usable in a wider context, even when you are not on AWS or are not using kops to manage your Kubernetes cluster.
The Precursor
For the sake of completeness, the following are the precursory commands for a kops rolling update.
You set the context:
export KOPS_STATE_STORE=s3://path-to-s3-bucket
export AWS_SECRET_ACCESS_KEY=<AWS Secret Access Key>
export AWS_ACCESS_KEY_ID=<AWS Access Key ID>
kops export kubecfg --name k8s-cluster.k8s.local
Commands to get cluster related information:
kops get clusters
kops get instancegroups
Before editing the nodes, validate the cluster:
kops validate cluster
kops edit ig nodes
In my case I updated the master and nodes with a new image name, followed by the commands below.
kops update cluster k8s-cluster.k8s.local
kops update cluster k8s-cluster.k8s.local --yes
kops rolling-update cluster k8s-cluster.k8s.local --yes --v 10
Note: When troubleshooting, it is always advisable to run kops commands with a verbosity level of 10.
Issue
Then something strange happened. After updating one master instance, the rolling update got stuck in what looked like an infinite loop, repeating the following log message:
I1109 06:39:54.642739 9097 round_trippers.go:419] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.15.3 (linux/amd64) kubernetes/2d3c76f" -H "Authorization: Basic YWRtaW46TmhtMWg3djB6d05VT2xMMHJGSzZHb1dMVWF1eERCSmU=" 'https://api-k8s-cluster-lt5hsm-1336605497.aws-region-code.elb.amazonaws.com/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27'
I1109 06:39:54.645959 9097 round_trippers.go:438] GET https://api-k8s-cluster-lt5hsm-1336605497.aws-region-code.elb.amazonaws.com/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27 200 OK in 3 milliseconds
I1109 06:39:54.645975 9097 round_trippers.go:444] Response Headers:
I1109 06:39:54.645982 9097 round_trippers.go:447] Content-Type: application/json
I1109 06:39:54.645988 9097 round_trippers.go:447] Content-Length: 3305
I1109 06:39:54.645995 9097 round_trippers.go:447] Date: Sat, 09 Nov 2019 06:39:54 GMT
I1109 06:39:54.646055 9097 request.go:947] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"dns-controller-547884bc7f-jfj27","generateName":"dns-controller-547884bc7f-","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27","uid":"0e0f2f59-e0e2-11e9-9575-0a84c4d3f852","resourceVersion":"10118794","creationTimestamp":"2019-09-27T04:48:31Z","deletionTimestamp":"2019-10-26T03:28:22Z","deletionGracePeriodSeconds":30,"labels":{"k8s-addon":"dns-controller.addons.k8s.io","k8s-app":"dns-controller","pod-template-hash":"1034406739","version":"v1.11.0"},"annotations":{"scheduler.alpha.kubernetes.io/critical-pod":"","scheduler.alpha.kubernetes.io/tolerations":"[{\"key\": \"dedicated\", \"value\": \"master\"}]"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"dns-controller-547884bc7f","uid":"c3652bb7-c403-11e9-b0da-023fba3c4bdc","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"dns-controller-token-nzhv2","secret":{"secretName":"dns-controller-token-nzhv2","defaultMode":420}}],"containers":[{"name":"dns-controller","image":"kope/dns-controller:1.11.0","command":["/usr/bin/dns-controller","--watch-ingress=false","--dns=gossip","--gossip-seed=127.0.0.1:3999","--zone=*/*","-v=2"],"resources":{"requests":{"cpu":"50m","memory":"50Mi"}},"volumeMounts":[{"name":"dns-controller-token-nzhv2","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"Default","nodeSelector":{"node-role.kubernetes.io/master":""},"serviceAccountName":"dns-controller","serviceAccount":"dns-controller","nodeName":"ip-problematic-master-node.aws-region-code.compute.internal","hostNetwork":true,"securityContext":{},"schedulerName":"default-scheduler","tolerations":[{"key":"node-role.kubernetes.io/master","effect":"NoSchedule"},{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":300},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}],"priority":0},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:31Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:38Z"},{"type":"ContainersReady","status":"True","lastProbeTime":null,"lastTransitionTime":null},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:31Z"}],"message":"Node ip-problematic-master-node.aws-region-code.compute.internal which was running pod dns-controller-547884bc7f-jfj27 is unresponsive","reason":"NodeLost","hostIP":"<masternode-internal-ip>","podIP":"<masternode-internal-ip>","startTime":"2019-09-27T04:48:31Z","containerStatuses":[{"name":"dns-controller","state":{"running":{"startedAt":"2019-09-27T04:48:38Z"}},"lastState":{},"ready":true,"restartCount":0,"image":"kope/dns-controller:1.11.0","imageID":"docker-pullable://kope/dns-controller@sha256:8fe88797f537df25615899d3bdd380632746320bb559cb220b854dc26334adc4","containerID":"docker://9db09361b4bcb35acaf4742fc4858d787b53660fbf7c1d683e6eb42aadaeca57"}],"qosClass":"Burstable"}}
Troubleshooting
The kube-system pods on one of the nodes were in an Unknown state:
ubuntu@localhost:~$ kubectl get pods --all-namespaces | grep -i unknown
kube-system dns-controller-547884bc7f-jfj27 1/1 Unknown 0 46d
kube-system etcd-server-events-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 0 49d
kube-system etcd-server-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 373 49d
kube-system kube-apiserver-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 372 49d
kube-system kube-controller-manager-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 3 49d
kube-system kube-proxy-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 0 49d
kube-system kube-scheduler-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 3 49d
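Since all of these pods sit on the same node, it is also worth confirming the node's own status; if the kubelet has stopped reporting, the node shows as NotReady and its conditions explain why. These are generic diagnostics rather than output from my original run:
kubectl get nodes
kubectl describe node ip-problematic-master-node.aws-region-code.compute.internal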
Draining the node didn't work either. It went into the same infinite loop as the kops rolling update:
ubuntu@localhost:~$ kubectl drain ip-problematic-master-node.aws-region-code.compute.internal --ignore-daemonsets -v 10
I1112 06:12:44.541592 31175 round_trippers.go:419] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.15.3 (linux/amd64) kubernetes/2d3c76f" -H "Authorization: Basic YWRtaW46TmhtMWg3djB6d05VT2xMMHJGSzZHb1dMVWF1eERCSmU=" 'https://api-k8s-cluster-lt5hsm-1336605497.aws-region-code.elb.amazonaws.com/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27'
I1112 06:12:44.546354 31175 round_trippers.go:438] GET https://api-k8s-cluster-lt5hsm-1336605497.aws-region-code.elb.amazonaws.com/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27 200 OK in 4 milliseconds
I1112 06:12:44.546370 31175 round_trippers.go:444] Response Headers:
I1112 06:12:44.546377 31175 round_trippers.go:447] Content-Type: application/json
I1112 06:12:44.546384 31175 round_trippers.go:447] Content-Length: 3305
I1112 06:12:44.546390 31175 round_trippers.go:447] Date: Tue, 12 Nov 2019 06:12:44 GMT
I1112 06:12:44.546425 31175 request.go:947] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"dns-controller-547884bc7f-jfj27","generateName":"dns-controller-547884bc7f-","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27","uid":"0e0f2f59-e0e2-11e9-9575-0a84c4d3f852","resourceVersion":"10118794","creationTimestamp":"2019-09-27T04:48:31Z","deletionTimestamp":"2019-10-26T03:28:22Z","deletionGracePeriodSeconds":30,"labels":{"k8s-addon":"dns-controller.addons.k8s.io","k8s-app":"dns-controller","pod-template-hash":"1034406739","version":"v1.11.0"},"annotations":{"scheduler.alpha.kubernetes.io/critical-pod":"","scheduler.alpha.kubernetes.io/tolerations":"[{\"key\": \"dedicated\", \"value\": \"master\"}]"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"dns-controller-547884bc7f","uid":"c3652bb7-c403-11e9-b0da-023fba3c4bdc","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"dns-controller-token-nzhv2","secret":{"secretName":"dns-controller-token-nzhv2","defaultMode":420}}],"containers":[{"name":"dns-controller","image":"kope/dns-controller:1.11.0","command":["/usr/bin/dns-controller","--watch-ingress=false","--dns=gossip","--gossip-seed=127.0.0.1:3999","--zone=*/*","-v=2"],"resources":{"requests":{"cpu":"50m","memory":"50Mi"}},"volumeMounts":[{"name":"dns-controller-token-nzhv2","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"Default","nodeSelector":{"node-role.kubernetes.io/master":""},"serviceAccountName":"dns-controller","serviceAccount":"dns-controller","nodeName":"ip-problematic-master-node.aws-region-code.compute.internal","hostNetwork":true,"securityContext":{},"schedulerName":"default-scheduler","tolerations":[{"key":"node-role.kubernetes.io/master","effect":"NoSchedule"},{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":300},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}],"priority":0},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:31Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:38Z"},{"type":"ContainersReady","status":"True","lastProbeTime":null,"lastTransitionTime":null},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:31Z"}],"message":"Node ip-problematic-master-node.aws-region-code.compute.internal which was running pod dns-controller-547884bc7f-jfj27 is unresponsive","reason":"NodeLost","hostIP":"<masternode-internal-ip>","podIP":"<masternode-internal-ip>","startTime":"2019-09-27T04:48:31Z","containerStatuses":[{"name":"dns-controller","state":{"running":{"startedAt":"2019-09-27T04:48:38Z"}},"lastState":{},"ready":true,"restartCount":0,"image":"kope/dns-controller:1.11.0","imageID":"docker-pullable://kope/dns-controller@sha256:8fe88797f537df25615899d3bdd380632746320bb559cb220b854dc26334adc4","containerID":"docker://9db09361b4bcb35acaf4742fc4858d787b53660fbf7c1d683e6eb42aadaeca57"}],"qosClass":"Burstable"}}
Note: The --ignore-daemonsets flag makes kubectl proceed with the drain even when DaemonSet-managed pods are present; those pods are ignored and left running on the node.
Force deleting the pods didn’t work:
ubuntu@localhost:~$ kubectl delete pod dns-controller-547884bc7f-jfj27 --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (NotFound): pods "dns-controller-547884bc7f-jfj27" not found
ubuntu@localhost:~$ kubectl delete pod etcd-server-events-ip-problematic-master-node.aws-region-code.compute.internal --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (NotFound): pods "etcd-server-events-ip-problematic-master-node.aws-region-code.compute.internal" not found
Root cause
After logging into the problematic master node, I found that /var/lib/docker had far too many directories inside it:
root@ip-problematic-master-node:/var/lib/docker/containers# ls | wc -l
15979
root@ip-problematic-master-node:/var/lib/docker/containers# cd ../overlay2/
root@ip-problematic-master-node:/var/lib/docker/overlay2# ls | wc -l
31986
This was not only filling up the disk space, it was exhausting the inodes as well. Compare this with the number of files on a healthy node:
root@ip-healthy-master-node:/var/lib/docker/containers# ls | wc -l
21
root@ip-healthy-master-node:/var/lib/docker/containers# cd ../overlay2/
root@ip-healthy-master-node:/var/lib/docker/overlay2# ls | wc -l
82
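If you suspect this kind of buildup, df can confirm both conditions; the -i flag reports inode usage instead of block usage:
df -h /var/lib/docker   # disk space usage of the filesystem holding the Docker data
df -i /var/lib/docker   # inode usage of the same filesystem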
So the root cause was too many leftover files under the Docker data directory. Why this was happening I was unable to figure out. Maybe detailed monitoring with continuous snapshots of the following commands would help:
docker ps
docker system df
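For example, a cron entry along the following lines would capture periodic snapshots for later comparison (the file path, schedule, and log location are arbitrary choices of mine, not from the original setup):
# /etc/cron.d/docker-snapshot (hypothetical file): log container and disk state every 10 minutes
*/10 * * * * root { date; /usr/bin/docker ps; /usr/bin/docker system df; } >> /var/log/docker-snapshots.log 2>&1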
What didn't help or work for me
With the inodes and disk filled up, none of the Docker commands such as "docker ps" or "docker help" would run.
You can try cleaning up /var/lib/docker manually and restarting the Docker daemon. This will fix the Docker daemon but not the master node. Rebooting the master node itself didn't fix it either. As per my understanding, once Kubernetes is on bad terms with Docker, it is hard to fix the relationship.
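For reference, such a manual cleanup would look roughly like the sketch below, assuming a systemd-based host. It destroys all local images, containers, and overlay2 layers, so only use it on a node you are prepared to lose:
systemctl stop docker    # stop the daemon so nothing holds the files open
rm -rf /var/lib/docker   # wipe all Docker state: containers, images, overlay2 layers
systemctl start docker   # restart with a fresh, empty data directory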
Solution
To solve this problem you have to terminate the problematic master node instance and create a new one. If the Kubernetes infrastructure is managed by kops, a new master node instance will be created automatically (via the AWS Auto Scaling group).
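If you prefer doing this by hand rather than through the AWS console, terminating the instance is a one-liner with the AWS CLI (the instance ID below is a placeholder):
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0   # replace with the problematic master's instance ID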
Proactive solution
To avoid this issue in the future, set up a daily cron job to prune the Docker system:
docker system prune --force
Note: Recent versions of docker system prune support a --filter option where you can specify something like "until=23h".
With that filter, the command cleans up all stopped containers, dangling images, and unused networks that were created more than 23 hours ago. You will have to set the frequency based on your requirements; for example, on development systems you might want to increase or decrease the time window.
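A minimal sketch of such a daily job, assuming an /etc/cron.d style entry (the file path, schedule, and log location are my choices, not from the original post):
# /etc/cron.d/docker-prune (hypothetical file): prune unused Docker data daily at 03:00
0 3 * * * root /usr/bin/docker system prune --force --filter "until=23h" >> /var/log/docker-prune.log 2>&1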
However, if your system already ends up in this shape, then from my experience terminating the instance is the only solution.
Hope this helps. Please do share your feedback.