Hanging Kubernetes master node
This blog tries to solve the following problems, which for me were different symptoms of the same underlying issue:
- a kops rolling update of the cluster stuck in an infinite loop
- kubectl drain of a master node hanging in the same loop
- kube-system pods on the master stuck in an Unknown state
Note: This will also mark the k8s master node Out-of-Service in the AWS Application Load Balancer.
In the Beginning
The issue started after making image changes and doing a rolling update of the cluster. However, I believe the commands and troubleshooting steps mentioned here should be usable in a wider context, even when you are not on AWS or are not using kops to manage your Kubernetes cluster.
The Precursor
For the sake of completeness, the following are the precursory commands for a kops rolling update.
You set the context:
export KOPS_STATE_STORE=s3://path-to-s3-bucket
export AWS_SECRET_ACCESS_KEY=<AWS Secret Access Key>
export AWS_ACCESS_KEY_ID=<AWS Access Key ID>
kops export kubecfg --name k8s-cluster.k8s.local
Commands to get cluster related information:
kops get clusters
kops get instancegroups
Before editing the nodes, validate the cluster:
kops validate cluster
kops edit ig nodes
In my case I updated the master and nodes with a new image name, followed by the commands below.
kops update cluster k8s-cluster.k8s.local
kops update cluster k8s-cluster.k8s.local --yes
kops rolling-update cluster k8s-cluster.k8s.local --yes --v 10
Note: When troubleshooting, it is always advisable to run kops commands with a verbosity level of 10.
Issue
Then something strange happened. After updating one master instance, the rolling update got stuck in what looked like an infinite loop, repeating the following log message:
I1109 06:39:54.642739 9097 round_trippers.go:419] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.15.3 (linux/amd64) kubernetes/2d3c76f" -H "Authorization: Basic YWRtaW46TmhtMWg3djB6d05VT2xMMHJGSzZHb1dMVWF1eERCSmU=" 'https://api-k8s-cluster-lt5hsm-1336605497.aws-region-code.elb.amazonaws.com/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27'
I1109 06:39:54.645959 9097 round_trippers.go:438] GET https://api-k8s-cluster-lt5hsm-1336605497.aws-region-code.elb.amazonaws.com/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27 200 OK in 3 milliseconds
I1109 06:39:54.645975 9097 round_trippers.go:444] Response Headers:
I1109 06:39:54.645982 9097 round_trippers.go:447] Content-Type: application/json
I1109 06:39:54.645988 9097 round_trippers.go:447] Content-Length: 3305
I1109 06:39:54.645995 9097 round_trippers.go:447] Date: Sat, 09 Nov 2019 06:39:54 GMT
I1109 06:39:54.646055 9097 request.go:947] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"dns-controller-547884bc7f-jfj27","generateName":"dns-controller-547884bc7f-","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27","uid":"0e0f2f59-e0e2-11e9-9575-0a84c4d3f852","resourceVersion":"10118794","creationTimestamp":"2019-09-27T04:48:31Z","deletionTimestamp":"2019-10-26T03:28:22Z","deletionGracePeriodSeconds":30,"labels":{"k8s-addon":"dns-controller.addons.k8s.io","k8s-app":"dns-controller","pod-template-hash":"1034406739","version":"v1.11.0"},"annotations":{"scheduler.alpha.kubernetes.io/critical-pod":"","scheduler.alpha.kubernetes.io/tolerations":"[{\"key\": \"dedicated\", \"value\": \"master\"}]"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"dns-controller-547884bc7f","uid":"c3652bb7-c403-11e9-b0da-023fba3c4bdc","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"dns-controller-token-nzhv2","secret":{"secretName":"dns-controller-token-nzhv2","defaultMode":420}}],"containers":[{"name":"dns-controller","image":"kope/dns-controller:1.11.0","command":["/usr/bin/dns-controller","--watch-ingress=false","--dns=gossip","--gossip-seed=127.0.0.1:3999","--zone=*/*","-v=2"],"resources":{"requests":{"cpu":"50m","memory":"50Mi"}},"volumeMounts":[{"name":"dns-controller-token-nzhv2","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"Default","nodeSelector":{"node-role.kubernetes.io/master":""},"serviceAccountName":"dns-controller","serviceAccount":"dns-controller","nodeName":"ip-problematic-master-node.aws-region-code.compute.internal","hostNetwork":true,"securityContext":{},"schedulerName":"default-scheduler","tolerations":[{"key":"node-role.kubernetes.io/master","effect":"NoSchedule"},{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":300},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}],"priority":0},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:31Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:38Z"},{"type":"ContainersReady","status":"True","lastProbeTime":null,"lastTransitionTime":null},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:31Z"}],"message":"Node ip-problematic-master-node.aws-region-code.compute.internal which was running pod dns-controller-547884bc7f-jfj27 is unresponsive","reason":"NodeLost","hostIP":"<masternode-internal-ip>","podIP":"<masternode-internal-ip>","startTime":"2019-09-27T04:48:31Z","containerStatuses":[{"name":"dns-controller","state":{"running":{"startedAt":"2019-09-27T04:48:38Z"}},"lastState":{},"ready":true,"restartCount":0,"image":"kope/dns-controller:1.11.0","imageID":"docker-pullable://kope/dns-controller@sha256:8fe88797f537df25615899d3bdd380632746320bb559cb220b854dc26334adc4","containerID":"docker://9db09361b4bcb35acaf4742fc4858d787b53660fbf7c1d683e6eb42aadaeca57"}],"qosClass":"Burstable"}}
Troubleshooting
The kube-system pods on one of the nodes were in an Unknown state:
ubuntu@localhost:~$ kubectl get pods --all-namespaces | grep -i unknown
kube-system dns-controller-547884bc7f-jfj27 1/1 Unknown 0 46d
kube-system etcd-server-events-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 0 49d
kube-system etcd-server-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 373 49d
kube-system kube-apiserver-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 372 49d
kube-system kube-controller-manager-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 3 49d
kube-system kube-proxy-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 0 49d
kube-system kube-scheduler-ip-problematic-master-node.aws-region-code.compute.internal 1/1 Unknown 3 49d
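Since all of these pods sit on the same node, it is also worth confirming the node's own status; if the kubelet has stopped reporting, the node shows as NotReady and its conditions explain why. These are generic diagnostics rather than output from my original run:
kubectl get nodes
kubectl describe node ip-problematic-master-node.aws-region-code.compute.internal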
Draining the node didn't work either. It went into the same infinite loop as the kops rolling update:
ubuntu@localhost:~$ kubectl drain ip-problematic-master-node.aws-region-code.compute.internal --ignore-daemonsets -v 10
I1112 06:12:44.541592 31175 round_trippers.go:419] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.15.3 (linux/amd64) kubernetes/2d3c76f" -H "Authorization: Basic YWRtaW46TmhtMWg3djB6d05VT2xMMHJGSzZHb1dMVWF1eERCSmU=" 'https://api-k8s-cluster-lt5hsm-1336605497.aws-region-code.elb.amazonaws.com/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27'
I1112 06:12:44.546354 31175 round_trippers.go:438] GET https://api-k8s-cluster-lt5hsm-1336605497.aws-region-code.elb.amazonaws.com/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27 200 OK in 4 milliseconds
I1112 06:12:44.546370 31175 round_trippers.go:444] Response Headers:
I1112 06:12:44.546377 31175 round_trippers.go:447] Content-Type: application/json
I1112 06:12:44.546384 31175 round_trippers.go:447] Content-Length: 3305
I1112 06:12:44.546390 31175 round_trippers.go:447] Date: Tue, 12 Nov 2019 06:12:44 GMT
I1112 06:12:44.546425 31175 request.go:947] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"dns-controller-547884bc7f-jfj27","generateName":"dns-controller-547884bc7f-","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/pods/dns-controller-547884bc7f-jfj27","uid":"0e0f2f59-e0e2-11e9-9575-0a84c4d3f852","resourceVersion":"10118794","creationTimestamp":"2019-09-27T04:48:31Z","deletionTimestamp":"2019-10-26T03:28:22Z","deletionGracePeriodSeconds":30,"labels":{"k8s-addon":"dns-controller.addons.k8s.io","k8s-app":"dns-controller","pod-template-hash":"1034406739","version":"v1.11.0"},"annotations":{"scheduler.alpha.kubernetes.io/critical-pod":"","scheduler.alpha.kubernetes.io/tolerations":"[{\"key\": \"dedicated\", \"value\": \"master\"}]"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"dns-controller-547884bc7f","uid":"c3652bb7-c403-11e9-b0da-023fba3c4bdc","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"dns-controller-token-nzhv2","secret":{"secretName":"dns-controller-token-nzhv2","defaultMode":420}}],"containers":[{"name":"dns-controller","image":"kope/dns-controller:1.11.0","command":["/usr/bin/dns-controller","--watch-ingress=false","--dns=gossip","--gossip-seed=127.0.0.1:3999","--zone=*/*","-v=2"],"resources":{"requests":{"cpu":"50m","memory":"50Mi"}},"volumeMounts":[{"name":"dns-controller-token-nzhv2","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"Default","nodeSelector":{"node-role.kubernetes.io/master":""},"serviceAccountName":"dns-controller","serviceAccount":"dns-controller","nodeName":"ip-problematic-master-node.aws-region-code.compute.internal","hostNetwork":true,"securityContext":{},"schedulerName":"default-scheduler","tolerations":[{"key":"node-role.kubernetes.io/master","effect":"NoSchedule"},{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":300},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}],"priority":0},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:31Z"},{"type":"Ready","status":"False","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:38Z"},{"type":"ContainersReady","status":"True","lastProbeTime":null,"lastTransitionTime":null},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2019-09-27T04:48:31Z"}],"message":"Node ip-problematic-master-node.aws-region-code.compute.internal which was running pod dns-controller-547884bc7f-jfj27 is unresponsive","reason":"NodeLost","hostIP":"<masternode-internal-ip>","podIP":"<masternode-internal-ip>","startTime":"2019-09-27T04:48:31Z","containerStatuses":[{"name":"dns-controller","state":{"running":{"startedAt":"2019-09-27T04:48:38Z"}},"lastState":{},"ready":true,"restartCount":0,"image":"kope/dns-controller:1.11.0","imageID":"docker-pullable://kope/dns-controller@sha256:8fe88797f537df25615899d3bdd380632746320bb559cb220b854dc26334adc4","containerID":"docker://9db09361b4bcb35acaf4742fc4858d787b53660fbf7c1d683e6eb42aadaeca57"}],"qosClass":"Burstable"}}
Note: The --ignore-daemonsets flag makes kubectl proceed with the drain even when DaemonSet-managed pods are present; those pods are ignored and left running on the node.
Force deleting the pods didn’t work:
ubuntu@localhost:~$ kubectl delete pod dns-controller-547884bc7f-jfj27 --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (NotFound): pods "dns-controller-547884bc7f-jfj27" not found
ubuntu@localhost:~$ kubectl delete pod etcd-server-events-ip-problematic-master-node.aws-region-code.compute.internal --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (NotFound): pods "etcd-server-events-ip-problematic-master-node.aws-region-code.compute.internal" not found
Root cause
After logging into the problematic master node, I found that /var/lib/docker had far too many directories inside it:
root@ip-problematic-master-node:/var/lib/docker/containers# ls | wc -l
15979
root@ip-problematic-master-node:/var/lib/docker/containers# cd ../overlay2/
root@ip-problematic-master-node:/var/lib/docker/overlay2# ls | wc -l
31986
This was not only filling up the disk space, it was exhausting the inodes as well. Compare this with the number of files on a healthy node:
root@ip-healthy-master-node:/var/lib/docker/containers# ls | wc -l
21
root@ip-healthy-master-node:/var/lib/docker/containers# cd ../overlay2/
root@ip-healthy-master-node:/var/lib/docker/overlay2# ls | wc -l
82
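If you suspect this kind of buildup, df can confirm both conditions; the -i flag reports inode usage instead of block usage:
df -h /var/lib/docker   # disk space usage of the filesystem holding the Docker data
df -i /var/lib/docker   # inode usage of the same filesystem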
So the root cause was too many leftover files under the Docker data directory. Why this was happening I was unable to figure out. Maybe detailed monitoring with continuous snapshots of the following commands would help:
docker ps
docker system df
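For example, a cron entry along the following lines would capture periodic snapshots for later comparison (the file path, schedule, and log location are arbitrary choices of mine, not from the original setup):
# /etc/cron.d/docker-snapshot (hypothetical file): log container and disk state every 10 minutes
*/10 * * * * root { date; /usr/bin/docker ps; /usr/bin/docker system df; } >> /var/log/docker-snapshots.log 2>&1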
What didn't help or work for me
With the inodes and disk filled up, none of the Docker commands such as "docker ps" or "docker help" would run.
You can try cleaning up /var/lib/docker manually and restarting the Docker daemon. This will fix the Docker daemon but not the master node. Rebooting the master node itself didn't fix it either. As per my understanding, once Kubernetes is on bad terms with Docker, it is hard to fix the relationship.
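For reference, such a manual cleanup would look roughly like the sketch below, assuming a systemd-based host. It destroys all local images, containers, and overlay2 layers, so only use it on a node you are prepared to lose:
systemctl stop docker    # stop the daemon so nothing holds the files open
rm -rf /var/lib/docker   # wipe all Docker state: containers, images, overlay2 layers
systemctl start docker   # restart with a fresh, empty data directory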
Solution
To solve this problem you have to terminate the problematic master node instance and create a new one. If the Kubernetes infrastructure is managed by kops, a new master node instance will be created automatically (via the AWS Auto Scaling group).
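If you prefer doing this by hand rather than through the AWS console, terminating the instance is a one-liner with the AWS CLI (the instance ID below is a placeholder):
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0   # replace with the problematic master's instance ID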
Proactive solution
To avoid this issue in the future, set up a daily cron job to prune the Docker system:
docker system prune --force
Note: Recent versions of docker system prune support a --filter option where you can specify something like "until=23h".
With that filter, the command cleans up all stopped containers, dangling images, and unused networks that were created more than 23 hours ago. You will have to set the frequency based on your requirements; for example, on development systems you might want to increase or decrease the time window.
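A minimal sketch of such a daily job, assuming an /etc/cron.d style entry (the file path, schedule, and log location are my choices, not from the original post):
# /etc/cron.d/docker-prune (hypothetical file): prune unused Docker data daily at 03:00
0 3 * * * root /usr/bin/docker system prune --force --filter "until=23h" >> /var/log/docker-prune.log 2>&1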
However, if your system already ends up in this shape, then from my experience terminating the instance is the only solution.
Hope this helps. Please do share your feedback.