Earlier this week I posted the first five of the most common reasons Kubernetes Deployments fail. Here's the rest of the list - including some of the most frustrating!
6. Resource Quotas
Similar to resource limits, Kubernetes also allows admins to set Resource Quotas per namespace. These quotas place hard limits on resources such as the number of Pods, Deployments, and PersistentVolumes, as well as total CPU, memory, and more.
Let's see what happens when we exceed a Resource Quota. Here's our example Deployment again:
# test-quota.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gateway-quota
spec:
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: test-container
          image: nginx
We can create it with kubectl create -f test-quota.yaml, then inspect our Pods.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
gateway-quota-551394438-pix5d 1/1 Running 0 16s
Looks good! Now let's scale up to three replicas with kubectl scale deploy/gateway-quota --replicas=3, and then inspect our Pods again.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
gateway-quota-551394438-pix5d 1/1 Running 0 9m
Huh? Where are our pods? Let's inspect the Deployment.
$ kubectl describe deploy/gateway-quota
Name: gateway-quota
Namespace: fail
CreationTimestamp: Sat, 11 Feb 2017 16:33:16 -0500
Labels: app=gateway
Selector: app=gateway
Replicas: 1 updated | 3 total | 1 available | 2 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
OldReplicaSets:
NewReplicaSet: gateway-quota-551394438 (1/3 replicas created)
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
9m 9m 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set gateway-quota-551394438 to 1
5m 5m 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set gateway-quota-551394438 to 3
In the last line, we can see that the ReplicaSet was told to scale to 3. Let's inspect the ReplicaSet using describe to learn more.
$ kubectl describe replicaset gateway-quota-551394438
Name: gateway-quota-551394438
Namespace: fail
Image(s): nginx
Selector: app=gateway,pod-template-hash=551394438
Labels: app=gateway
pod-template-hash=551394438
Replicas: 1 current / 3 desired
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
11m 11m 1 {replicaset-controller } Normal SuccessfulCreate Created pod: gateway-quota-551394438-pix5d
11m 30s 33 {replicaset-controller } Warning FailedCreate Error creating: pods "gateway-quota-551394438-" is forbidden: exceeded quota: compute-resources, requested: pods=1, used: pods=1, limited: pods=1
Aha! Our ReplicaSet wasn't able to create any more Pods due to the quota: exceeded quota: compute-resources, requested: pods=1, used: pods=1, limited: pods=1. Similar to Resource Limits, we have three options:
- Ask your cluster admin to increase the Quota for this namespace
- Delete or scale back other Deployments in this namespace
- Go rogue and edit the Quota
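For context, a ResourceQuota that would produce the failure above might look like this (a hypothetical sketch - the name compute-resources and the pods limit are taken from the error message, not from an actual manifest):

```yaml
# quota.yaml - hypothetical quota matching the error above
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: fail
spec:
  hard:
    pods: "1"
```

Running kubectl describe quota in the namespace shows current usage against each limit, which is a quick way to see how much headroom you have before hitting this wall.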
7. Insufficient Cluster Resources
Unless your cluster administrator has wired up the cluster-autoscaler, chances are that someday you will run out of CPU or Memory resources in your cluster.
That's not to say that CPU & Memory are fully utilized - just that they have been fully accounted for by the Kubernetes Scheduler. As we saw in #5, Cluster Administrators can limit the amount of CPU or memory a developer can request to be allocated to a Pod or container. Wise administrators will also set a default CPU/Memory request that will be applied if you (the developer) don't request anything.
If you do all your work in the default namespace, you probably have a default Container CPU Request of 100m and you don't even know it! Check yours by running kubectl describe ns default.
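That default usually comes from a LimitRange object in the namespace. A sketch of one that applies a 100m default CPU request to every container (the object name here is an assumption):

```yaml
# limit-range.yaml - hypothetical LimitRange providing a default CPU request
apiVersion: v1
kind: LimitRange
metadata:
  name: limits
  namespace: default
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
```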
Let's say you have a Kubernetes cluster with one Node that has 1 CPU. Your Kubernetes cluster has 1000m of available CPU to schedule.
Ignoring other system Pods (kubectl -n kube-system get pods) for the moment, you will be able to deploy 10 Pods (with 1 container each at 100m) to your single-Node cluster.
10 Pods * (1 Container * 100m) = 1000m == Cluster CPUs
So what happens when you turn it up to 11?
Here's an example Deployment that has a CPU request of 1 CPU (1000m).
# cpu-scale.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cpu-scale
spec:
  template:
    metadata:
      labels:
        app: cpu-scale
    spec:
      containers:
        - name: test-container
          image: nginx
          resources:
            requests:
              cpu: 1
I'm deploying this application to a cluster that has 2 total CPUs available. In addition to my cpu-scale application, the Kubernetes internal services are also consuming CPU/Memory requests.
So we can deploy this with kubectl create -f cpu-scale.yaml, and then inspect the Pods:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
cpu-scale-908056305-xstti 1/1 Running 0 5m
So the first Pod was scheduled and is running. Let's see what happens when we scale up by one:
$ kubectl scale deploy/cpu-scale --replicas=2
deployment "cpu-scale" scaled
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
cpu-scale-908056305-phb4j 0/1 Pending 0 4m
cpu-scale-908056305-xstti 1/1 Running 0 5m
Uh oh. Our second Pod is stuck with a status of Pending. We can describe that second Pod for more information:
$ kubectl describe pod cpu-scale-908056305-phb4j
Name: cpu-scale-908056305-phb4j
Namespace: fail
Node: gke-ctm-1-sysdig2-35e99c16-qwds/10.128.0.4
Start Time: Sun, 12 Feb 2017 08:57:51 -0500
Labels: app=cpu-scale
pod-template-hash=908056305
Status: Pending
IP:
Controllers: ReplicaSet/cpu-scale-908056305
[...]
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
3m 3m 1 {default-scheduler } Warning FailedScheduling pod (cpu-scale-908056305-phb4j) failed to fit in any node
fit failure on node (gke-ctm-1-sysdig2-35e99c16-wx0s): Insufficient cpu
fit failure on node (gke-ctm-1-sysdig2-35e99c16-tgfm): Insufficient cpu
fit failure on node (gke-ctm-1-sysdig2-35e99c16-qwds): Insufficient cpu
Alright! So the Events block tells us that the Kubernetes scheduler (default-scheduler) was unable to schedule this Pod because it failed to fit on any Node. It even tells us which resource was lacking (Insufficient cpu) on each Node.
So how do we resolve this? Well, if you've been too eager with the size of your Requested CPU/Memory, you could reduce the request size and re-deploy. Alternatively, you could kindly ask your Cluster admin to scale up the cluster (chances are you're not the only one running into this problem).
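For example, trimming the request in our cpu-scale Deployment might look like this (500m is an arbitrary illustration - pick a value that reflects your app's actual usage):

```yaml
# cpu-scale.yaml (excerpt) - a reduced CPU request
resources:
  requests:
    cpu: 500m  # was: 1 (i.e. 1000m)
```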
Now, you might be thinking to yourself: "Our Kubernetes Nodes are in auto-scaling groups with our Cloud Provider. Why aren't they working?"
The answer is that your cloud provider doesn't have any insight into what the Kubernetes Scheduler is doing. Leveraging the Kubernetes cluster-autoscaler will allow your Cluster to resize itself based on the Scheduler's requirements. If you're using Google Container Engine, the cluster-autoscaler is a Beta feature.
8. PersistentVolume fails to mount
Another common error is trying to create a Deployment that references PersistentVolumes that don't exist. Whether you're using PersistentVolumeClaims (which you should be!) or just directly accessing a PersistentDisk, the end result is very similar.
Here's our test Deployment that is trying to use a GCE PersistentDisk named my-data-disk.
# volume-test.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: volume-test
spec:
  template:
    metadata:
      labels:
        app: volume-test
    spec:
      containers:
        - name: test-container
          image: nginx
          volumeMounts:
            - mountPath: /test
              name: test-volume
      volumes:
        - name: test-volume
          # This GCE PD must already exist (oops!)
          gcePersistentDisk:
            pdName: my-data-disk
            fsType: ext4
Let's create this Deployment with kubectl create -f volume-test.yaml and check the Pods after a few minutes.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
volume-test-3922807804-33nux 0/1 ContainerCreating 0 3m
Three minutes is a long time to wait for a Container to be created. Let's inspect the Pod with describe and see what's happening under the hood:
$ kubectl describe pod volume-test-3922807804-33nux
Name: volume-test-3922807804-33nux
Namespace: fail
Node: gke-ctm-1-sysdig2-35e99c16-qwds/10.128.0.4
Start Time: Sun, 12 Feb 2017 09:24:50 -0500
Labels: app=volume-test
pod-template-hash=3922807804
Status: Pending
IP:
Controllers: ReplicaSet/volume-test-3922807804
[...]
Volumes:
test-volume:
Type: GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
PDName: my-data-disk
FSType: ext4
Partition: 0
ReadOnly: false
[...]
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
4m 4m 1 {default-scheduler } Normal Scheduled Successfully assigned volume-test-3922807804-33nux to gke-ctm-1-sysdig2-35e99c16-qwds
1m 1m 1 {kubelet gke-ctm-1-sysdig2-35e99c16-qwds} Warning FailedMount Unable to mount volumes for pod "volume-test-3922807804-33nux_fail(e2180d94-f12e-11e6-bd01-42010af0012c)": timeout expired waiting for volumes to attach/mount for pod "volume-test-3922807804-33nux"/"fail". list of unattached/unmounted volumes=[test-volume]
1m 1m 1 {kubelet gke-ctm-1-sysdig2-35e99c16-qwds} Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "volume-test-3922807804-33nux"/"fail". list of unattached/unmounted volumes=[test-volume]
3m 50s 3 {controller-manager } Warning FailedMount Failed to attach volume "test-volume" on node "gke-ctm-1-sysdig2-35e99c16-qwds" with: GCE persistent disk not found: diskName="my-data-disk" zone="us-central1-a"
Surprise! The Events section holds the hidden clues we were looking for. Our Pod was correctly scheduled to a Node (Successfully assigned volume-test-3922807804-33nux to gke-ctm-1-sysdig2-35e99c16-qwds), but then the Kubelet on that Node was unable to mount the expected volume, test-volume. That Volume would have been created when the PersistentDisk was attached to the Node, but, as we see further down, the controller-manager failed: Failed to attach volume "test-volume" on node "gke-ctm-1-sysdig2-35e99c16-qwds" with: GCE persistent disk not found: diskName="my-data-disk" zone="us-central1-a".
This last message is pretty clear: to resolve the issue, we need to create a persistent disk in GCE named my-data-disk in the zone us-central1-a. Once that disk exists, the controller-manager will attach the disk and kickstart the Container creation process.
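Since this cluster is running on GKE, creating the missing disk might look like this (a sketch assuming the gcloud CLI is configured for the right project; the disk name and zone come straight from the error message, while the size is an arbitrary choice):

```shell
# Create the GCE persistent disk the Deployment expects
gcloud compute disks create my-data-disk \
  --zone us-central1-a \
  --size 10GB
```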
9. Validation Errors
Few things are more frustrating than watching an entire build-test-deploy job get all the way to the deploy step, only to fail due to invalid Kubernetes Spec objects.
You may have gotten an error like this before:
$ kubectl create -f test-application.deploy.yaml
error: error validating "test-application.deploy.yaml": error validating data: found invalid field resources for v1.PodSpec; if you choose to ignore these errors, turn validation off with --validate=false
In this example, I tried creating the following Kubernetes Deployment:
# test-application.deploy.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: test-app
spec:
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - image: nginx
          name: nginx
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 100Mi
At first glance, this YAML looks fine - but the error message proves to be helpful. The error says that it found invalid field resources for v1.PodSpec. Upon deeper inspection of the v1.PodSpec, we can see that the resources object is (incorrectly) a child of v1.PodSpec. It should be a child of v1.Container. After indenting the resources object one level, this Deployment works just fine!
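For reference, the working version nests resources under the container entry:

```yaml
# test-application.deploy.yaml (fixed excerpt)
spec:
  containers:
    - image: nginx
      name: nginx
      resources:  # now a child of v1.Container
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 100Mi
```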
In addition to looking out for indentation mistakes, another common error is a typo in an object name (e.g. peristentVolumeClaim vs persistentVolumeClaim). That one briefly tripped up a senior engineer and me when we were in a hurry.
To help catch these errors early, I recommend adding some verification steps to your pre-commit hooks or to the test phase of your build.
For example, you can:
- Validate your YAML with python -c 'import yaml,sys;yaml.safe_load(sys.stdin)' < test-application.deploy.yaml
- Validate your Kubernetes API objects using the --dry-run flag, like this: kubectl create -f test-application.deploy.yaml --dry-run --validate=true
Important note: this mechanism for validating Kubernetes objects leverages server-side validation, which means that kubectl must have a working Kubernetes cluster to communicate with. Unfortunately, there currently is no client-side validation option for kubectl, but there are open issues (kubernetes/kubernetes #29410 and kubernetes/kubernetes #11488) tracking that missing feature.
10. Container Image Not Updating
Most people I've talked to who have worked with Kubernetes have run into this problem, and it's a real kicker.
The story goes something like this:
1. Create a Deployment using an image tag (e.g. rosskukulinski/myapplication:v1)
2. Notice that there's a bug in myapplication
3. Build a new image and push to the same tag (rosskukulinski/myapplication:v1)
4. Delete any myapplication Pods, and watch new ones get created by the Deployment
5. Realize that the bug is still present
6. Repeat steps 3-5 until you pull your hair out
The problem relates to how Kubernetes decides whether to do a docker pull when starting a container in a Pod.
In the v1.Container specification there's an option called ImagePullPolicy:
Image pull policy. One of Always, Never, IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise.
Since we tagged our image as :v1, the default pull policy is IfNotPresent. The Kubelet already has a local copy of rosskukulinski/myapplication:v1, so it doesn't attempt to do a docker pull. When the new Pods come up, they're still using the old, broken Docker image.
There are three ways to resolve this:
- Switch to using :latest (DO NOT DO THIS!)
- Specify imagePullPolicy: Always in your Deployment.
- Use unique tags (e.g. based on your source control commit id)
During development, or if I'm quickly prototyping something, I will specify imagePullPolicy: Always so that I can build and push container images with the same tag.
However, in all of my production deployments I use unique tags based on the Git SHA-1 of the commit used to build that image. This makes it trivial to identify and check-out the source code that's running in production for any deployed application.
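A minimal sketch of that tagging flow, reusing the image name from the example above (your registry and CI setup will differ):

```shell
# Tag each build with the commit SHA-1 so every image tag is unique
GIT_SHA=$(git rev-parse --short HEAD)
docker build -t rosskukulinski/myapplication:"${GIT_SHA}" .
docker push rosskukulinski/myapplication:"${GIT_SHA}"
```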
Summary
Phew! That's a lot of things to watch out for. By now, you should be a pro at identifying, debugging, and fixing failed Kubernetes Deployments.
In general, most of the common deployment failures can be debugged using these commands:
kubectl describe deployment/<deployname>
kubectl describe replicaset/<rsname>
kubectl get pods
kubectl describe pod/<podname>
kubectl logs <podname> --previous
In the quest to automate myself out of a job, I created a bash script that runs anytime a CI/CD deployment fails. Helpful Kubernetes information will show up in the Jenkins/CircleCI/etc build output so that developers can quickly find any obvious problems.
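I won't reproduce my exact script, but a minimal sketch of the idea might look like this - it assumes Pods are labeled app=&lt;deployment name&gt;, as in the examples above:

```shell
#!/bin/bash
# debug-deploy.sh - dump useful state for a failed Deployment (sketch)
# Usage: ./debug-deploy.sh <namespace> <deployment-name>
NS="$1"
DEPLOY="$2"

kubectl -n "$NS" describe deployment "$DEPLOY"
kubectl -n "$NS" get pods -l app="$DEPLOY"

# Describe each Pod and grab logs from the previous (crashed) container, if any
for POD in $(kubectl -n "$NS" get pods -l app="$DEPLOY" -o name); do
  kubectl -n "$NS" describe "$POD"
  kubectl -n "$NS" logs "${POD#pod/}" --previous || true
done
```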
I hope you have enjoyed these two posts!
How have you seen Kubernetes Deployments fail? Any other troubleshooting tips to share? Leave a comment!