04. November 2021

Basic Kubernetes Resource Management

In this blog post, I will show you the advantages of some standard Kubernetes concepts such as ResourceQuotas, LimitRanges and CPU/memory requests and limits, and how you can combine them effectively. You will also see why it’s important to use node-pressure eviction policies, and how the Prometheus monitoring stack can help you identify misconfigured workloads.

General

Philip Schmid

System Engineer

You just set up a Kubernetes cluster and are happy because everything runs smoothly and looks picture perfect? Well done, congratulations, you have achieved a first success in your cloud-native journey!

Unfortunately, that’s often just the beginning. There are quite a few essential day-2 operational tasks you need to take care of so that your Kubernetes journey lasts for years and doesn’t end early in a nightmare. More precisely, I’ll show you one possible way to manage CPU and memory resources properly. As you will see later, other resources can be managed similarly.

Resources for Containers

First, I would like to highlight the most important resource-related container settings: CPU/memory requests and limits. Yep, that’s right, “container settings”, not “pod settings”: resources must be defined at the container level. Here’s a simple example:

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"

Hint: The CPU unit m stands for “millicores”; 1 CPU/vCPU = 1000 millicores. CPU values can also be expressed as a decimal number (e.g. 0.1 = 100m).
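To make the units concrete, the following minimal Python sketch converts the quantity strings from the example above into plain numbers. The helper names cpu_to_millicores and memory_to_bytes are made up for this illustration, and only the suffixes used in this post are handled; Kubernetes itself accepts more formats.

```python
# Simplified quantity parsing for illustration only (hypothetical helpers).

BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def cpu_to_millicores(quantity):
    """Convert a CPU quantity such as "250m" or "0.1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_to_bytes(quantity):
    """Convert a memory quantity such as "64Mi" to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


print(cpu_to_millicores("250m"))  # 250
print(cpu_to_millicores("0.1"))   # 100
print(memory_to_bytes("64Mi"))    # 67108864
```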

As you can see, the resources dictionary has two child nodes, requests and limits:

- requests define the amount of CPU/memory resources a container is guaranteed to receive. When a pod with one or multiple containers is started, the Kubernetes scheduler looks for a node with sufficient free resources to host this pod and its container(s). If no such node is available, the pod won’t get scheduled and its status stays Pending.
- limits define the maximum amount of CPU/memory resources a container is allowed to use. The Kubernetes scheduler does not consider the limit values when looking for the best-fitting node. A node can therefore be overcommitted and need not be able to fulfill all limits at once.
  - If a container tries to use more CPU than its limit, its processes will be throttled.
  - If a container exceeds its memory limit, it will get OOM killed and restarted (subject to the pod’s restartPolicy).
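The difference between requests and limits during scheduling can be sketched in a few lines of Python. This is a deliberately simplified model that only looks at CPU requests; the real scheduler considers many more factors. The function name and the node size of 1000m are assumptions for this example.

```python
# Simplified model: only CPU requests decide whether a pod fits on the node.
# All names and the node size are hypothetical.

def fits_on_node(allocatable_m, scheduled_requests_m, new_request_m):
    """Return True if the new pod's CPU request still fits on the node."""
    return sum(scheduled_requests_m) + new_request_m <= allocatable_m


NODE_ALLOCATABLE_M = 1000       # assumed node with 1 allocatable CPU
REQUEST_M, LIMIT_M = 250, 500   # values from the container example above

scheduled = []
for _ in range(5):
    if fits_on_node(NODE_ALLOCATABLE_M, scheduled, REQUEST_M):
        scheduled.append(REQUEST_M)

pods_on_node = len(scheduled)             # 4 pods fit, the 5th would stay Pending
sum_of_limits_m = pods_on_node * LIMIT_M  # 2000m, twice what the node offers
print(pods_on_node, sum_of_limits_m)
```

The node is full from the scheduler’s point of view after four pods, although the sum of their limits is already twice the node’s capacity. That is the overcommitment mentioned above.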

Resources for Namespaces

We now know how to properly set CPU/memory request/limit sizes on containers, but what happens, for example, if developers did not set proper values, and their applications start eating up all free resources? Or if they were lazy and didn’t bother setting appropriate values at all? That’s where the Kubernetes platform engineering team needs to enforce some restrictions via LimitRanges and ResourceQuotas.

LimitRanges let you set default values for the requested resources. This way, you can ensure that every container gets CPU and/or memory request and/or limit values assigned, even if the developers did not specify them explicitly. If requests/limits are specified explicitly, they take priority over the defaults from the LimitRange.

Let’s have a look at an example which I would consider a good starting point for a production cluster. Please understand that these values need to be fine-tuned over time according to a certain workload’s behavior and the chosen node sizing.

apiVersion: v1
kind: LimitRange
metadata:
  name: default
  namespace: pitc-app-xy
spec:
  limits:
  - default:
      cpu: 100m
      memory: 256Mi
    defaultRequest:
      cpu: 20m
      memory: 16Mi
    type: Container

The official Kubernetes documentation describes even more interesting use cases of LimitRanges: https://kubernetes.io/docs/concepts/policy/limit-range/

Next, I would like to highlight ResourceQuotas, which, for example, let you limit the total compute resource usage of all workloads within a single namespace.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: default
  namespace: pitc-app-xy
spec:
  hard:
    requests.cpu: 1
    requests.memory: 2Gi
    limits.cpu: 4
    limits.memory: 8Gi

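With the definition shown above, Kubernetes refuses to start further workloads once a new pod would push the total memory requests inside the namespace pitc-app-xy above 2Gi. The API server rejects the new pod at admission time with an “exceeded quota” error, so the pod is never created; for a Deployment, the ReplicaSet reports a FailedCreate event instead.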
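The quota check itself boils down to comparing “used + requested” against “hard” for every entry of the quota. A minimal Python sketch, with the hard values from the ResourceQuota above expressed in millicores and Mi; the function name and the usage numbers are invented for this example:

```python
# Hard limits from the ResourceQuota above, in millicores and Mi.
HARD = {
    "requests.cpu": 1000,
    "requests.memory": 2048,
    "limits.cpu": 4000,
    "limits.memory": 8192,
}


def exceeded_resources(used, new_pod, hard=HARD):
    """Return the quota entries the new pod would push over the hard limit."""
    return sorted(
        name
        for name, limit in hard.items()
        if used.get(name, 0) + new_pod.get(name, 0) > limit
    )


# Assumed current usage of the namespace and the resources of a new pod.
used = {"requests.cpu": 400, "requests.memory": 2048,
        "limits.cpu": 2000, "limits.memory": 4096}
new_pod = {"requests.cpu": 20, "requests.memory": 16,
           "limits.cpu": 100, "limits.memory": 256}

print(exceeded_resources(used, new_pod))  # ['requests.memory']
```

A pod is only admitted if this list is empty. Here the memory requests are already at 2Gi, so even a small additional pod does not fit.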

Furthermore, I would like to show you another big benefit of ResourceQuotas: they can also limit the number of certain API objects such as pods, LoadBalancer services, PersistentVolumeClaims, etc. By doing so, you can for example mitigate the risk of a namespace consuming all available storage of a StorageClass or all IP addresses from an L4 LoadBalancer’s IP range (e.g. from MetalLB).

apiVersion: v1
kind: ResourceQuota
metadata:
  name: default
  namespace: pitc-app-xy
spec:
  hard:
    persistentvolumeclaims: "10"
    services.loadbalancers: "4"
    my-sc.storageclass.storage.k8s.io/requests.storage: 100Gi

Here’s an example of a verification command (its output is unrelated to the quotas defined above), which the developers themselves can use to check the current resource usage and hard limits:

$ kubectl describe resourcequota default -n pitc-app-xy
Name:            default
Namespace:       pitc-app-xy
Resource         Used   Hard
--------         ----   ----
limits.cpu       500m   8
limits.memory    512Mi  30Gi
pods             1      20
requests.cpu     10m    5
requests.memory  32Mi   20Gi

To summarize the use cases of LimitRanges and ResourceQuotas from the point of view of a Kubernetes platform engineering team:

- If developers forget to configure CPU/memory requests/limits, it won’t have that big of an impact on the cluster because the containers are automatically assigned the values from the defined LimitRange.
- If developers try to configure CPU/memory requests/limits that are too large, it won’t have that big of an impact on the cluster because the ResourceQuota limits the total usage of each resource per namespace.

Reserve Compute Resources for System / Kubernetes Daemons

Now that we’ve looked at the basics of healthy compute resource sizing for workloads, I would also like to show how to make a cluster more resilient, especially when the load on individual nodes becomes critical. As you probably know, Linux uses a component called the Out of Memory (OOM) killer to terminate processes in case of memory shortage. This could also hit critical processes like the kubelet, the SSH daemon, etc., so we should prevent a node from ever running critically low on available memory.

Luckily, Kubernetes offers two concepts called “Node Allocatable” and “Eviction Thresholds”. They help reserve resources for system components, or even free up space by evicting pods in case the total resource consumption surpasses a certain threshold:

- Allocatable: Amount of compute resources (CPU, memory & ephemeral-storage) that are available for pod execution. These values act as the upper bound for the scheduler: the sum of all containers’ requests from the pods running on this node can never exceed them.
- kube-reserved: Resource reservation for Kubernetes related system daemons like Kubelet, Kube-Proxy, container runtime, etc.
- system-reserved: Resource reservation for system related daemons like sshd, systemd-networkd, systemd-hostnamed, etc.
- eviction-threshold: Minimum amount of compute resources which should not be used by pods. This level acts as a threshold. When it’s surpassed (pods and/or Kubernetes/system components started using more and more resources), pods get evicted and (hopefully) rescheduled to other nodes (although there’s no guarantee that a suitable node is actually available).

The basic formula goes like this:

  [Allocatable] = [Available Node Resources] - [kube-reserved] - [system-reserved] - [eviction-hard]

The required configuration is set on the Kubelet itself. A small “cluster.yaml” configuration example from one of our Rancher RKE 1 clusters:

rancher_kubernetes_engine_config:
  services:
    kubelet:
      extra_args:
        eviction-hard: memory.available<500Mi
        kube-reserved: 'cpu=200m,memory=1Gi'
        system-reserved: 'cpu=200m,memory=1Gi'

On each node, we reserve 200m of CPU and 1Gi of memory for the Kubernetes daemons, and the same amounts again for the system daemons. Furthermore, we instruct the Kubelet to start “hard” evicting pods (without any grace period) from the node as soon as only 500Mi of memory is left before the pods would need to take memory from the *-reserved reservations. If the Kubelet can’t evict pods fast enough and the pods start using the *-reserved space, pods start getting OOM killed.

Let’s have a look at an example, assuming the node has 32Gi of memory: with the configuration above, 32Gi - 1Gi - 1Gi - 500Mi = 30220Mi (roughly 29.5Gi) remain allocatable for pods.
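Plugging the values from the configuration above into the formula gives the allocatable memory of such a 32Gi node. A minimal Python sketch of this arithmetic (the function name is made up for illustration, and all values are in Mi):

```python
MI_PER_GI = 1024


def allocatable_memory_mi(capacity_mi, kube_reserved_mi,
                          system_reserved_mi, eviction_hard_mi):
    """[Allocatable] = [capacity] - [kube-reserved] - [system-reserved] - [eviction-hard]"""
    return capacity_mi - kube_reserved_mi - system_reserved_mi - eviction_hard_mi


allocatable = allocatable_memory_mi(
    capacity_mi=32 * MI_PER_GI,        # assumed node size
    kube_reserved_mi=1 * MI_PER_GI,    # kube-reserved: memory=1Gi
    system_reserved_mi=1 * MI_PER_GI,  # system-reserved: memory=1Gi
    eviction_hard_mi=500,              # eviction-hard: memory.available<500Mi
)

print(f"{allocatable}Mi = {allocatable / MI_PER_GI:.2f}Gi")  # 30220Mi = 29.51Gi
```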

Pods which are evicted from a node are replaced by their controller (e.g. a Deployment), and the Kubernetes scheduler places the replacement pods (hopefully) on a different node with sufficient free resources.

Please keep in mind that the example values shown above might need to be adjusted depending on your node sizing and the system daemons running on your nodes. Also be aware that changing these values requires a restart of the Kubelet (workloads will be rescheduled, which means downtime!).

Resource Management & Prometheus Stack

In this final section, I would like to show how the Prometheus monitoring stack can help to track the compute resource usage over time, and how to further improve the configured request/limit size values.

Visibility

You have now learned why it’s important to limit the usable compute resources at different levels (container, namespace & node). That’s neat and helpful, but how do you track the actual usage of compute resources compared to the chosen request and limit sizes? Well, that’s exactly where the Prometheus monitoring stack, and especially Grafana, comes into play. These technologies let you build custom dashboards or simply import existing ones that visualize this data. Here’s an example of the “Capacity Monitoring Cluster” dashboard (JSON definition):

[Screenshot: example of the “Capacity Monitoring Cluster” Grafana dashboard]

- The six percentage values at the top show the actual CPU/memory usage, and how much of the available CPU/memory resources are already covered by requests and limits. Please keep in mind that an overcommitment of CPU limits is often not a problem; it depends on the specific type of workload.
- The table in the middle shows a per-namespace view of the 95th percentile of used CPU resources, the total request/limit sizes of the applications running inside this namespace, and a relative comparison between these two values. Values in some columns are highlighted in different colors, so it’s easier to spot values that should be adjusted.
- The table at the bottom shows basically the same information as the one in the middle, but for memory instead of CPU resources.

When having a look at this dashboard from time to time, you will see which namespaces waste too many resources, or which ones would need a higher ResourceQuota. To get an even more detailed view about the applications within a certain namespace, click on the namespace name (“Drill down”), and you will be redirected to the “Compute Resources / Namespace (Pods)” dashboard that comes with the kube-prometheus-stack Helm chart by default.

This is merely a single, yet very helpful dashboard example from the kubernetes-mixin repository. I highly recommend having a look at the others as well.

Alerting

Last but not least, I would also like to mention that the kube-prometheus-stack Helm chart ships with some very handy predefined Prometheus alert rules. There are for example the “KubeCPUOvercommit” and “KubeMemoryOvercommit” alerts, which both trigger if the sum of the respective resource requests is greater than the total allocatable capacity of the cluster minus its largest node (for resiliency in the case of a single node failure).

- alert: KubeCPUOvercommit
  annotations:
    description: Cluster has overcommitted CPU resource requests for Pods by {{ $value }} CPU shares and cannot tolerate node failure.
    runbook_url: 
    summary: Cluster has overcommitted CPU resource requests.
  expr: |
    sum(namespace_cpu:kube_pod_container_resource_requests:sum{}) - (sum(kube_node_status_allocatable{resource="cpu"}) - max(kube_node_status_allocatable{resource="cpu"})) > 0
    and
    (sum(kube_node_status_allocatable{resource="cpu"}) - max(kube_node_status_allocatable{resource="cpu"})) > 0
  for: 10m
  labels:
    severity: warning
- alert: KubeMemoryOvercommit
  annotations:
    description: Cluster has overcommitted memory resource requests for Pods by {{ $value }} bytes and cannot tolerate node failure.
    runbook_url: 
    summary: Cluster has overcommitted memory resource requests.
  expr: |
    sum(namespace_memory:kube_pod_container_resource_requests:sum{}) - (sum(kube_node_status_allocatable{resource="memory"}) - max(kube_node_status_allocatable{resource="memory"})) > 0
    and
    (sum(kube_node_status_allocatable{resource="memory"}) - max(kube_node_status_allocatable{resource="memory"})) > 0
  for: 10m
  labels:
    severity: warning
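The logic of these two expressions can be replayed with a few numbers. The following Python sketch mirrors the KubeCPUOvercommit expression for a list of per-node allocatable values; the function name and the cluster sizes are assumptions for this example, and the “for: 10m” duration is ignored.

```python
def cpu_overcommitted(total_requests, node_allocatable):
    """Mirror the KubeCPUOvercommit expression for per-node CPU values."""
    # Capacity that remains if the largest node fails: sum(...) - max(...)
    capacity_without_largest = sum(node_allocatable) - max(node_allocatable)
    return (total_requests - capacity_without_largest > 0
            and capacity_without_largest > 0)


nodes = [4, 4, 4]  # assumed: three nodes with 4 allocatable CPUs each

print(cpu_overcommitted(9, nodes))  # True: 9 CPUs requested, only 8 left without one node
print(cpu_overcommitted(7, nodes))  # False: requests still fit on two nodes
print(cpu_overcommitted(5, [4]))    # False: single-node cluster, second condition fails
```

The second condition keeps the alert quiet on single-node clusters, where losing the only node can never be tolerated anyway.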

Apart from these two cluster-level alerts, there are also three namespace-based alerts that trigger as soon as the usage of a namespace’s ResourceQuota reaches a critical level: “KubeQuotaAlmostFull”, “KubeQuotaFullyUsed” and “KubeQuotaExceeded”.

- alert: KubeQuotaAlmostFull
  annotations:
    description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.
    runbook_url: 
    summary: Namespace quota is going to be full.
  expr: |
    kube_resourcequota{job="kube-state-metrics", type="used"}
      / ignoring(instance, job, type)
    (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
      > 0.9 < 1
  for: 15m
  labels:
    severity: info
- alert: KubeQuotaFullyUsed
  annotations:
    description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.
    runbook_url: 
    summary: Namespace quota is fully used.
  expr: |
    kube_resourcequota{job="kube-state-metrics", type="used"}
      / ignoring(instance, job, type)
    (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
      == 1
  for: 15m
  labels:
    severity: info
- alert: KubeQuotaExceeded
  annotations:
    description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota.
    runbook_url: 
    summary: Namespace quota has exceeded the limits.
  expr: |
    kube_resourcequota{job="kube-state-metrics", type="used"}
      / ignoring(instance, job, type)
    (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
      > 1
  for: 15m
  labels:
    severity: warning
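The three rules differ only in the comparison applied to the used/hard ratio. A minimal Python sketch of that classification (the function name is made up, and the “for: 15m” duration is ignored):

```python
def quota_alert(used, hard):
    """Return the name of the alert matching a used/hard ratio, or None."""
    if hard <= 0:  # the rules only consider quotas with hard > 0
        return None
    ratio = used / hard
    if ratio > 1:
        return "KubeQuotaExceeded"
    if ratio == 1:
        return "KubeQuotaFullyUsed"
    if ratio > 0.9:
        return "KubeQuotaAlmostFull"
    return None


print(quota_alert(0.5, 8))  # None (limits.cpu 500m of 8 from the kubectl output above)
print(quota_alert(19, 20))  # KubeQuotaAlmostFull
print(quota_alert(20, 20))  # KubeQuotaFullyUsed
print(quota_alert(21, 20))  # KubeQuotaExceeded
```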
