Kubernetes is a platform that's constantly evolving. That's why I occasionally take a fresh look at features I haven't worked with for a few years. I recently went through this exercise with Horizontal Pod Autoscalers (HPAs) and noticed some features and limitations that are useful to know about, but aren't well documented or obvious.
A "gotcha" worth noting is that HPA's apiVersion: autoscaling/v2 implies that HPAs are a mature API. In reality, scaling based on custom and external metrics is still a bit of an untamed frontier that could use improvement. This is unfortunate, as custom metrics tend to be more accurate and useful than CPU- and RAM-based autoscaling.
Let's explore a few use cases where custom metrics improve scalability, and the tools to make it happen.
Use Cases
- Once custom metrics have been set up and presented in an HPA-friendly format, HPAs can scale based on multiple metrics. The Kubernetes docs give a partial example of this, where an HPA is configured to scale based on CPU utilization, packets per second and requests per second. The idea is that each metric can suggest a different number of desired replicas, such as 3, 5 and 8, and the HPA scales to the highest suggested count.
- The Kube Prometheus Stack Helm chart is one of the most popular and well-supported cloud-agnostic solutions for providing custom metrics. The stack contains several apps; Prometheus Adapter for Kubernetes Metrics APIs is the component that converts and publishes Prometheus metrics in a format that HPAs can understand and use.
- Incoming requests per second and request duration/latency are solid metrics for scaling web services.
- Architectures with queues or storage buckets, where objects to be processed are uploaded, can autoscale the services that process those objects based on the number of items waiting.
- Kubernetes Event-Driven Autoscaling (KEDA) has scalers that can interface with several queues (like Pub/Sub) and object storage buckets. It also supports running queries against metric, log, SQL and NoSQL databases to generate metrics.
- Variable minimum replicas can help balance the ability to support traffic spikes against cost. If you've ever supported an app whose replicas are slow to start, or one that experiences massive traffic spikes, you've probably run into the need to increase your minimum replicas to maintain quality of service.
- For example: If one replica can support 100 reqs/sec, it could be configured to autoscale at 50 reqs/sec to avoid errors or lag seen above 100 reqs/sec. Having a minimum of 10 replicas would support traffic spikes better than a minimum of two.
- The only problem with this technique is that costs go up. But it's reasonable to think an app might have semi-predictable traffic spikes: maybe a minimum of two replicas would work outside of normal business hours, a minimum of 10 during hours when spikes are common, and a minimum of five at other times.
- KEDA recently merged an option to plug multiple metrics into a user-customized math formula, creating composite metrics. So a formula like desired count = autoscaling metric + cron-based desired count can create the effect of a variable minimum replica count.
- Serverless and functions-as-a-service platforms hosted on Kubernetes, like the following:
- keda.sh
- knative.dev
- openfaas.com
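The multi-metric pattern from the first bullet can be sketched as a single autoscaling/v2 manifest. This is a minimal sketch modeled on the partial example in the Kubernetes docs; the Deployment name, Ingress name and target values are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app            # hypothetical workload name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  # Resource metric: served by metrics-server
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  # Pods metric: requires a custom metrics server (e.g., Prometheus Adapter)
  - type: Pods
    pods:
      metric:
        name: packets_per_second
      target:
        type: AverageValue
        averageValue: 1k
  # Object metric: measured on an Ingress rather than on the pods
  - type: Object
    object:
      metric:
        name: requests_per_second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: 10k
```

Each metric yields its own suggested replica count, and the HPA controller scales to the highest of the three.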
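To give a feel for what Prometheus Adapter does, here is a sketch of one of its discovery rules, which turn Prometheus series into named custom metrics the HPA can query. The series and label names are hypothetical; the rule structure follows the adapter's documented config format (under the Helm chart, rules like this typically live in the chart's values):

```yaml
# Expose a Prometheus counter as a per-pod custom metric:
# http_requests_total becomes "http_requests_per_second".
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

An HPA can then reference http_requests_per_second as a Pods-type metric.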
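The variable-minimum-replicas idea can also be sketched with KEDA, even without the composite-formula feature: KEDA (like the HPA itself) scales to the highest count any trigger suggests, so a cron trigger acts as a time-based floor while a load-based trigger handles spikes. Everything below — names, the Prometheus address, the query, the schedule — is a hypothetical illustration:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-app            # hypothetical
spec:
  scaleTargetRef:
    name: web-app          # hypothetical Deployment
  minReplicaCount: 2       # absolute floor outside business hours
  maxReplicaCount: 50
  triggers:
  # Load-based scaling: roughly one replica per 50 req/sec
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(rate(http_requests_total[2m]))
      threshold: "50"
  # Time-based floor: at least 10 replicas on weekdays, 8:00-18:00
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 8 * * 1-5
      end: 0 18 * * 1-5
      desiredReplicas: "10"
```

During business hours the cron trigger keeps at least 10 replicas running; at all other times the minimum falls back to 2 and the Prometheus trigger scales on load alone.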
Limitations and Consequences
All of the above sounds good, and we have the tooling to pull it off. So what's the problematic limitation that's preventing HPAs from being great?
The following limitation results in significant UX (user experience) consequences: https://github.com/kubernetes-sigs/custom-metrics-apiserver/issues/70
The limitation, summarized, is "there can only be one custom metrics server." If you install the popular kube-prometheus-stack with default settings, you'll get Prometheus Adapter. You then can't use KEDA's keda-operator-metrics-apiserver, Knative's Knative Pod Autoscaler, OpenFaaS Pro's autoscaler, Datadog's autoscaler or others, because they all work by hosting a custom metrics server. I suspect this is why KEDA has enough scalers to look like the poster child of feature creep: KEDA has over 60 scalers, including ones for Prometheus and Datadog. That redundancy doesn't make sense until you realize it partly exists to work around the limitation.
So why does this limitation degrade the UX? Let's start with an example of what good UX looks like.
The Kube Prometheus Stack and Helm are popular for a reason. They offer a great UX, and part of their secret sauce is following the concept of "convention over configuration." They can offer sensible default values and a default configuration in which hundreds to thousands of YAML objects are pre-wired together according to conventions, achieving a turnkey UX where a lot of stuff works out of the box.
A prerequisite for making that magical UX happen is owning a namespace that won't conflict with other things: you can then establish conventions in your namespace so components can be pre-wired together without handcrafted configuration.
Final Thoughts
Multiple proofs of concept (PoCs) have been made for this common problem, and a Kubernetes enhancement proposal was even created to solve it, but that fizzled out. It's my hope that this article shines a light on the problem and brings some renewed interest to custom metric scaling APIs.