Predictive Horizontal Pod Autoscaling
This guide explains how to set up Predictive Horizontal Pod Autoscaling in Kubernetes using Thoras, including the difference between recommendation and autonomous modes, and how to scale using external or custom metrics.
What is Traditional Horizontal Pod Autoscaling?
Horizontal Pod Autoscaling (HPA) is a Kubernetes feature that adjusts the number of pods in a Deployment, StatefulSet, or other scalable resource based on specified metrics, which can include built-in metrics (CPU, memory) or external/custom metrics (e.g. queue length, latency). HPA increases or decreases the number of pods after metrics cross pre-defined thresholds.
Limitations of Traditional HPA:
While traditional HPA is powerful, it only reacts to spikes in usage after they happen. This means your application may scale up too late to handle sudden surges in traffic, forcing you to keep extra capacity running “just in case”, which can be costly and inefficient.
Another common shortcoming of traditional HPA is that it is prone to rubber-banding if not carefully managed: for example, a scale-down quickly followed by a scale-up, a pattern of pod churn that hurts both reliability and cost.
How Thoras Improves Autoscaling:
Thoras brings Predictive Horizontal Pod Autoscaling to Kubernetes. Instead of waiting for a spike to occur, Thoras analyzes trends and forecasts future demand, scaling your pods before the spike even begins. This proactive approach gives workloads greater resilience and allows running at much higher utilization targets, eliminating the need to maintain a large buffer of spare capacity “just in case”. Ultimately, this can significantly reduce cloud waste while maximizing resilience to increases in usage.
In this guide, we will walk you through how to set up Predictive Horizontal Pod Autoscaling with Thoras so your pods can scale ahead of demand spikes—giving you both efficiency and peace of mind.
Thoras Scaling Modes: Recommendation vs. Autonomous
Thoras supports two modes for managing scaling:
Recommendation Mode
- Thoras analyzes your workload usage, makes accurate predictions of your workload’s future usage, and suggests the optimal number of pods based on those predictions.
- You (or your automation) can review and apply these recommendations.
Autonomous Mode
- Thoras actively manages horizontal scaling for your workload by giving your HPA the ability to scale on future usage in addition to real-time usage. The number of desired pods for any given scaling decision is always the higher of the two suggestions: real-time (provided by the HPA) and future-time (provided by Thoras).
Setting Up Predictive HPA with Thoras
Prerequisites
- A running Kubernetes cluster with Kubernetes metrics server installed
- Thoras installed and running (see quickstart guide)
1. Create an HPA (if not already present)
An existing HPA is required for each target deployment; Thoras turbo-charges this HPA.
Check whether an HPA exists
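For instance, you can list any existing HPAs in a namespace with:

```bash
kubectl get hpa -n <namespace>
```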
For Thoras to work autonomously, you must have an HPA defined for that workload. However, if you just want to view recommendations, Thoras can do so without an HPA by assuming you want suggestions that would scale you to a target of 80% utilization. Example defining an HPA for an `AIScaleTarget`:
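A minimal HPA manifest of the kind this step expects might look like the following. The workload name `my-app`, the replica bounds, and the 80% CPU target are illustrative placeholders, not values required by Thoras:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # placeholder workload name
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```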
2. Register an AIScaleTarget
See Configuring AIScaleTargets for detailed steps on creating and applying an `AIScaleTarget` manifest.
3. Set the Horizontal Mode [ recommendation | autonomous ]
Recommendation Mode
`AIScaleTarget`s should start in `recommendation` mode while the models learn usage and scaling patterns. To enable `recommendation` mode, simply set the `AIScaleTarget`'s `spec.horizontal.mode` to `recommendation`. For example, in my-ast.yaml:
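A sketch of the relevant part of my-ast.yaml. Apart from `spec.horizontal.mode`, which this guide specifies, the API group/version, resource name, and target reference are assumptions about the `AIScaleTarget` schema:

```yaml
apiVersion: thoras.ai/v1        # assumed API group/version
kind: AIScaleTarget
metadata:
  name: my-app-ast              # placeholder name
spec:
  target:
    kind: Deployment            # assumed target reference fields
    name: my-app
  horizontal:
    mode: recommendation        # models learn before acting
```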
Apply the update:
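The update can be applied with kubectl, e.g.:

```bash
kubectl apply -f my-ast.yaml
```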
Autonomous Mode
After the models are sufficiently trained in `recommendation` mode, the mode may be switched to `autonomous`, where scaling recommendations are applied automatically. To enable `autonomous` mode, simply set the `AIScaleTarget`'s `spec.horizontal.mode` to `autonomous`. For example, in my-ast.yaml:
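The same sketch as before, with only the mode changed; fields other than `spec.horizontal.mode` remain assumptions about the `AIScaleTarget` schema:

```yaml
apiVersion: thoras.ai/v1        # assumed API group/version
kind: AIScaleTarget
metadata:
  name: my-app-ast              # placeholder name
spec:
  horizontal:
    mode: autonomous            # recommendations now applied automatically
```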
Apply the update:
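The change can be applied the same way, e.g.:

```bash
kubectl apply -f my-ast.yaml
```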
Using External or Custom Metrics
By default, HPA can scale on CPU and memory. Thoras and Kubernetes also support scaling on external or custom metrics (e.g., queue write rate, request rate). Note that only metrics independent of pod count should be used for Thoras scaling (e.g., queue length is not independent of pod count, because more pods drain the queue faster).
Example: Scaling on an External Metric
Add an external metric to your `AIScaleTarget`:
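A sketch of what such a metric entry might look like. The metric name `queue_writes_per_second`, the target value, and the exact placement of the `metrics` list within the `AIScaleTarget` spec are illustrative assumptions, mirroring the HPA v2 external-metric shape that the fields described here follow:

```yaml
metrics:
  - type: External
    external:
      metric:
        name: queue_writes_per_second   # must match your metrics adapter
      target:
        type: AverageValue
        averageValue: "100"             # placeholder target value
```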
- `type: External` tells HPA to use an external metric.
- `external.metric.name` is the name of the metric (must match what your metrics adapter provides).
- `external.target.averageValue` is the target value for scaling.
Verifying HPA and Thoras Integration
- Check HPA status:
- Check `AIScaleTarget` status:
- Monitor pod scaling events:
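Example commands for these checks; the `aiscaletargets` resource name is an assumption about how the Thoras CRD is registered in your cluster:

```bash
# Check HPA status
kubectl get hpa -n <namespace>

# Check AIScaleTarget status (assumes the CRD resource is named "aiscaletargets")
kubectl get aiscaletargets -n <namespace>

# Monitor pod scaling events
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```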