This guide explains how to set up Predictive Horizontal Pod Autoscaling in Kubernetes using Thoras, including the difference between recommendation and autonomous modes, and how to scale using external or custom metrics.

What is Traditional Horizontal Pod Autoscaling?

Horizontal Pod Autoscaling (HPA) is a Kubernetes feature that adjusts the number of pods in a Deployment, StatefulSet, or other scalable resource based on specified metrics, which can include built-in metrics (CPU, memory) or external/custom metrics (e.g., queue length, latency). HPA increases or decreases the number of pods when metrics cross pre-defined thresholds.
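
Concretely, the HPA controller computes the desired replica count from the ratio of the observed metric value to its target (this is the documented Kubernetes scaling algorithm):

desiredReplicas = ceil( currentReplicas * currentMetricValue / targetMetricValue )

For example, 4 pods running at 90% CPU utilization against an 80% target yields ceil(4 * 90 / 80) = 5 pods.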

Limitations of Traditional HPA:

While traditional HPA is powerful, it only reacts to spikes in usage after they happen. This means your application may scale up too late to handle sudden surges in traffic, forcing you to keep extra capacity running “just in case”, which can be costly and inefficient.

Another common shortcoming of traditional HPA is that it is prone to rubber-banding if not carefully managed: a scale-down quickly followed by a scale back up, which can hurt both reliability and cost through pod churn.
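
Plain Kubernetes offers a partial mitigation through the HPA behavior field in the autoscaling/v2 API, which delays scale-downs until demand has stayed lower for a configurable window; the 5-minute window below is illustrative:

# Partial HPA spec: require 5 minutes of consistently lower
# demand before scaling down, damping rubber-banding.
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300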

How Thoras Improves Autoscaling:

Thoras brings Predictive Horizontal Pod Autoscaling to Kubernetes. Instead of waiting for a spike to occur, Thoras analyzes trends and forecasts future demand, scaling your pods before the spike even begins. This proactive approach gives workloads greater resilience and allows running at much higher utilization targets, eliminating the need to maintain a large buffer of spare capacity “just in case”. Ultimately, this can significantly reduce cloud waste while maximizing resilience to increases in usage.

In this guide, we will walk you through how to set up Predictive Horizontal Pod Autoscaling with Thoras so your pods can scale ahead of demand spikes—giving you both efficiency and peace of mind.

Thoras Scaling Modes: Recommendation vs. Autonomous

Thoras supports two modes for managing scaling:

Recommendation Mode

  • Thoras analyzes your workload usage, makes accurate predictions of your workload’s future usage, and suggests the optimal number of pods based on those predictions.
  • You (or your automation) can review and apply these recommendations.

Autonomous Mode

  • Thoras actively manages horizontal scaling for your workload by giving your HPA the ability to scale on future usage in addition to real-time usage. For any given scaling decision, the desired pod count is always the higher of the two suggestions: real-time (provided by HPA) and future-time (provided by Thoras). For example, if real-time metrics call for 4 pods but the forecast calls for 7, the workload scales to 7.

Setting Up Predictive HPA with Thoras

Prerequisites

1. Create an HPA (if not already present)

A single existing HPA is required for each target deployment; Thoras then turbo-charges it.

Check whether an HPA exists

kubectl get hpa -n <namespace>

For Thoras to work autonomously, you must have an HPA defined for that workload. If you only want to view recommendations, however, Thoras can provide them without an HPA by assuming you want suggestions that would scale you to a target of 80% utilization. Example of defining an HPA for an AIScaleTarget:

kubectl autoscale deployment <deployment-name> -n <namespace> --cpu-percent=80 --min=1 --max=10
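
If you prefer to keep the HPA in version control, the equivalent declarative manifest looks like this (a sketch using the standard autoscaling/v2 API, mirroring the command above):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: <deployment-name>
  namespace: <namespace>
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <deployment-name>
  minReplicas: 1
  maxReplicas: 10
  metrics:
    # Matches --cpu-percent=80 from the kubectl autoscale command
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80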

2. Register an AIScaleTarget

See Configuring AIScaleTargets for detailed steps on creating and applying an AIScaleTarget manifest.
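
For orientation, a minimal AIScaleTarget mirrors the shape of the fuller example later in this guide; the names below are placeholders, and the full field list is in that example:

apiVersion: thoras.ai/v1
kind: AIScaleTarget
metadata:
  name: my-ast
  namespace: my-namespace
spec:
  scaleTargetRef:
    kind: Deployment
    name: my-deployment
  horizontal:
    mode: recommendation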

3. Set the Horizontal Mode [ recommendation | autonomous ]

Recommendation Mode

AIScaleTargets should start in recommendation mode while the models learn usage and scaling patterns. To enable recommendation mode, simply set the AIScaleTarget’s spec.horizontal.mode to recommendation. For example, in my-ast.yaml:

---
spec:
  horizontal:
    mode: recommendation

Apply the update

kubectl apply -f my-ast.yaml
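
Once applied, you can read the current recommendations back from the AIScaleTarget object itself; a sketch of the check (the exact status fields shown depend on your Thoras version):

kubectl get aiscaletarget <ast-name> -n <namespace> -o yaml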

Autonomous Mode

After the models are sufficiently trained in recommendation mode, the mode may be switched to autonomous, where scaling recommendations are applied automatically. To enable autonomous mode, simply set the AIScaleTarget’s spec.horizontal.mode to autonomous. For example, in my-ast.yaml:

---
spec:
  horizontal:
    mode: autonomous

Apply the update

kubectl apply -f my-ast.yaml

Using External or Custom Metrics

By default, HPA can scale on CPU and memory. Thoras and Kubernetes also support scaling on external or custom metrics (e.g., queue writes, request rate). Note that only metrics independent of pod count should be used for Thoras scaling (for example, queue length is not independent of pod count, because running more pods drains the queue faster).

Example: Scaling on an External Metric

Add an external metric to your AIScaleTarget:

apiVersion: thoras.ai/v1
kind: AIScaleTarget
metadata:
  name: {{ YOUR_AST_NAME }}
  namespace: {{ YOUR_NAMESPACE }}
spec:
  scaleTargetRef:
    kind: Deployment
    name: {{ YOUR_DEPLOYMENT_NAME }}
  horizontal:
    mode: autonomous
    additionalMetrics:
      - type: External
        external:
          metric:
            name: {{ YOUR_CUSTOM_METRIC }}
            selector:
              matchLabels:
                source: prometheus
          target:
            type: AverageValue
            averageValue: 100

  • type: External — tells HPA to use an external metric.
  • external.metric.name — the name of the metric; it must match what your metrics adapter provides (see the check below this list).
  • external.target.averageValue — the target value for scaling.
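
External metrics reach the HPA through a metrics adapter (such as the Prometheus Adapter) that serves the external.metrics.k8s.io API. Before relying on a metric, you can confirm the adapter actually exposes it; the namespace and metric name below are placeholders:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<namespace>/<metric-name>"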

Verifying HPA and Thoras Integration

  • Check HPA status:

    kubectl describe hpa <hpa-name> -n <namespace>
    
  • Check AIScaleTarget status:

    kubectl get aiscaletarget <ast-name> -n <namespace> -o yaml
    
  • Monitor pod scaling events:

    kubectl get pods -n <namespace> -w
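
  • Watch scaling events recorded against the HPA (a standard kubectl query; adjust the namespace to your workload’s):

    kubectl get events -n <namespace> --field-selector involvedObject.kind=HorizontalPodAutoscaler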