The specification describes the AIScaleTarget Custom Resource that defines how Thoras scales workloads. The following is a sample AIScaleTarget definition:
ast.yaml
apiVersion: thoras.ai/v1
kind: AIScaleTarget
metadata:
  name: "{{ YOUR_AST_NAME }}"
  namespace: "{{ YOUR_NAMESPACE }}"
spec:
  scaleTargetRef:
    kind: Deployment
    name: "{{ YOUR_AST_NAME }}"
  model:
    forecast_blocks: 15m
    forecast_buffer_percentage: 0%
    forecast_cron: "*/15 * * * *"
  horizontal:
    mode: recommendation
    scaling_behavior:
      scale_up:
        type: percent
        percent: 50
      scale_down:
        type: percent
        percent: 50
  vertical:
    containers:
      - name: "{{ CONTAINER_NAME }}"
        cpu:
          lowerbound: 20m
          upperbound: 1
        memory:
          lowerbound: 50Mi
          upperbound: 2G
    mode: recommendation
    scaling_behavior:
      scale_up:
        type: percent
        percent: 50
      scale_down:
        type: percent
        percent: 50
    update_policy:
      update_mode: in_place_or_recreate

metadata

ast.yaml
metadata:
  name: {{ YOUR_AST_NAME }}
  namespace: {{ YOUR_NAMESPACE }}
It’s recommended that metadata.name matches the name of the workload being scaled.

scaleTargetRef

ast.yaml
scaleTargetRef:
  kind: Deployment
  name: {{YOUR_AST_NAME}}
  namespace: {{YOUR_NAMESPACE}}
  apiVersion: apps/v1
scaleTargetRef is the reference to a specific workload that Thoras will scale up or down. The workload must reside in the same namespace as the AIScaleTarget. apiVersion specifies the API group and version that defines the resource, while kind identifies the type of resource (e.g., Deployment, StatefulSet, or Rollout).
Note: scaleTargetRef and selector are mutually exclusive. Use scaleTargetRef to target a single workload by name, or use selector to target multiple workloads by label.

selector

ast.yaml
selector:
  matchLabels:
    app: my-service
    environment: production
selector allows you to target multiple workloads by pod labels instead of specifying a single workload by name. This is useful when you want to apply the same scaling policy to multiple workloads that share common labels.
Important constraints:
  • selector and scaleTargetRef are mutually exclusive—you must use one or the other, not both.
  • When using selector, horizontal scaling is not supported. You must leave spec.horizontal empty or omit it entirely.
  • Pods must be managed by a Deployment, StatefulSet, or Argo Rollout. Standalone pods or other controller types are not supported.
How it works: When a selector is defined, Thoras identifies all pod controllers (Deployments, StatefulSets, Rollouts) in the same namespace whose pods match the label selector.
In autonomous mode:
  • Thoras restarts each matching pod controller one by one when a resource request suggestion comes in from the forecaster.
  • If update_policy.update_mode is set to in_place or in_place_or_recreate, pods are resized in place without restarts (when possible).
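As a sketch of the constraints above, a selector-based AIScaleTarget might look like the following. It reuses only fields shown elsewhere in this specification. The name shared-rightsizing is a hypothetical placeholder, and placing selector directly under spec (where scaleTargetRef would otherwise go) is an assumption based on the two being mutually exclusive. spec.horizontal is omitted because horizontal scaling is not supported with selector.

```yaml
apiVersion: thoras.ai/v1
kind: AIScaleTarget
metadata:
  name: shared-rightsizing          # hypothetical name for illustration
  namespace: "{{ YOUR_NAMESPACE }}"
spec:
  selector:                         # assumed to sit where scaleTargetRef would
    matchLabels:
      app: my-service
      environment: production
  model:
    forecast_blocks: 15m
    forecast_buffer_percentage: 0%
    forecast_cron: "*/15 * * * *"
  # no horizontal block: not supported when selector is used
  vertical:
    containers:
      - name: "{{ CONTAINER_NAME }}"
        cpu:
          lowerbound: 20m
        memory:
          lowerbound: 50Mi
    mode: recommendation
    update_policy:
      update_mode: in_place_or_recreate
```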

model

ast.yaml
model:
  forecast_blocks: 15m
  forecast_buffer_percentage: 0%
  forecast_cron: "*/15 * * * *"

model.forecast_blocks

Describes how far into the future Thoras’ scaler should prepare you for. For example, if forecast_blocks is 15m, Thoras will forecast the maximum load of this workload over the next 15 minutes and then scale to that maximum.
Important: forecast_blocks should be at least equal to your forecast_cron interval. For example, if your forecast_cron is set to run every 15 minutes, your forecast_blocks should be at least 15m. This ensures forecasts always cover the next scaling window and prevents gaps in scaling coverage.
The value of model.forecast_blocks must include a unit: either “m” (for minutes) or “h” (for hours).

model.forecast_buffer_percentage

Defines an additional buffer applied on top of the forecasted resource usage to reduce the risk of under-provisioning. For example, if the forecasted CPU usage is 500m and forecast_buffer_percentage is set to 10%, the final recommended value will be 550m. Setting this to 0% means no buffer will be added.

model.forecast_cron

The forecast_cron setting determines how often forecasts are generated and, consequently, how often Thoras may trigger scaling events. Shorter intervals (e.g., every 5 minutes) can lead to more accurate forecasts but may also result in more frequent scaling. forecast_cron controls how often forecasts are made, while model.forecast_blocks defines the time window each forecast covers.
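To illustrate how the three model fields fit together, here is a sketch of an hourly configuration with a buffer. The values are illustrative rather than recommended. The forecast window matches the cron interval, as advised under model.forecast_blocks.

```yaml
model:
  forecast_blocks: 1h               # each forecast covers the next hour
  forecast_buffer_percentage: 10%   # a 500m forecast becomes a 550m recommendation
  forecast_cron: "0 * * * *"        # a new forecast at the top of every hour
```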

horizontal

ast.yaml
horizontal:
  mode: recommendation
Thoras horizontal scaling adjusts the number of replicas based on forecasted or observed workload demands. The mode field controls whether Thoras provides recommendations or actively scales your workload.
In autonomous mode:
  • Horizontally-scaled workloads must have CPU and/or memory requests defined in the pod spec if you plan to scale on averageUtilization.
  • The target deployment of your AIScaleTarget must have a single existing Horizontal Pod Autoscaler (HPA).
    Note: Only one scaling direction (horizontal or vertical) can be in autonomous mode at a time. See Understanding Vertical and Horizontal Scaling Modes for additional details.
In recommendation mode:
  • Thoras does not require an existing HPA. It will assume an 80% utilization target for CPU and memory when generating scaling suggestions.
Thoras does not replace your HPA—it works alongside it by feeding forecasted metrics into your existing HPA configuration. You can customize behavior by:
  • Opting in additional metrics using spec.horizontal.additional_metrics
  • Opting out of default metrics using spec.horizontal.exclude_metrics
This setup gives you full control over how scaling decisions are made while benefiting from Thoras’ predictive intelligence.
To enable horizontal scaling recommendations, set spec.horizontal.mode to recommendation. We recommend starting in recommendation mode so the model has time to learn workload patterns and you can validate its scaling suggestions. Once the recommendations align with your performance expectations, you can switch to autonomous mode for automated scaling.
Feel free to reach out to the Thoras Engineering Team to discuss model performance before switching into autonomous mode. Visit the Predictive Horizontal Pod Autoscaling with Thoras Guide for more info.
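For illustration, a spec that has moved horizontal scaling to autonomous mode while keeping vertical scaling in recommendation mode might look like the sketch below. The literal value autonomous is an assumption: this page names the mode but only shows recommendation as a value. The target workload is also assumed to already have a single HPA, as required above.

```yaml
spec:
  horizontal:
    mode: autonomous        # assumed spelling of the autonomous mode value
  vertical:
    mode: recommendation    # only one direction can be autonomous at a time
    containers:
      - name: "{{ CONTAINER_NAME }}"
        cpu:
          lowerbound: 20m
        memory:
          lowerbound: 50Mi
```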

vertical

ast.yaml
vertical:
  containers:
    - name: {{CONTAINER_NAME}}
      cpu:
        lowerbound: 20m # Mandatory
        upperbound: 1 # Optional
      memory:
        lowerbound: 50Mi # Mandatory
        upperbound: 2G # Optional
  mode: recommendation
  update_policy:
    update_mode: in_place_or_recreate
The Thoras vertical scaler adjusts container-level CPU and memory resource requests based on forecasted utilization. The mode field controls whether Thoras provides recommendations or actively scales your workload.
To define a vertical scaling policy, you’ll want to set the following fields:
  • vertical.mode: recommendation. We recommend running Thoras in recommendation mode for at least a day so the model can train before you enable autonomous mode.
  • vertical.containers[0].memory.lowerbound is always required; vertical.containers[0].memory.upperbound is optional. Together they set a memory floor and ceiling for your target workload.
  • vertical.containers[].RESOURCE.limit is optional. When set, Thoras modifies the container’s limit along with the request for this resource; if unset, Thoras will not modify limits for this resource.
  • vertical.containers[].RESOURCE.limit.ratio is the ratio used to keep the limit in line with the suggested request for this resource. For example, with a ratio of 2.5, Thoras sets the limit to 2.5G when the suggested request is 1G.
    Note: Only one scaling direction (horizontal or vertical) can be in autonomous mode at a time. See Understanding Vertical and Horizontal Scaling Modes for details on how vertical and horizontal modes interact.
Feel free to reach out to the Thoras Engineering Team to discuss model performance before switching into autonomous mode. Visit Predictive Vertical Pod Rightsizing Guide for more info.
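As a sketch of the limit.ratio field described above, the fragment below sets a ratio for memory only. The nesting is inferred from the field path vertical.containers[].RESOURCE.limit.ratio and is not taken from a confirmed sample, so treat the exact layout as an assumption.

```yaml
vertical:
  containers:
    - name: "{{ CONTAINER_NAME }}"
      memory:
        lowerbound: 50Mi
        upperbound: 2G
        limit:
          ratio: 2.5        # a 1G suggested request yields a 2.5G limit
      cpu:
        lowerbound: 20m     # no limit block, so CPU limits are left unmodified
  mode: recommendation
```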

update_policy

vertical:
  update_policy:
    update_mode: in_place_or_recreate
    recreate_resources: ["memory"]
The update_policy field controls pod update behavior during vertical scaling operations.
Prerequisites for in-place updates:
  • Kubernetes 1.33+
Configuration:
  • update_mode (string): Determines how pod updates are applied:
    • recreate - Pods are recreated when resource changes are applied
    • in_place - Changes applied without pod recreation (requires Kubernetes 1.33+)
    • in_place_or_recreate - Attempts in-place updates with fallback to recreation (recommended)
    • initial - Recommendations apply only during pod creation
  • recreate_resources (optional array): Specifies which resources (memory, cpu) trigger pod recreation when using recreate or fallback modes
How it works:
  1. When update_mode is in_place or in_place_or_recreate, Thoras attempts to resize pods in place using the Kubernetes resize API
  2. When update_mode is in_place_or_recreate and the pod’s QoS class would change or the resize operation is not supported, Thoras falls back to evicting the pod
  3. Evicted pods are rolled out gradually rather than all at once, minimizing service disruption
Benefits:
  • Reduced disruption: Pods are not recreated unless necessary when using in-place updates
  • Faster scaling: Resource adjustments take effect immediately
  • Better node utilization: Enables more efficient bin-packing and cost savings, especially when combined with node autoscaling tools like Karpenter
  • Graceful fallback: Automatic eviction when in-place resizing is not possible
  • Fine-grained control: Specify which resources trigger pod recreation

additional_configuration

scaling_behavior

Controls scaling thresholds to prevent unnecessary scaling actions. Thresholds can be configured independently for scale_up and scale_down. Both percentage and absolute thresholds can be used for horizontal and vertical scaling.
Percentage Thresholds - Trigger scaling when the change exceeds a percentage of current usage (works for both horizontal and vertical):
horizontal:
  scaling_behavior:
    scale_up:
      type: percent
      percent: 20 # Scale up when increase is at least 20%
    scale_down:
      type: percent
      percent: 15

vertical:
  scaling_behavior:
    scale_up:
      type: percent
      percent: 20 # Scale up when increase is at least 20%
    scale_down:
      type: percent
      percent: 15
Absolute Thresholds - Trigger scaling when the change exceeds fixed amounts (works for both horizontal and vertical).
For horizontal scaling, use pods:
horizontal:
  scaling_behavior:
    scale_up:
      type: absolute
      absolute:
        pods: "5" # Scale up only if adding at least 5 pods
    scale_down:
      type: absolute
      absolute:
        pods: "3" # Scale down only if removing at least 3 pods
For vertical scaling, use memory and cpu:
vertical:
  scaling_behavior:
    scale_up:
      type: absolute
      absolute:
        memory: "100Mi" # Scale up only if memory increase is at least 100Mi
        cpu: "500m" # Scale up only if CPU increase is at least 500m
    scale_down:
      type: absolute
      absolute:
        memory: "50Mi" # Scale down only if memory decrease is at least 50Mi
        cpu: "250m" # Scale down only if CPU decrease is at least 250m
Best Practices:
  • Use absolute thresholds for small workloads (percentage changes can be misleading)
  • Use percentage thresholds for large workloads (scales naturally with deployment size)
  • Set asymmetric thresholds (conservative scale-up, aggressive scale-down)
  • Start with higher thresholds (20-30%) to avoid scaling churn

additional_metrics and exclude_metrics

ast.yaml
horizontal:
  mode: recommendation
  additional_metrics:
    - external:
        metric:
          name: {{EXTERNAL_METRIC_NAME}}
          selector:
            matchLabels:
              app: {{APP_NAME}}
        target:
          averageValue: "1"
          type: AverageValue
      type: External
  exclude_metrics:
    - name: cpu
      type: Resource
When you enable horizontal scaling in Thoras, it defaults to using the same metrics as your HPA (Horizontal Pod Autoscaler) to guide predictive scaling, whether they’re resource, external or custom metrics. However, there may be cases where you want Thoras to:
  • Exclude certain metrics (e.g., CPU) from its predictions, even if they are used by the HPA.
  • Add new metrics for prediction that are not part of the HPA.

Example 1: Exclude CPU from Predictive Scaling

This configuration tells Thoras not to use CPU for predictive scaling, even if it’s included in the HPA:
ast.yaml
spec:
  horizontal:
    exclude_metrics:
      - name: cpu
        type: Resource

Example 2: Add Custom Metric for Predictive Scaling

In this example, an external metric is explicitly included in additional_metrics. Thoras will predictively scale on this metric even though it is not defined in the HPA.
ast.yaml
spec:
  horizontal:
    additional_metrics:
      - external:
          metric:
            name: {{EXTERNAL_METRIC_NAME}}
            selector:
              matchLabels:
                app: {{APP_NAME}}
          target:
            averageValue: "1"
            type: AverageValue
        type: External