When a container is killed due to an out-of-memory (OOM) event, Thoras can automatically increase the container’s memory request to stabilize the workload. This prevents a workload from cycling through repeated OOM kills while waiting for a scheduled forecast to produce a sufficient memory recommendation. OOM remediation requires autonomous vertical scaling to be enabled, and it is opt-in per workload.
How It Works
When OOM kills are detected on a workload, Thoras:
- Bumps memory requests for affected containers by a 1.2x multiplier, computed from the average current container memory request across all running pods.
- Applies the adjustment to running pods and any newly created pods. For running pods, Thoras uses your update_policy: it attempts an in-place resize first (if enabled), and falls back to a rolling restart (if enabled) when resize fails or is not supported. Any pods created or rescheduled in the meantime (e.g., evictions, scale-out) automatically receive the adjusted memory regardless of whether the resize or restart succeeds.
- Repeats every 2 minutes as long as OOM kills continue, compounding each time (1.2×, then 1.44×, then 1.73×, and so on) until OOMs stop.
- Holds the memory floor for a configurable stabilization window after the last OOM. During this window, forecasts can raise memory above the floor but cannot lower it. This prevents a new forecast, which may not yet reflect the OOM episode, from lowering memory back to a level prone to further OOMs.
- Returns full control to the forecaster once the stabilization window expires. If OOMs recur, the cycle begins again.
Enabling OOM Remediation
Add oom_remediation to your spec.vertical configuration:
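A minimal sketch of the relevant fields is shown below. The field names (oom_remediation.enabled, spec.vertical.mode) come from this page; the exact nesting and surrounding layout of the workload spec are assumptions, so consult the full schema reference:

```yaml
spec:
  vertical:
    mode: autonomous       # required: workloads in recommendation mode are not affected
    oom_remediation:
      enabled: true        # opt-in per workload
```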
oom_remediation.enabled must be true and spec.vertical.mode must be
autonomous. Workloads in recommendation mode are not affected.
Stabilization Window
After Thoras applies an OOM memory adjustment, it holds a memory floor for the duration of the stabilization window. During this window:
- Incoming forecasts cannot lower memory below the adjusted value.
- Forecasts can raise memory above the adjusted value if the forecast recommends it.
- Each new OOM resets the stabilization window, extending it from the time of the most recent OOM. The workload must be stable for the configured interval before memory may be scaled down from the multiplied value.
The default stabilization window is max(forecast_interval, 1h). The window is at minimum one full forecast cycle, ensuring the forecaster has had at least one opportunity to observe the workload at the adjusted memory level before the floor is removed.
The stabilization window should be long enough to include a period in which memory usage reaches the level that would have been required to prevent the previously observed OOMs, so that the forecaster is able to account for the high memory usage.
To override the default, set stabilization_window explicitly:
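A sketch of an explicit override, assuming stabilization_window sits alongside enabled under oom_remediation and accepts a duration string (the "2h" format is an assumption):

```yaml
spec:
  vertical:
    mode: autonomous
    oom_remediation:
      enabled: true
      stabilization_window: 2h   # hold the memory floor for 2 hours after the last OOM
```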
Upper Bounds
If memory.upperbound is configured on a container, OOM adjustments are clamped
to that value. Adjustments will not exceed the upper bound regardless of how
many times the multiplier compounds.
If no upper bound is configured, adjustments compound without a ceiling. For
workloads where runaway memory growth would be harmful, configure
memory.upperbound.
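A sketch of a per-container ceiling, assuming containers are listed by name under spec.vertical (the container name and list placement are hypothetical; memory.upperbound is the field named on this page):

```yaml
spec:
  vertical:
    mode: autonomous
    oom_remediation:
      enabled: true
    containers:
      - name: app              # hypothetical container name
        memory:
          upperbound: 4Gi      # OOM adjustments will never exceed this value
```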
Interaction with Forecasts
During an active stabilization window, Thoras applies max(forecast, floor) to
memory when scaling. CPU is always sourced from the forecast unchanged.
When a new forecast arrives after the window has expired, the floor is cleared
and the forecast value is applied directly. If the workload OOMs again, the
remediation cycle restarts from the new forecast baseline.
Considerations
Memory may be held above the forecast during the stabilization window. If an OOM was caused by a one-off spike and the workload returns to lower memory usage afterward, requests will remain elevated until the window expires. Thoras favors avoiding another OOM over reclaiming memory immediately, and allows configuring the duration of the stabilization window.

New pods after the window expires start at the forecast value. Once the floor clears, pods created from that point forward use the forecaster’s recommendation. If that recommendation is insufficient, another OOM may occur and trigger a new remediation cycle.

Partial in-place resize failures. If some pods cannot be resized in place (e.g., the node lacks capacity), those pods continue at their previous memory until the node has room or they are rescheduled. The next adjustment cycle reads a blended average and targets a slightly higher value.

Full Example
- OOM remediation is enabled with a 2-hour stabilization window
- Memory adjustments are clamped to 4 GiB (upperbound)
- Pods are resized in place where possible, with recreation as fallback
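The bullets above could be expressed roughly as follows. This is a sketch, not a verified manifest: the update_policy keys and values are hypothetical (the page only says in-place resize is tried first with rolling restart as fallback), and the container list placement is assumed, so check the schema reference before using it:

```yaml
spec:
  vertical:
    mode: autonomous           # required for OOM remediation
    update_policy:
      resize: enabled          # hypothetical key: attempt in-place resize first
      restart: enabled         # hypothetical key: fall back to recreation
    oom_remediation:
      enabled: true
      stabilization_window: 2h # hold the memory floor for 2 hours after the last OOM
    containers:
      - name: app              # hypothetical container name
        memory:
          upperbound: 4Gi      # clamp OOM adjustments to 4 GiB
```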