container-startup-autoscaler (CSA) is a Kubernetes controller that modifies the CPU and/or memory resources of containers depending on whether they're starting up, according to the startup/post-startup settings you supply. CSA works at the pod level and is agnostic to how the pod is managed; it works with deployments, statefulsets, daemonsets and other workload management APIs.
CSA is implemented using controller-runtime.
CSA is built around Kube's In-place Update of Pod Resources
feature, which is currently in alpha state as of Kubernetes 1.29 and therefore requires the InPlacePodVerticalScaling
feature gate to be enabled. Beta/stable targets are indicated here.
The feature implementation (along with the corresponding implementation of CSA) is likely to change until it reaches
stable status. See CHANGELOG.md for details of CSA versions and Kubernetes version compatibility.
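For local experimentation, the feature gate can be enabled when the cluster is created. The following is a minimal sketch of a kind cluster configuration that enables it (the single-node layout is purely illustrative):

```yaml
# kind-csa.yaml - illustrative kind cluster configuration enabling the required feature gate
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  InPlacePodVerticalScaling: true
nodes:
  - role: control-plane
```

The cluster can then be created with `kind create cluster --config kind-csa.yaml`.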
⚠️ This controller should currently only be used for preview purposes on local or otherwise non-production Kubernetes clusters.
A local sandbox is provided for previewing CSA - this video shows fundamental CSA operation using the sandbox scripts:
https://github.com/ExpediaGroup/container-startup-autoscaler/assets/76996781/fcea0175-4f09-43d3-9bad-de5aed8806f2
Versioned multi-arch Docker images are available via Docker Hub.
A CSA Helm chart is available - please see its README.md for more information.
The release of Kubernetes 1.27.0
introduced a new, long-awaited alpha feature:
In-place Update of Pod Resources.
This feature allows pod container resources (`requests` and `limits`) to be updated in-place, without the need to restart the pod. Prior to this, any changes made to container resources required a pod restart to apply.
A historical concern of running workloads within Kubernetes is how to tune container resources for workloads that have very different resource utilization characteristics during two core phases: startup and post-startup. Given the previous lack of ability to change container resources in-place, there was generally a tradeoff for startup-heavy workloads between obtaining good (and consistent) startup times and overall resource wastage, post-startup:
- Set `limits` greater than `requests` in the hope that resources beyond `requests` are actually scavengeable during startup.
- Set `limits` the same as `requests`, with startup time as the primary factor in determining the value.
- Set `limits` the same as `requests`, with normal workload servicing performance as the primary factor in determining the value.
The core motivation of CSA is to leverage the new In-place Update of Pod Resources Kube feature to provide workload owners with the ability to configure container resources for startup (in a guaranteed fashion) separately from normal post-startup workload resources. In doing so, the tradeoffs listed above are eliminated, laying the foundations for consistently fast startup times without post-startup resource wastage.
CSA is able to target a single non-init/ephemeral container within a pod. Configuration such as the target container name and desired startup/post-startup resource settings are contained within a number of pod annotations.
CSA watches for changes in pods that are marked as eligible for scaling (via a label). Upon processing an eligible pod's changes, CSA examines the current state of the target container and takes one of several actions based on that state.
CSA will react when the target container is initially created (by its pod) and if Kube restarts the target container.
CSA will not perform any scaling action if it doesn't need to - for example, if the target container repeatedly fails to start prior to it becoming ready (with Kube reacting with restarts in a `CrashLoopBackOff` manner), CSA will only apply startup resources once.
CSA generates metrics and pod Kube events, along with a detailed status that's included within an annotation of the scaled pod.
The following limitations are currently in place:

- Post-startup resources must be guaranteed (`requests` == `limits`) to match the guaranteed nature of startup resources - Kube API currently rejects any change in resource QoS. This should be addressed as the In-place Update of Pod Resources feature matures.
- Container resources specified upon pod admission must likewise be guaranteed (`requests` == `limits`) to match the guaranteed nature of startup resources per above.

The following restrictions are currently in place and enforced where applicable:

- Post-startup `requests` must be lower than startup resources.
- The target container must specify `requests` for both CPU and memory.
- The target container must specify the `NotRequired` resize policy for both CPU and memory.

An illustrative pod fragment that satisfies these restrictions is shown after the annotations table below.

The following labels must be present in the pod that includes your target container:
| Name | Value | Description |
|---|---|---|
| `csa.expediagroup.com/enabled` | `"true"` | Indicates a container in the pod is eligible for scaling - must be `"true"`. |
The following annotations must be present in the pod that includes your target container:
| Name | Example Value | Description |
|---|---|---|
| `csa.expediagroup.com/target-container-name` | `"mycontainer"` | The name of the container to target. |
| `csa.expediagroup.com/cpu-startup` | `"500m"` * | Startup CPU (applied to both `requests` and `limits`). |
| `csa.expediagroup.com/cpu-post-startup-requests` | `"250m"` * | Post-startup CPU `requests`. |
| `csa.expediagroup.com/cpu-post-startup-limits` | `"250m"` * | Post-startup CPU `limits`. |
| `csa.expediagroup.com/memory-startup` | `"500M"` * | Startup memory (applied to both `requests` and `limits`). |
| `csa.expediagroup.com/memory-post-startup-requests` | `"250M"` * | Post-startup memory `requests`. |
| `csa.expediagroup.com/memory-post-startup-limits` | `"250M"` * | Post-startup memory `limits`. |
* Any CPU/memory form listed here can be used.
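As an illustrative sketch only (the container name and resource values are placeholders), a pod template fragment combining the above labels and annotations with the earlier restrictions might look like this:

```yaml
metadata:
  labels:
    csa.expediagroup.com/enabled: "true"
  annotations:
    csa.expediagroup.com/target-container-name: mycontainer
    csa.expediagroup.com/cpu-startup: "500m"
    csa.expediagroup.com/cpu-post-startup-requests: "250m"
    csa.expediagroup.com/cpu-post-startup-limits: "250m"
    csa.expediagroup.com/memory-startup: "500M"
    csa.expediagroup.com/memory-post-startup-requests: "250M"
    csa.expediagroup.com/memory-post-startup-limits: "250M"
spec:
  containers:
    - name: mycontainer
      resizePolicy:                # NotRequired resize policy for both CPU and memory, per the restrictions
        - resourceName: cpu
          restartPolicy: NotRequired
        - resourceName: memory
          restartPolicy: NotRequired
      resources:
        requests:                  # requests specified for both CPU and memory, per the restrictions
          cpu: 250m
          memory: 250M
        limits:                    # requests == limits (guaranteed), per the limitations
          cpu: 250m
          memory: 250M
```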
CSA needs to know when the target container is starting up and therefore requires you to specify an appropriately configured startup or readiness probe (or both).
If the target container specifies a startup probe, CSA always uses Kube's `started` signal of the container's status to determine whether the container is started. Otherwise, if only a readiness probe is specified, CSA primarily uses the `ready` signal of the container's status to determine whether the container is started.

It's preferable to have a startup probe defined since this unambiguously indicates whether a container is started, whereas a readiness probe alone may reflect other conditions and cause unnecessary scaling (e.g. the readiness probe transiently failing post-startup).
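For illustration, a hypothetical target container might declare probes along the following lines (the endpoint, port and timings are assumptions rather than CSA requirements):

```yaml
containers:
  - name: mycontainer
    ports:
      - containerPort: 8080
    startupProbe:            # preferred: unambiguously signals when the container has started
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30   # allows up to ~150s of startup before Kube restarts the container
    readinessProbe:          # continues to gate traffic post-startup
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
```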
Kube's container status `started` and `ready` signal behavior is as follows:

When only a startup probe is present:

- `started` is `false` when the container is (re)started and `true` when the startup probe succeeds.
- `ready` is `false` when the container is (re)started and `true` when `started` is `true`.

When only a readiness probe is present:

- `started` is `false` when the container is (re)started and `true` when the container is running and has passed the `postStart` lifecycle hook.
- `ready` is `false` when the container is (re)started and `true` when the readiness probe succeeds.

When both startup and readiness probes are present:

- `started` is `false` when the container is (re)started and `true` when the startup probe succeeds.
- `ready` is `false` when the container is (re)started and `true` when the readiness probe succeeds.

CSA reports its status in JSON via the `csa.expediagroup.com/status` annotation. You can retrieve and format the status using `kubectl` and `jq` as follows:
```sh
kubectl get pod <name> -n <namespace> -o=jsonpath='{.metadata.annotations.csa\.expediagroup\.com\/status}' | jq
```
Example output:
```json
{
"status": "Post-startup resources enacted",
"states": {
"startupProbe": "true",
"readinessProbe": "true",
"container": "running",
"started": "true",
"ready": "false",
"resources": "poststartup",
"allocatedResources": "containerrequestsmatch",
"statusResources": "containerresourcesmatch"
},
"scale": {
"lastCommanded": "2023-09-14T08:18:44.174+0000",
"lastEnacted": "2023-09-14T08:18:45.382+0000",
"lastFailed": ""
},
"lastUpdated": "2023-09-14T08:18:45+0000"
}
```
Explanation of status items:
| Item | Sub Item | Description |
|---|---|---|
| `status` | - | Human-readable status. Any validation errors are indicated here. |
| `states` | - | The states of the target container. |
| `states` | `startupProbe` | Whether a startup probe exists. |
| `states` | `readinessProbe` | Whether a readiness probe exists. |
| `states` | `container` | The container status e.g. `waiting`, `running`. |
| `states` | `started` | Whether the container is signalled as started by Kube. |
| `states` | `ready` | Whether the container is signalled as ready by Kube. |
| `states` | `resources` | The type of resources (startup/post-startup) that are currently applied (but not necessarily enacted). |
| `states` | `allocatedResources` | How the reported container allocated resources relate to container requests. |
| `states` | `statusResources` | How the reported currently allocated resources relate to container resources. |
| `scale` | - | Information around scaling activity. |
| `scale` | `lastCommanded` | The last time a scale was commanded (UTC). |
| `scale` | `lastEnacted` | The last time a scale was enacted (UTC; empty if failed). |
| `scale` | `lastFailed` | The last time a scale failed (UTC; empty if enacted). |
| `lastUpdated` | - | The last time this status was updated. |
The following Kube events for the pod that houses the target container are generated:
| Trigger | Reason |
|---|---|
| Startup resources are commanded. | `Scaling` |
| Startup resources are enacted. | `Scaling` |
| Post-startup resources are commanded. | `Scaling` |
| Post-startup resources are enacted. | `Scaling` |

| Trigger | Reason |
|---|---|
| Validation failure. | `Validation` |
| Failed to scale commanded startup resources. | `Scaling` |
| Failed to scale commanded post-startup resources. | `Scaling` |
CSA uses the logr API with zerologr to log JSON-based `error`-, `info`-, `debug`- and `trace`-level messages.

When configuring verbosity, `info`-level messages have a verbosity (`v`) of 0, `debug`-level messages have a `v` of 1, and `trace`-level messages have a `v` of 2 - this is mapped via zerologr. Regardless of configured logging verbosity, `error`-level messages are always emitted.

Example `info`-level log:
```json
{
"level": "info",
"controller": "container-startup-autoscaler",
"namespace": "echoserver",
"name": "echoserver-5f65d8f65d-mvqt8",
"reconcileID": "6157dd49-7aa9-4cac-bbaf-a739fa48cc61",
"targetname": "echoserver",
"targetstates": {
"startupProbe": "true",
"readinessProbe": "true",
"container": "running",
"started": "true",
"ready": "false",
"resources": "poststartup",
"allocatedResources": "containerrequestsmatch",
"statusResources": "containerresourcesmatch"
},
"caller": "container-startup-autoscaler/internal/pod/targetcontaineraction.go:472",
"time": 1694681974425,
"message": "post-startup resources enacted"
}
```
Each message includes a number of keys that originate from controller-runtime and zerologr. CSA-added values include:

- `targetname`: the name of the container to target.
- `targetstates`: the states of the target container, per status.

Regardless of configured logging verbosity, `error`-level messages are always displayed and additionally include a stack trace key (`stacktrace`), if available.
Additional CSA-specific metrics are registered with the Prometheus registry provided by controller-runtime and exposed on port 8080 at path `/metrics`, e.g. `http://localhost:8080/metrics`. CSA metrics are not pre-initialized with `0` values.
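How these metrics are scraped depends on your monitoring setup; purely as a minimal sketch, a static Prometheus scrape configuration targeting the address above might look like:

```yaml
scrape_configs:
  - job_name: csa
    metrics_path: /metrics            # the path exposed by CSA, per above
    static_configs:
      - targets: ["localhost:8080"]   # the CSA metrics endpoint, per above
```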
Prefixed with `csa_reconciler_`:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `skipped_only_status_change` | Counter | `controller` | Number of reconciles that were skipped because only the scaler controller status changed. |
| `existing_in_progress` | Counter | `controller` | Number of attempted reconciles where one was already in progress for the same namespace/name (results in a requeue). |
| `failure_unable_to_get_pod` | Counter | `controller` | Number of reconciles where there was a failure to get the pod (results in a requeue). |
| `failure_pod_doesnt_exist` | Counter | `controller` | Number of reconciles where the pod was found not to exist (results in failure). |
| `failure_validation` | Counter | `controller` | Number of reconciles where there was a failure to validate (results in failure). |
| `failure_states_determination` | Counter | `controller` | Number of reconciles where there was a failure to determine states (results in failure). |
| `failure_states_action` | Counter | `controller` | Number of reconciles where there was a failure to action the determined states (results in failure). |

Labels:

- `controller`: the CSA controller name.

Prefixed with `csa_scale_`:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `failure` | Counter | `controller`, `direction`, `reason` | Number of scale failures. |
| `commanded_unknown_resources` | Counter | `controller` | Number of scales commanded upon encountering unknown resources (see here). |
| `duration_seconds` | Histogram | `controller`, `direction`, `outcome` | Scale duration (from commanded to enacted). |

Labels:

- `controller`: the CSA controller name.
- `direction`: the direction of the scale - `up`/`down`.
- `reason`: the reason why the scale failed.
- `outcome`: the outcome of the scale - `success`/`failure`.

Prefixed with `csa_retrykubeapi_`:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `retry` | Counter | `controller`, `reason` | Number of Kube API retries. |

Labels:

- `controller`: the CSA controller name.
- `reason`: the Kube API response that caused a retry to occur.

See below for more information on retries.
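As an example of consuming these metrics (the threshold and window are arbitrary illustrations), a Prometheus alerting rule on scale failures might look like:

```yaml
groups:
  - name: csa
    rules:
      - alert: CsaScaleFailures
        # csa_scale_ prefix + failure counter, per the metrics above
        expr: increase(csa_scale_failure[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: CSA reported one or more scale failures in the last 10 minutes
```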
Unless Kube API reports that a pod is not found upon trying to retrieve it, all Kube API interactions are subject to retry according to CSA retry configuration.
CSA handles situations where Kube API reports a conflict upon a pod update. In this case, CSA retrieves the latest version of the pod and reapplies the update, before trying again (subject to retry configuration).
By default, CSA will yield an error if it encounters resources applied to a target container that it doesn't recognize, i.e. resources other than those specified within the pod startup or post-startup resource annotations. This may occur if resources are updated by an actor other than CSA. To allow corrective scaling upon encountering such a condition, set the `--scale-when-unknown-resources` configuration flag to `true`.
When enabled and upon encountering such conditions, CSA will:

- Indicate the condition within its status (unknown resources applied).
- Increment the `commanded_unknown_resources` metric.
- Report corrective scales to post-startup resources with a direction of `down` within the `failure` and `duration_seconds` (as applicable) metrics.
- Report corrective scales to startup resources with a direction of `up` within the `failure` and `duration_seconds` (as applicable) metrics.

CSA uses the Cobra CLI library and exposes a number of optional configuration flags. All configuration flags are always logged upon CSA start.
| Flag | Type | Default Value | Description |
|---|---|---|---|
| `--kubeconfig` | String | - | Absolute path to the cluster kubeconfig file (uses in-cluster configuration if not supplied). |
| `--leader-election-enabled` | Boolean | `true` | Whether to enable leader election. |
| `--leader-election-resource-namespace` | String | - | The namespace to create resources in if leader election is enabled (uses current namespace if not supplied). |
| `--cache-sync-period-mins` | Integer | `60` | How frequently the informer should re-sync. |
| `--graceful-shutdown-timeout-secs` | Integer | `10` | How long to allow busy workers to complete upon shutdown. |
| `--requeue-duration-secs` | Integer | `3` | How long to wait before requeuing a reconcile. |
| `--max-concurrent-reconciles` | Integer | `10` | The maximum number of concurrent reconciles. |
| `--scale-when-unknown-resources` | Boolean | `false` | Whether to scale when unknown resources are encountered. |
| Flag | Type | Default Value | Description |
|---|---|---|---|
| `--standard-retry-attempts` | Integer | `3` | The maximum number of attempts for a standard retry. |
| `--standard-retry-delay-secs` | Integer | `1` | The number of seconds to wait between standard retry attempts. |
| Flag | Type | Default Value | Description |
|---|---|---|---|
| `--log-v` | Integer | `0` | Log verbosity level (0: info, 1: debug, 2: trace) - 2 used if invalid. |
| `--log-add-caller` | Boolean | `false` | Whether to include the caller within logging output. |
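How these flags are supplied depends on how CSA is deployed - the Helm chart may expose its own values for them. Purely as a sketch, they could be passed as container args in a Deployment:

```yaml
containers:
  - name: container-startup-autoscaler   # container name is an assumption
    args:
      - --leader-election-enabled=true
      - --log-v=1
      - --scale-when-unknown-resources=false
```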
Upon pod cluster admission, CSA will attempt to upscale the target container to its startup configuration. Upscaling success depends on node loading conditions - it's therefore possible that the scale is delayed or fails altogether, particularly if a cluster consolidation mechanism is employed.

In order to mitigate the effects of initial startup upscaling, it's recommended to admit pods with the target container startup configuration already applied - CSA will not need to initially upscale in this case. Once startup has completed, the subsequent downscale to apply post-startup resources is significantly less likely to fail since it's not subject to node loading conditions. In addition, any failure mode results in overall resource over-provisioning rather than startup under-provisioning.

It's important to note that in either case, CSA will need to upscale if Kube restarts the target container.
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      labels:
        csa.expediagroup.com/enabled: "true"
      annotations:
        csa.expediagroup.com/target-container-name: target-container
        csa.expediagroup.com/cpu-startup: 500m
        csa.expediagroup.com/cpu-post-startup-requests: 100m
        csa.expediagroup.com/cpu-post-startup-limits: 100m
        csa.expediagroup.com/memory-startup: 500M
        csa.expediagroup.com/memory-post-startup-requests: 100M
        csa.expediagroup.com/memory-post-startup-limits: 100M
    spec:
      containers:
        - name: target-container
          resources:
            limits:
              cpu: 500m    # Admitted with csa.expediagroup.com/cpu-startup value
              memory: 500M # Admitted with csa.expediagroup.com/memory-startup value
            requests:
              cpu: 500m    # Admitted with csa.expediagroup.com/cpu-startup value
              memory: 500M # Admitted with csa.expediagroup.com/memory-startup value
```
Please consider carefully whether it's appropriate to scale memory during execution of your container. Memory management differs between runtimes, and it's not necessarily possible to change any runtime configuration (e.g. limits) set at the point of admission without restarting the container. Some runtimes may also default memory management settings based on available resources, which may no longer be optimal when memory is scaled.
In addition, some languages/frameworks may default configuration of concurrency mechanisms (e.g. thread pools) based on available CPU resources - this should be taken into consideration if applicable.
Unit tests can be run by executing `make test-run-unit` from the root directory.

Integration tests can be run by executing `make test-run-int` or `make test-run-int-verbose` (verbose logging) from the root directory. Please ensure you're using a version of Go that's at least that of the version that's indicated at the top of go.mod.
Integration tests are implemented as Go tests and located in `test/integration`. During initialization of the tests, a kind cluster is created (with a specific name); CSA is built via Docker and run via Helm. Tools are not bundled with the tests, so you must have the following installed locally (test development versions indicated):

The integration tests use echo-server for containers. Note: the very first execution might take some time to complete.
A number of environment variable-based configuration options are available:
| Name | Default | Description |
|---|---|---|
| `MAX_PARALLELISM` | `4` | The maximum number of tests that can run in parallel. |
| `REUSE_CLUSTER` | `false` | Whether to reuse an existing CSA kind cluster (if it already exists). |
| `INSTALL_METRICS_SERVER` | `false` | Whether to install metrics-server. |
| `KEEP_CSA` | `false` | Whether to keep the CSA installation after tests finish. |
| `KEEP_CLUSTER` | `false` | Whether to keep the CSA kind cluster after tests finish. |
| `DELETE_NS_AFTER_TEST` | `true` | Whether to delete namespaces created by tests after they conclude. |
Integration tests are executed in parallel due to their long-running nature. Each test operates within a separate Kube namespace (but using the same single CSA installation). If local resources are limited, reduce `MAX_PARALLELISM` accordingly and ensure `DELETE_NS_AFTER_TEST` is `true`. Each test typically spins up 2 pods, each with 2 containers; see source for resource allocations.
A number of Bash scripts are supplied in the `scripts/sandbox` directory that allow you to try out CSA using echo-server. The scripts are similar in nature to the setup/teardown work performed in the integration tests and have the same local tool requirements. Please ensure you're using a version of Go that's at least that of the version that's indicated at the top of go.mod. Note: the kind cluster created by the scripts is named differently to the integration tests such that both can exist in parallel, if desired.
Executing `csa-install.sh`:

- Creates the sandbox kind cluster, with its kubeconfig written to `$HOME/.kube/`.
- Builds and installs CSA with a log verbosity of `2` (trace).

Note: the very first execution might take some time to complete.
Executing `csa-tail-logs.sh` tails logs from the current CSA leader pod.

Executing `csa-get-metrics.sh` gets metrics from the current CSA leader pod.

Executing `echo-watch.sh` `watch`es the CSA status for the pod created below along with the target container's enacted resources.

Execute `echo-reinstall.sh` to (re)install echo-service with a specific probe configuration contained within the `echo` directory structure:
Admit with post-startup resources (initial upscale required):

- `echo-reinstall.sh echo/post-startup-resources/startup-probe.yaml`: single replica/container deployment with startup probe only.
- `echo-reinstall.sh echo/post-startup-resources/readiness-probe.yaml`: single replica/container deployment with readiness probe only.
- `echo-reinstall.sh echo/post-startup-resources/both-probes.yaml`: single replica/container deployment with both startup and readiness probes.

Admit with startup resources (initial upscale not required):

- `echo-reinstall.sh echo/startup-resources/startup-probe.yaml`: single replica/container deployment with startup probe only.
- `echo-reinstall.sh echo/startup-resources/readiness-probe.yaml`: single replica/container deployment with readiness probe only.
- `echo-reinstall.sh echo/startup-resources/both-probes.yaml`: single replica/container deployment with both startup and readiness probes.

To simulate workload startup/readiness, `initialDelaySeconds` is set as follows in all configurations:
| Configuration | Startup Probe | Readiness Probe |
|---|---|---|
| startup-probe.yaml | `15` | N/A |
| readiness-probe.yaml | N/A | `15` |
| both-probes.yaml | `15` | `30` |
You can also cause a validation failure by executing `echo-reinstall.sh echo/validation-failure/cpu-config.yaml`. This will yield the `cpu post-startup requests (...) is greater than startup value (...)` status message.

Execute `echo-cause-container-restart.sh` to cause the echo-service container to restart. Note: `CrashLoopBackOff` may be triggered upon executing this multiple times in succession.

Executing `echo-delete.sh` deletes the echo-server namespace (including pod).

Executing `csa-uninstall.sh` uninstalls the CSA kind cluster.
First establish a watch on CSA status and enacted container resources and optionally tail CSA logs. You may also want to observe CSA metrics.

- Install `echo/post-startup-resources/startup-probe.yaml` and watch as CSA upscales the container for startup, then downscales once the container is started.
- Install `echo/startup-resources/startup-probe.yaml` and watch as CSA only downscales once the container is started - note the CSA `lastCommanded` and `lastEnacted` status is not populated until downscale.
- Install a readiness-probe-only configuration (`echo/*/readiness-probe.yaml`) and watch as CSA only reacts to the container's `ready` status i.e. not `started`.
- Install a both-probes configuration (`echo/*/both-probes.yaml`) and watch as CSA only reacts to the container's `started` status i.e. not `ready`.
- Install `echo/validation-failure/cpu-config.yaml` and observe CSA status when a validation failure occurs.