container-startup-autoscaler (CSA) is a Kubernetes controller that modifies the CPU and/or memory resources of containers depending on whether they're starting up, according to the startup/post-startup settings you supply. CSA works at the pod level and is agnostic to how the pod is managed; it works with deployments, statefulsets, daemonsets and other workload management APIs.
CSA is implemented using controller-runtime.
CSA is built around Kube's In-place Update of Pod Resources
feature, which is currently in alpha state as of Kubernetes 1.29 and therefore requires the InPlacePodVerticalScaling
feature gate to be enabled. Beta/stable targets are indicated here.
The feature implementation (along with the corresponding implementation of CSA) is likely to change until it reaches
stable status. See CHANGELOG.md for details of CSA versions and Kubernetes version compatibility.
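For local experimentation, the feature gate can be enabled when the cluster is created. The following is a minimal sketch of a kind cluster configuration that enables it (the single-node layout is purely illustrative):

```yaml
# kind-csa.yaml - illustrative kind cluster configuration enabling the required feature gate
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  InPlacePodVerticalScaling: true
nodes:
  - role: control-plane
```

The cluster can then be created with `kind create cluster --config kind-csa.yaml`.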
⚠️ This controller should currently only be used for preview purposes on local or otherwise non-production Kubernetes clusters.
A local sandbox is provided for previewing CSA - this video shows fundamental CSA operation using the sandbox scripts:
https://github.com/ExpediaGroup/container-startup-autoscaler/assets/76996781/fcea0175-4f09-43d3-9bad-de5aed8806f2
Versioned multi-arch Docker images are available via Docker Hub.
A CSA Helm chart is available - please see its README.md for more information.
The release of Kubernetes 1.27.0
introduced a new, long-awaited alpha feature:
In-place Update of Pod Resources.
This feature allows pod container resources (`requests` and `limits`) to be updated in-place, without the need to restart the pod. Prior to this, any changes made to container resources required a pod restart to apply.
A historical concern of running workloads within Kubernetes is how to tune container resources for workloads that have very different resource utilization characteristics during two core phases: startup and post-startup. Given the previous lack of ability to change container resources in-place, there was generally a tradeoff for startup-heavy workloads between obtaining good (and consistent) startup times and overall resource wastage, post-startup:
- Set `limits` greater than `requests` in the hope that resources beyond `requests` are actually scavengeable during startup.
- Set `limits` the same as `requests`, with startup time as the primary factor in determining the value.
- Set `limits` the same as `requests`, with normal workload servicing performance as the primary factor in determining the value.
The core motivation of CSA is to leverage the new In-place Update of Pod Resources Kube feature to provide workload owners with the ability to configure container resources for startup (in a guaranteed fashion) separately from normal post-startup workload resources. In doing so, the tradeoffs listed above are eliminated, laying the foundations for consistently fast startup times without post-startup resource wastage.
CSA is able to target a single non-init/ephemeral container within a pod. Configuration such as the target container name and desired startup/post-startup resource settings are contained within a number of pod annotations.
CSA watches for changes in pods that are marked as eligible for scaling (via a label). Upon processing an eligible pod's changes, CSA examines the current state of the target container and takes one of several actions based on that state.
CSA will react when the target container is initially created (by its pod) and if Kube restarts the target container.
CSA will not perform any scaling action if it doesn't need to - for example, if the target container repeatedly fails to start prior to it becoming ready (with Kube reacting with restarts in a `CrashLoopBackOff` manner), CSA will only apply startup resources once.
CSA generates metrics and pod Kube events, along with a detailed status that's included within an annotation of the scaled pod.
The following limitations are currently in place:

- Post-startup resources must be guaranteed (`requests` == `limits`) to match the guaranteed nature of startup resources - Kube API currently rejects any change in resource QoS. This should be addressed as the In-place Update of Pod Resources feature matures.
- Container resources specified upon pod admission must likewise be guaranteed (`requests` == `limits`) to match the guaranteed nature of startup resources per above.

The following restrictions are currently in place and enforced where applicable:

- Post-startup `requests` must be lower than startup resources.
- The target container must specify `requests` for both CPU and memory.
- The target container must specify the `NotRequired` resize policy for both CPU and memory.

An illustrative pod fragment that satisfies these restrictions is shown after the annotations table below.

The following labels must be present in the pod that includes your target container:
| Name | Value | Description |
|---|---|---|
| `csa.expediagroup.com/enabled` | `"true"` | Indicates a container in the pod is eligible for scaling - must be `"true"`. |
The following annotations must be present in the pod that includes your target container:
| Name | Example Value | Description |
|---|---|---|
| `csa.expediagroup.com/target-container-name` | `"mycontainer"` | The name of the container to target. |
| `csa.expediagroup.com/cpu-startup` | `"500m"` * | Startup CPU (applied to both `requests` and `limits`). |
| `csa.expediagroup.com/cpu-post-startup-requests` | `"250m"` * | Post-startup CPU `requests`. |
| `csa.expediagroup.com/cpu-post-startup-limits` | `"250m"` * | Post-startup CPU `limits`. |
| `csa.expediagroup.com/memory-startup` | `"500M"` * | Startup memory (applied to both `requests` and `limits`). |
| `csa.expediagroup.com/memory-post-startup-requests` | `"250M"` * | Post-startup memory `requests`. |
| `csa.expediagroup.com/memory-post-startup-limits` | `"250M"` * | Post-startup memory `limits`. |
* Any CPU/memory form listed here can be used.
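As an illustrative sketch only (the container name and resource values are placeholders), a pod template fragment combining the above labels and annotations with the earlier restrictions might look like this:

```yaml
metadata:
  labels:
    csa.expediagroup.com/enabled: "true"
  annotations:
    csa.expediagroup.com/target-container-name: mycontainer
    csa.expediagroup.com/cpu-startup: "500m"
    csa.expediagroup.com/cpu-post-startup-requests: "250m"
    csa.expediagroup.com/cpu-post-startup-limits: "250m"
    csa.expediagroup.com/memory-startup: "500M"
    csa.expediagroup.com/memory-post-startup-requests: "250M"
    csa.expediagroup.com/memory-post-startup-limits: "250M"
spec:
  containers:
    - name: mycontainer
      resizePolicy:                # NotRequired resize policy for both CPU and memory, per the restrictions
        - resourceName: cpu
          restartPolicy: NotRequired
        - resourceName: memory
          restartPolicy: NotRequired
      resources:
        requests:                  # requests specified for both CPU and memory, per the restrictions
          cpu: 250m
          memory: 250M
        limits:                    # requests == limits (guaranteed), per the limitations
          cpu: 250m
          memory: 250M
```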
CSA needs to know when the target container is starting up and therefore requires you to specify an appropriately configured startup or readiness probe (or both).
If the target container specifies a startup probe, CSA always uses Kube's `started` signal of the container's status to determine whether the container is started. Otherwise, if only a readiness probe is specified, CSA primarily uses the `ready` signal of the container's status to determine whether the container is started.

It's preferable to have a startup probe defined since this unambiguously indicates whether a container is started, whereas a readiness probe alone may reflect other conditions and cause unnecessary scaling (e.g. the readiness probe transiently failing post-startup).
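For illustration, a hypothetical target container might declare probes along the following lines (the endpoint, port and timings are assumptions rather than CSA requirements):

```yaml
containers:
  - name: mycontainer
    ports:
      - containerPort: 8080
    startupProbe:            # preferred: unambiguously signals when the container has started
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30   # allows up to ~150s of startup before Kube restarts the container
    readinessProbe:          # continues to gate traffic post-startup
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
```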
Kube's container status `started` and `ready` signal behavior is as follows:

When only a startup probe is present:

- `started` is `false` when the container is (re)started and `true` when the startup probe succeeds.
- `ready` is `false` when the container is (re)started and `true` when `started` is `true`.

When only a readiness probe is present:

- `started` is `false` when the container is (re)started and `true` when the container is running and has passed the `postStart` lifecycle hook.
- `ready` is `false` when the container is (re)started and `true` when the readiness probe succeeds.

When both startup and readiness probes are present:

- `started` is `false` when the container is (re)started and `true` when the startup probe succeeds.
- `ready` is `false` when the container is (re)started and `true` when the readiness probe succeeds.

CSA reports its status in JSON via the `csa.expediagroup.com/status` annotation. You can retrieve and format the status using `kubectl` and `jq` as follows:
```sh
kubectl get pod <name> -n <namespace> -o=jsonpath='{.metadata.annotations.csa\.expediagroup\.com\/status}' | jq
```
Example output:
```json
{
"status": "Post-startup resources enacted",
"states": {
"startupProbe": "true",
"readinessProbe": "true",
"container": "running",
"started": "true",
"ready": "false",
"resources": "poststartup",
"allocatedResources": "containerrequestsmatch",
"statusResources": "containerresourcesmatch"
},
"scale": {
"lastCommanded": "2023-09-14T08:18:44.174+0000",
"lastEnacted": "2023-09-14T08:18:45.382+0000",
"lastFailed": ""
},
"lastUpdated": "2023-09-14T08:18:45+0000"
}
```
Explanation of status items:
| Item | Sub Item | Description |
|---|---|---|
| `status` | - | Human-readable status. Any validation errors are indicated here. |
| `states` | - | The states of the target container. |
| `states` | `startupProbe` | Whether a startup probe exists. |
| `states` | `readinessProbe` | Whether a readiness probe exists. |
| `states` | `container` | The container status e.g. `waiting`, `running`. |
| `states` | `started` | Whether the container is signalled as started by Kube. |
| `states` | `ready` | Whether the container is signalled as ready by Kube. |
| `states` | `resources` | The type of resources (startup/post-startup) that are currently applied (but not necessarily enacted). |
| `states` | `allocatedResources` | How the reported container allocated resources relate to container requests. |
| `states` | `statusResources` | How the reported currently allocated resources relate to container resources. |
| `scale` | - | Information around scaling activity. |
| `scale` | `lastCommanded` | The last time a scale was commanded (UTC). |
| `scale` | `lastEnacted` | The last time a scale was enacted (UTC; empty if failed). |
| `scale` | `lastFailed` | The last time a scale failed (UTC; empty if enacted). |
| `lastUpdated` | - | The last time this status was updated. |
The following Kube events for the pod that houses the target container are generated:
| Trigger | Reason |
|---|---|
| Startup resources are commanded. | `Scaling` |
| Startup resources are enacted. | `Scaling` |
| Post-startup resources are commanded. | `Scaling` |
| Post-startup resources are enacted. | `Scaling` |

| Trigger | Reason |
|---|---|
| Validation failure. | `Validation` |
| Failed to scale commanded startup resources. | `Scaling` |
| Failed to scale commanded post-startup resources. | `Scaling` |
CSA uses the logr API with zerologr to log JSON-based `error`-, `info`-, `debug`- and `trace`-level messages.

When configuring verbosity, `info`-level messages have a verbosity (`v`) of 0, `debug`-level messages have a `v` of 1, and `trace`-level messages have a `v` of 2 - this is mapped via zerologr. Regardless of configured logging verbosity, `error`-level messages are always emitted.

Example `info`-level log:
```json
{
"level": "info",
"controller": "container-startup-autoscaler",
"namespace": "echoserver",
"name": "echoserver-5f65d8f65d-mvqt8",
"reconcileID": "6157dd49-7aa9-4cac-bbaf-a739fa48cc61",
"targetname": "echoserver",
"targetstates": {
"startupProbe": "true",
"readinessProbe": "true",
"container": "running",
"started": "true",
"ready": "false",
"resources": "poststartup",
"allocatedResources": "containerrequestsmatch",
"statusResources": "containerresourcesmatch"
},
"caller": "container-startup-autoscaler/internal/pod/targetcontaineraction.go:472",
"time": 1694681974425,
"message": "post-startup resources enacted"
}
```
Each message includes a number of keys that originate from controller-runtime and zerologr. CSA-added values include:

- `targetname`: the name of the container to target.
- `targetstates`: the states of the target container, per status.

Regardless of configured logging verbosity, `error`-level messages are always displayed and additionally include a stack trace key (`stacktrace`), if available.
Additional CSA-specific metrics are registered with the Prometheus registry provided by controller-runtime and exposed on port 8080 at path `/metrics`, e.g. `http://localhost:8080/metrics`. CSA metrics are not pre-initialized with `0` values.
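How these metrics are scraped depends on your monitoring setup; purely as a minimal sketch, a static Prometheus scrape configuration targeting the address above might look like:

```yaml
scrape_configs:
  - job_name: csa
    metrics_path: /metrics            # the path exposed by CSA, per above
    static_configs:
      - targets: ["localhost:8080"]   # the CSA metrics endpoint, per above
```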
Prefixed with `csa_reconciler_`:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `skipped_only_status_change` | Counter | `controller` | Number of reconciles that were skipped because only the scaler controller status changed. |
| `existing_in_progress` | Counter | `controller` | Number of attempted reconciles where one was already in progress for the same namespace/name (results in a requeue). |
| `failure_unable_to_get_pod` | Counter | `controller` | Number of reconciles where there was a failure to get the pod (results in a requeue). |
| `failure_pod_doesnt_exist` | Counter | `controller` | Number of reconciles where the pod was found not to exist (results in failure). |
| `failure_validation` | Counter | `controller` | Number of reconciles where there was a failure to validate (results in failure). |
| `failure_states_determination` | Counter | `controller` | Number of reconciles where there was a failure to determine states (results in failure). |
| `failure_states_action` | Counter | `controller` | Number of reconciles where there was a failure to action the determined states (results in failure). |

Labels:

- `controller`: the CSA controller name.

Prefixed with `csa_scale_`:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `failure` | Counter | `controller`, `direction`, `reason` | Number of scale failures. |
| `commanded_unknown_resources` | Counter | `controller` | Number of scales commanded upon encountering unknown resources (see here). |
| `duration_seconds` | Histogram | `controller`, `direction`, `outcome` | Scale duration (from commanded to enacted). |

Labels:

- `controller`: the CSA controller name.
- `direction`: the direction of the scale - `up`/`down`.
- `reason`: the reason why the scale failed.
- `outcome`: the outcome of the scale - `success`/`failure`.

Prefixed with `csa_retrykubeapi_`:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| `retry` | Counter | `controller`, `reason` | Number of Kube API retries. |

Labels:

- `controller`: the CSA controller name.
- `reason`: the Kube API response that caused a retry to occur.

See below for more information on retries.
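As an example of consuming these metrics (the threshold and window are arbitrary illustrations), a Prometheus alerting rule on scale failures might look like:

```yaml
groups:
  - name: csa
    rules:
      - alert: CsaScaleFailures
        # csa_scale_ prefix + failure counter, per the metrics above
        expr: increase(csa_scale_failure[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: CSA reported one or more scale failures in the last 10 minutes
```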
Unless Kube API reports that a pod is not found upon trying to retrieve it, all Kube API interactions are subject to retry according to CSA retry configuration.
CSA handles situations where Kube API reports a conflict upon a pod update. In this case, CSA retrieves the latest version of the pod and reapplies the update, before trying again (subject to retry configuration).
By default, CSA will yield an error if it encounters resources applied to a target container that it doesn't recognize, i.e. resources other than those specified within the pod startup or post-startup resource annotations. This may occur if resources are updated by an actor other than CSA. To allow corrective scaling upon encountering such a condition, set the `--scale-when-unknown-resources` configuration flag to `true`.
When enabled and upon encountering such conditions, CSA will:

- Indicate the condition within its status (unknown resources applied).
- Increment the `commanded_unknown_resources` metric.
- Report corrective scales to post-startup resources with a direction of `down` within the `failure` and `duration_seconds` (as applicable) metrics.
- Report corrective scales to startup resources with a direction of `up` within the `failure` and `duration_seconds` (as applicable) metrics.

CSA uses the Cobra CLI library and exposes a number of optional configuration flags. All configuration flags are always logged upon CSA start.
| Flag | Type | Default Value | Description |
|---|---|---|---|
| `--kubeconfig` | String | - | Absolute path to the cluster kubeconfig file (uses in-cluster configuration if not supplied). |
| `--leader-election-enabled` | Boolean | `true` | Whether to enable leader election. |
| `--leader-election-resource-namespace` | String | - | The namespace to create resources in if leader election is enabled (uses current namespace if not supplied). |
| `--cache-sync-period-mins` | Integer | `60` | How frequently the informer should re-sync. |
| `--graceful-shutdown-timeout-secs` | Integer | `10` | How long to allow busy workers to complete upon shutdown. |
| `--requeue-duration-secs` | Integer | `3` | How long to wait before requeuing a reconcile. |
| `--max-concurrent-reconciles` | Integer | `10` | The maximum number of concurrent reconciles. |
| `--scale-when-unknown-resources` | Boolean | `false` | Whether to scale when unknown resources are encountered. |
| Flag | Type | Default Value | Description |
|---|---|---|---|
| `--standard-retry-attempts` | Integer | `3` | The maximum number of attempts for a standard retry. |
| `--standard-retry-delay-secs` | Integer | `1` | The number of seconds to wait between standard retry attempts. |
| Flag | Type | Default Value | Description |
|---|---|---|---|
| `--log-v` | Integer | `0` | Log verbosity level (0: info, 1: debug, 2: trace) - 2 used if invalid. |
| `--log-add-caller` | Boolean | `false` | Whether to include the caller within logging output. |
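How these flags are supplied depends on how CSA is deployed - the Helm chart may expose its own values for them. Purely as a sketch, they could be passed as container args in a Deployment:

```yaml
containers:
  - name: container-startup-autoscaler   # container name is an assumption
    args:
      - --leader-election-enabled=true
      - --log-v=1
      - --scale-when-unknown-resources=false
```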
Upon pod cluster admission, CSA will attempt to upscale the target container to its startup configuration. Upscaling success depends on node loading conditions - it's therefore possible that the scale is delayed or fails altogether, particularly if a cluster consolidation mechanism is employed.

In order to mitigate the effects of initial startup upscaling, it's recommended to admit pods with the target container startup configuration already applied - CSA will not need to initially upscale in this case. Once startup has completed, the subsequent downscale to apply post-startup resources is significantly less likely to fail since it's not subject to node loading conditions. In addition, any failure mode results in overall resource over-provisioning rather than startup under-provisioning.

It's important to note that in either case, CSA will need to upscale if Kube restarts the target container.
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      labels:
        csa.expediagroup.com/enabled: "true"
      annotations:
        csa.expediagroup.com/target-container-name: target-container
        csa.expediagroup.com/cpu-startup: 500m
        csa.expediagroup.com/cpu-post-startup-requests: 100m
        csa.expediagroup.com/cpu-post-startup-limits: 100m
        csa.expediagroup.com/memory-startup: 500M
        csa.expediagroup.com/memory-post-startup-requests: 100M
        csa.expediagroup.com/memory-post-startup-limits: 100M
    spec:
      containers:
        - name: target-container
          resources:
            limits:
              cpu: 500m    # Admitted with csa.expediagroup.com/cpu-startup value
              memory: 500M # Admitted with csa.expediagroup.com/memory-startup value
            requests:
              cpu: 500m    # Admitted with csa.expediagroup.com/cpu-startup value
              memory: 500M # Admitted with csa.expediagroup.com/memory-startup value
```
Please consider carefully whether it's appropriate to scale memory during execution of your container. Memory management differs between runtimes, and it's not necessarily possible to change any runtime configuration (e.g. limits) set at the point of admission without restarting the container. Some runtimes may also default memory management settings based on available resources, which may no longer be optimal when memory is scaled.
In addition, some languages/frameworks may default configuration of concurrency mechanisms (e.g. thread pools) based on available CPU resources - this should be taken into consideration if applicable.
Unit tests can be run by executing `make test-run-unit` from the root directory.

Integration tests can be run by executing `make test-run-int` or `make test-run-int-verbose` (verbose logging) from the root directory. Please ensure you're using a version of Go that's at least that of the version that's indicated at the top of go.mod.
Integration tests are implemented as Go tests and located in `test/integration`. During initialization of the tests, a kind cluster is created (with a specific name); CSA is built via Docker and run via Helm. Tools are not bundled with the tests, so you must have the following installed locally (test development versions indicated):

The integration tests use echo-server for containers. Note: the very first execution might take some time to complete.
A number of environment variable-based configuration options are available:
| Name | Default | Description |
|---|---|---|
| `MAX_PARALLELISM` | `4` | The maximum number of tests that can run in parallel. |
| `REUSE_CLUSTER` | `false` | Whether to reuse an existing CSA kind cluster (if it already exists). |
| `INSTALL_METRICS_SERVER` | `false` | Whether to install metrics-server. |
| `KEEP_CSA` | `false` | Whether to keep the CSA installation after tests finish. |
| `KEEP_CLUSTER` | `false` | Whether to keep the CSA kind cluster after tests finish. |
| `DELETE_NS_AFTER_TEST` | `true` | Whether to delete namespaces created by tests after they conclude. |
Integration tests are executed in parallel due to their long-running nature. Each test operates within a separate Kube namespace (but using the same single CSA installation). If local resources are limited, reduce `MAX_PARALLELISM` accordingly and ensure `DELETE_NS_AFTER_TEST` is `true`. Each test typically spins up 2 pods, each with 2 containers; see source for resource allocations.
A number of Bash scripts are supplied in the `scripts/sandbox` directory that allow you to try out CSA using echo-server. The scripts are similar in nature to the setup/teardown work performed in the integration tests and have the same local tool requirements. Please ensure you're using a version of Go that's at least that of the version that's indicated at the top of go.mod. Note: the kind cluster created by the scripts is named differently to the integration tests such that both can exist in parallel, if desired.
Executing `csa-install.sh`:

- Creates the sandbox kind cluster, with its kubeconfig written to `$HOME/.kube/`.
- Builds and installs CSA with a log verbosity of `2` (trace).

Note: the very first execution might take some time to complete.
Executing `csa-tail-logs.sh` tails logs from the current CSA leader pod.

Executing `csa-get-metrics.sh` gets metrics from the current CSA leader pod.

Executing `echo-watch.sh` `watch`es the CSA status for the pod created below along with the target container's enacted resources.

Execute `echo-reinstall.sh` to (re)install echo-service with a specific probe configuration contained within the `echo` directory structure:
Admit with post-startup resources (initial upscale required):

- `echo-reinstall.sh echo/post-startup-resources/startup-probe.yaml`: single replica/container deployment with startup probe only.
- `echo-reinstall.sh echo/post-startup-resources/readiness-probe.yaml`: single replica/container deployment with readiness probe only.
- `echo-reinstall.sh echo/post-startup-resources/both-probes.yaml`: single replica/container deployment with both startup and readiness probes.

Admit with startup resources (initial upscale not required):

- `echo-reinstall.sh echo/startup-resources/startup-probe.yaml`: single replica/container deployment with startup probe only.
- `echo-reinstall.sh echo/startup-resources/readiness-probe.yaml`: single replica/container deployment with readiness probe only.
- `echo-reinstall.sh echo/startup-resources/both-probes.yaml`: single replica/container deployment with both startup and readiness probes.

To simulate workload startup/readiness, `initialDelaySeconds` is set as follows in all configurations:
| Configuration | Startup Probe | Readiness Probe |
|---|---|---|
| startup-probe.yaml | `15` | N/A |
| readiness-probe.yaml | N/A | `15` |
| both-probes.yaml | `15` | `30` |
You can also cause a validation failure by executing `echo-reinstall.sh echo/validation-failure/cpu-config.yaml`. This will yield the `cpu post-startup requests (...) is greater than startup value (...)` status message.

Execute `echo-cause-container-restart.sh` to cause the echo-service container to restart. Note: `CrashLoopBackOff` may be triggered upon executing this multiple times in succession.

Executing `echo-delete.sh` deletes the echo-server namespace (including pod).

Executing `csa-uninstall.sh` uninstalls the CSA kind cluster.
First establish a watch on CSA status and enacted container resources and optionally tail CSA logs. You may also want to observe CSA metrics.

- Install `echo/post-startup-resources/startup-probe.yaml` and watch as CSA upscales the container for startup, then downscales once the container is started.
- Install `echo/startup-resources/startup-probe.yaml` and watch as CSA only downscales once the container is started - note the CSA `lastCommanded` and `lastEnacted` status is not populated until downscale.
- Install a readiness-probe-only configuration (`echo/*/readiness-probe.yaml`) and watch as CSA only reacts to the container's `ready` status i.e. not `started`.
- Install a both-probes configuration (`echo/*/both-probes.yaml`) and watch as CSA only reacts to the container's `started` status i.e. not `ready`.
- Install `echo/validation-failure/cpu-config.yaml` and observe CSA status when a validation failure occurs.