container-startup-autoscaler 🚀

container-startup-autoscaler (CSA) is a Kubernetes controller that modifies the CPU and/or memory resources of containers depending on whether they’re starting up, according to the startup/post-startup settings you supply. CSA works at the pod level and is agnostic to how the pod is managed; it works with deployments, statefulsets, daemonsets and other workload management APIs.

An overview diagram of CSA showing when target containers are scaled

CSA is implemented using controller-runtime.

CSA is built around Kube’s In-place Update of Pod Resources feature, which is currently in alpha state as of Kubernetes 1.29 and therefore requires the InPlacePodVerticalScaling feature gate to be enabled. Beta/stable targets are indicated here. The feature implementation (along with the corresponding implementation of CSA) is likely to change until it reaches stable status. See CHANGELOG.md for details of CSA versions and Kubernetes version compatibility.

⚠️ This controller should currently only be used for preview purposes on local or otherwise non-production Kubernetes clusters.

Demo Video

A local sandbox is provided for previewing CSA - this video shows fundamental CSA operation using the sandbox scripts:

https://github.com/ExpediaGroup/container-startup-autoscaler/assets/76996781/fcea0175-4f09-43d3-9bad-de5aed8806f2

Docker Images

Versioned multi-arch Docker images are available via Docker Hub.

Helm Chart

A CSA Helm chart is available - please see its README.md for more information.

Motivation

The release of Kubernetes 1.27.0 introduced a new, long-awaited alpha feature: In-place Update of Pod Resources. This feature allows pod container resources (requests and limits) to be updated in-place, without the need to restart the pod. Prior to this, any changes made to container resources required a pod restart to apply.

A historical concern of running workloads within Kubernetes is how to tune container resources for workloads that have very different resource utilization characteristics during two core phases: startup and post-startup. Given the previous lack of ability to change container resources in-place, there was generally a tradeoff for startup-heavy workloads between obtaining good (and consistent) startup times and overall resource wastage, post-startup:

Employ Burstable QoS

Set limits greater than requests in the hope that resources beyond requests are actually scavengeable during startup.

Employ Guaranteed QoS (1)

Set limits the same as requests, with startup time as the primary factor in determining the value.

Employ Guaranteed QoS (2)

Set limits the same as requests, with normal workload servicing performance as the primary factor in determining the value.


The core motivation of CSA is to leverage the new In-place Update of Pod Resources Kube feature to provide workload owners with the ability to configure container resources for startup (in a guaranteed fashion) separately from normal post-startup workload resources. In doing so, the tradeoffs listed above are eliminated and the foundations are laid for:

How it Works

CSA is able to target a single non-init/ephemeral container within a pod. Configuration such as the target container name and desired startup/post-startup resource settings are contained within a number of pod annotations.

CSA watches for changes in pods that are marked as eligible for scaling (via a label). Upon processing an eligible pod’s changes, CSA examines the current state of the target container and takes one of several actions based on that state:

CSA will react when the target container is initially created (by its pod) and if Kube restarts the target container.

CSA will not perform any scaling action if it doesn’t need to - for example, if the target container repeatedly fails to start prior to it becoming ready (with Kube reacting with restarts in a CrashLoopBackOff manner), CSA will only apply startup resources once.

CSA generates metrics and pod Kube events, along with a detailed status that’s included within an annotation of the scaled pod.

Limitations

The following limitations are currently in place:

Restrictions

The following restrictions are currently in place and enforced where applicable:

Scale Configuration

Labels

The following labels must be present in the pod that includes your target container:

| Name | Value | Description |
|---|---|---|
| csa.expediagroup.com/enabled | "true" | Indicates a container in the pod is eligible for scaling - must be "true". |

Annotations

The following annotations must be present in the pod that includes your target container:

| Name | Example Value | Description |
|---|---|---|
| csa.expediagroup.com/target-container-name | "mycontainer" | The name of the container to target. |
| csa.expediagroup.com/cpu-startup | "500m"* | Startup CPU (applied to both requests and limits). |
| csa.expediagroup.com/cpu-post-startup-requests | "250m"* | Post-startup CPU requests. |
| csa.expediagroup.com/cpu-post-startup-limits | "250m"* | Post-startup CPU limits. |
| csa.expediagroup.com/memory-startup | "500M"* | Startup memory (applied to both requests and limits). |
| csa.expediagroup.com/memory-post-startup-requests | "250M"* | Post-startup memory requests. |
| csa.expediagroup.com/memory-post-startup-limits | "250M"* | Post-startup memory limits. |

* Any CPU/memory form listed here can be used.
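
The annotation values are Kubernetes quantities, and the sandbox notes later in this README show that CSA reports a validation failure when CPU post-startup requests exceed the startup value. The following is a minimal Python sketch of a similar pre-flight check that could be run before applying a manifest. It is not CSA's own validation: `parse_quantity` and `check_annotations` are hypothetical helper names, only a subset of quantity suffixes is handled, and applying the startup comparison to memory as well as CPU is an assumption.

```python
from decimal import Decimal

PREFIX = "csa.expediagroup.com/"

# Subset of Kubernetes quantity suffixes (assumption: enough for typical values).
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
}


def parse_quantity(value: str) -> Decimal:
    """Convert a quantity such as "500m" or "250M" to a plain number."""
    for suffix, factor in _SUFFIXES.items():
        if value.endswith(suffix):
            return Decimal(value[: -len(suffix)]) * factor
    return Decimal(value)  # no suffix: already a plain number


def check_annotations(annotations: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for resource in ("cpu", "memory"):
        startup = parse_quantity(annotations[f"{PREFIX}{resource}-startup"])
        requests = parse_quantity(annotations[f"{PREFIX}{resource}-post-startup-requests"])
        limits = parse_quantity(annotations[f"{PREFIX}{resource}-post-startup-limits"])
        if requests > startup:
            problems.append(f"{resource} post-startup requests is greater than startup value")
        if requests > limits:
            # Kubernetes itself rejects requests that exceed limits.
            problems.append(f"{resource} post-startup requests is greater than post-startup limits")
    return problems
```

Passing the example values from the table above to `check_annotations` should return an empty list.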

Probes

CSA needs to know when the target container is starting up and therefore requires you to specify an appropriately configured startup or readiness probe (or both).

If the target container specifies a startup probe, CSA always uses Kube’s started signal of the container’s status to determine whether the container is started. Otherwise, if only a readiness probe is specified, CSA primarily uses the ready signal of the container’s status to determine whether the container is started.

It’s preferable to have a startup probe defined since this unambiguously indicates whether a container is started whereas only a readiness probe may indicate other conditions that will cause unnecessary scaling (e.g. the readiness probe transiently failing post-startup).
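
The signal selection described above can be summarized in a few lines of Python. This is only a sketch of the documented rule, not CSA's code: `is_started` is a hypothetical name, and returning `None` when no probe is configured is an assumption made here to represent a configuration CSA does not support.

```python
from typing import Optional


def is_started(has_startup_probe: bool, has_readiness_probe: bool,
               started: bool, ready: bool) -> Optional[bool]:
    """Pick the Kube container status signal that indicates startup has completed."""
    if has_startup_probe:
        # A startup probe is present: the 'started' signal is always used.
        return started
    if has_readiness_probe:
        # Only a readiness probe is present: 'ready' is the primary signal.
        return ready
    # Neither probe is configured, so there is no signal to rely on.
    return None
```

Note how a transient readiness failure post-startup flips the result back to `False` in the readiness-only case, which is why a startup probe is the safer choice.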


Kube’s container status started and ready signal behavior is as follows:

When only a startup probe is present:

When only a readiness probe is present:

When both startup and readiness probes are present:

Status

CSA reports its status in JSON via the csa.expediagroup.com/status annotation. You can retrieve and format the status using kubectl and jq as follows:

kubectl get pod <name> -n <namespace> -o=jsonpath='{.metadata.annotations.csa\.expediagroup\.com\/status}' | jq

Example output:

{
  "status": "Post-startup resources enacted",
  "states": {
    "startupProbe": "true",
    "readinessProbe": "true",
    "container": "running",
    "started": "true",
    "ready": "false",
    "resources": "poststartup",
    "allocatedResources": "containerrequestsmatch",
    "statusResources": "containerresourcesmatch"
  },
  "scale": {
    "lastCommanded": "2023-09-14T08:18:44.174+0000",
    "lastEnacted": "2023-09-14T08:18:45.382+0000",
    "lastFailed": ""
  },
  "lastUpdated": "2023-09-14T08:18:45+0000"
}

Explanation of status items:

| Item | Sub Item | Description |
|---|---|---|
| status | - | Human-readable status. Any validation errors are indicated here. |
| states | - | The states of the target container. |
| states | startupProbe | Whether a startup probe exists. |
| states | readinessProbe | Whether a readiness probe exists. |
| states | container | The container status e.g. waiting, running. |
| states | started | Whether the container is signalled as started by Kube. |
| states | ready | Whether the container is signalled as ready by Kube. |
| states | resources | The type of resources (startup/post-startup) that are currently applied (but not necessarily enacted). |
| states | allocatedResources | How the reported container allocated resources relate to container requests. |
| states | statusResources | How the reported currently allocated resources relate to container resources. |
| scale | - | Information around scaling activity. |
| scale | lastCommanded | The last time a scale was commanded (UTC). |
| scale | lastEnacted | The last time a scale was enacted (UTC; empty if failed). |
| scale | lastFailed | The last time a scale failed (UTC; empty if enacted). |
| lastUpdated | - | The last time this status was updated. |
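
Once retrieved, the status JSON is easy to consume from a script. The sketch below, in Python, computes how long the most recent scale took from `lastCommanded` to `lastEnacted`. The field names come from the example output above; the timestamp format string is inferred from that example rather than from a documented contract, and `scale_duration_secs` is a hypothetical helper name.

```python
import json
from datetime import datetime
from typing import Optional

# Format inferred from the example status, e.g. "2023-09-14T08:18:44.174+0000".
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def scale_duration_secs(status_json: str) -> Optional[float]:
    """Return seconds between lastCommanded and lastEnacted, or None if either is empty."""
    scale = json.loads(status_json)["scale"]
    if not scale["lastCommanded"] or not scale["lastEnacted"]:
        # Nothing commanded yet, or the last scale failed (lastEnacted is empty).
        return None
    commanded = datetime.strptime(scale["lastCommanded"], TIME_FORMAT)
    enacted = datetime.strptime(scale["lastEnacted"], TIME_FORMAT)
    return (enacted - commanded).total_seconds()
```

The input would typically be the output of the kubectl command shown earlier in this section.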

Events

The following Kube events for the pod that houses the target container are generated:

Normal Events

| Trigger | Reason |
|---|---|
| Startup resources are commanded. | Scaling |
| Startup resources are enacted. | Scaling |
| Post-startup resources are commanded. | Scaling |
| Post-startup resources are enacted. | Scaling |

Warning Events

| Trigger | Reason |
|---|---|
| Validation failure. | Validation |
| Failed to scale commanded startup resources. | Scaling |
| Failed to scale commanded post-startup resources. | Scaling |

Logging

CSA uses the logr API with zerologr to log JSON-based error-, info-, debug- and trace-level messages.

When configuring verbosity, info-level messages have a verbosity (v) of 0, debug-level messages have a v of 1, and trace-level messages have a v of 2 - this is mapped via zerologr. Regardless of configured logging verbosity, error-level messages are always emitted.
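
As a rough illustration of how verbosity filtering behaves, the Python sketch below uses the level mapping given in the --log-v flag description (0: info, 1: debug, 2: trace). It assumes the usual logr convention that a message is emitted when its v is less than or equal to the configured v; `is_emitted` is a hypothetical name and this is not CSA's logging code.

```python
# Level-to-verbosity mapping, as listed for the --log-v flag.
LEVEL_V = {"info": 0, "debug": 1, "trace": 2}


def is_emitted(level: str, configured_v: int) -> bool:
    """Return whether a message at 'level' is emitted at the configured verbosity."""
    if level == "error":
        # Error-level messages are always emitted, whatever the verbosity.
        return True
    return LEVEL_V[level] <= configured_v
```

With the default verbosity of 0, only info-level and error-level messages pass this check.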

Example info-level log:

{
	"level": "info",
	"controller": "container-startup-autoscaler",
	"namespace": "echoserver",
	"name": "echoserver-5f65d8f65d-mvqt8",
	"reconcileID": "6157dd49-7aa9-4cac-bbaf-a739fa48cc61",
	"targetname": "echoserver",
	"targetstates": {
		"startupProbe": "true",
		"readinessProbe": "true",
		"container": "running",
		"started": "true",
		"ready": "false",
		"resources": "poststartup",
		"allocatedResources": "containerrequestsmatch",
		"statusResources": "containerresourcesmatch"
	},
	"caller": "container-startup-autoscaler/internal/pod/targetcontaineraction.go:472",
	"time": 1694681974425,
	"message": "post-startup resources enacted"
}

Each message includes a number of keys that originate from controller-runtime and zerologr. CSA-added values include:

Regardless of configured logging verbosity, error-level messages are always displayed and additionally include a stack trace key (stacktrace), if available.
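
Because each log entry is a single JSON object, the logs are straightforward to post-process. The following Python sketch condenses an entry into one readable line. It assumes the `time` key holds epoch milliseconds, which is inferred from the example above rather than stated anywhere; `summarize_log_line` is a hypothetical helper name.

```python
import json
from datetime import datetime, timedelta, timezone


def summarize_log_line(line: str) -> str:
    """Condense one CSA JSON log entry into 'time level namespace/name: message'."""
    entry = json.loads(line)
    millis = entry["time"]  # assumption: epoch milliseconds
    # Integer arithmetic avoids floating point rounding of the millisecond part.
    when = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    when += timedelta(milliseconds=millis % 1000)
    pod = f'{entry.get("namespace", "?")}/{entry.get("name", "?")}'
    stamp = when.isoformat(timespec="milliseconds")
    return f'{stamp} {entry["level"]} {pod}: {entry["message"]}'
```

This could be applied to each line of the output tailed from the CSA pod, for example to follow scaling activity for one namespace.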

Metrics

Additional CSA-specific metrics are registered with the Prometheus registry provided by controller-runtime and exposed on port 8080 at path /metrics, e.g. http://localhost:8080/metrics. CSA metrics are not pre-initialized with 0 values.

Reconciler

Prefixed with csa_reconciler_:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| skipped_only_status_change | Counter | controller | Number of reconciles that were skipped because only the scaler controller status changed. |
| existing_in_progress | Counter | controller | Number of attempted reconciles where one was already in progress for the same namespace/name (results in a requeue). |
| failure_unable_to_get_pod | Counter | controller | Number of reconciles where there was a failure to get the pod (results in a requeue). |
| failure_pod_doesnt_exist | Counter | controller | Number of reconciles where the pod was found not to exist (results in failure). |
| failure_validation | Counter | controller | Number of reconciles where there was a failure to validate (results in failure). |
| failure_states_determination | Counter | controller | Number of reconciles where there was a failure to determine states (results in failure). |
| failure_states_action | Counter | controller | Number of reconciles where there was a failure to action the determined states (results in failure). |

Labels:

Scale

Prefixed with csa_scale_:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| failure | Counter | controller, direction, reason | Number of scale failures. |
| commanded_unknown_resources | Counter | controller | Number of scales commanded upon encountering unknown resources (see here). |
| duration_seconds | Histogram | controller, direction, outcome | Scale duration (from commanded to enacted). |

Labels:

Kube API Retry

Prefixed with csa_retrykubeapi_:

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| retry | Counter | controller, reason | Number of Kube API retries. |

Labels:

See below for more information on retries.
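
To inspect only the CSA metrics from a scrape, the Prometheus text output can be filtered on the csa_ prefix. The Python sketch below is a deliberately simple parser: it assumes sample lines carry no trailing timestamp, and `csa_samples` is a hypothetical helper name, not part of CSA.

```python
def csa_samples(metrics_text: str) -> dict:
    """Return {series (name plus labels): value} for CSA metrics in a /metrics scrape."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # Keep only CSA-specific series.
        if not line.startswith("csa_"):
            continue
        # The value is the last space-separated token (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

The text could be fetched with, for example, `urllib.request.urlopen("http://localhost:8080/metrics").read().decode()`. Since CSA metrics are not pre-initialized with 0 values, a series that is absent from the result simply has not been recorded yet, so `samples.get(series, 0.0)` is a reasonable way to read a counter.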

Retry

Kube API

Unless Kube API reports that a pod is not found upon trying to retrieve it, all Kube API interactions are subject to retry according to CSA retry configuration.

CSA handles situations where Kube API reports a conflict upon a pod update. In this case, CSA retrieves the latest version of the pod and reapplies the update, before trying again (subject to retry configuration).
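
The conflict-handling flow described above follows a common optimistic-concurrency pattern: refresh the object, reapply the change, and try again. The Python sketch below illustrates that pattern in a client-agnostic way. It is not CSA's implementation: `ConflictError` and the three callables are hypothetical stand-ins for Kube API interactions, the default attempts and delay mirror the --standard-retry-attempts and --standard-retry-delay-secs defaults, and waiting between conflict retries is an assumption.

```python
import time


class ConflictError(Exception):
    """Hypothetical stand-in for a Kube API conflict response on update."""


def update_with_conflict_retry(get_latest, mutate, update,
                               attempts=3, delay_secs=1.0, sleep=time.sleep):
    """Apply 'mutate' to a pod and update it, refreshing the pod on conflict."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    pod = get_latest()
    for attempt in range(1, attempts + 1):
        mutate(pod)  # (re)apply the desired change to the current copy
        try:
            return update(pod)
        except ConflictError:
            if attempt == attempts:
                raise  # retries exhausted: surface the conflict
            sleep(delay_secs)
            pod = get_latest()  # fetch the latest version before retrying
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays.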

Encountering Unknown Resources

By default, CSA will yield an error if it encounters resources applied to a target container that it doesn’t recognize i.e. resources other than those specified within the pod startup or post-startup resource annotations. This may occur if resources are updated by an actor other than CSA. To allow corrective scaling upon encountering such a condition, set the --scale-when-unknown-resources configuration flag to true.

When enabled and upon encountering such conditions, CSA will:

CSA Configuration

CSA uses the Cobra CLI library and exposes a number of optional configuration flags. All configuration flags are always logged upon CSA start.

Controller

| Flag | Type | Default Value | Description |
|---|---|---|---|
| --kubeconfig | String | - | Absolute path to the cluster kubeconfig file (uses in-cluster configuration if not supplied). |
| --leader-election-enabled | Boolean | true | Whether to enable leader election. |
| --leader-election-resource-namespace | String | - | The namespace to create resources in if leader election is enabled (uses current namespace if not supplied). |
| --cache-sync-period-mins | Integer | 60 | How frequently the informer should re-sync. |
| --graceful-shutdown-timeout-secs | Integer | 10 | How long to allow busy workers to complete upon shutdown. |
| --requeue-duration-secs | Integer | 3 | How long to wait before requeuing a reconcile. |
| --max-concurrent-reconciles | Integer | 10 | The maximum number of concurrent reconciles. |
| --scale-when-unknown-resources | Boolean | false | Whether to scale when unknown resources are encountered. |

Retry

| Flag | Type | Default Value | Description |
|---|---|---|---|
| --standard-retry-attempts | Integer | 3 | The maximum number of attempts for a standard retry. |
| --standard-retry-delay-secs | Integer | 1 | The number of seconds to wait between standard retry attempts. |

Log

| Flag | Type | Default Value | Description |
|---|---|---|---|
| --log-v | Integer | 0 | Log verbosity level (0: info, 1: debug, 2: trace) - 2 used if invalid. |
| --log-add-caller | Boolean | false | Whether to include the caller within logging output. |

Pod Admission Considerations

Upon pod cluster admission, CSA will attempt to upscale the target container to its startup configuration. Upscaling success depends on node loading conditions - it’s therefore possible that the scale is delayed or fails altogether, particularly if a cluster consolidation mechanism is employed.

In order to mitigate the effects of initial startup upscaling, it’s recommended to admit pods with the target container startup configuration already applied - CSA will not need to initially upscale in this case. Once startup has completed, the subsequent downscale to apply post-startup resources is significantly less likely to fail since it’s not subject to node loading conditions. In addition, any failure mode results in overall resource over-provisioning rather than startup under-provisioning.

It’s important to note that in either case, CSA will need to upscale if Kube restarts the target container.

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      labels:
        csa.expediagroup.com/enabled: "true"
      annotations:
        csa.expediagroup.com/target-container-name: target-container
        csa.expediagroup.com/cpu-startup: 500m
        csa.expediagroup.com/cpu-post-startup-requests: 100m
        csa.expediagroup.com/cpu-post-startup-limits: 100m
        csa.expediagroup.com/memory-startup: 500M
        csa.expediagroup.com/memory-post-startup-requests: 100M
        csa.expediagroup.com/memory-post-startup-limits: 100M
    spec:
      containers:
      - name: target-container
        resources:
          limits:
            cpu: 500m    # Admitted with csa.expediagroup.com/cpu-startup value 
            memory: 500M # Admitted with csa.expediagroup.com/memory-startup value
          requests:
            cpu: 500m    # Admitted with csa.expediagroup.com/cpu-startup value
            memory: 500M # Admitted with csa.expediagroup.com/memory-startup value
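
If manifests are generated programmatically, the container resources for admission can be derived from the startup annotations so the two never drift apart. The Python sketch below shows the idea; `startup_resources` is a hypothetical helper name, not part of CSA or its Helm chart.

```python
PREFIX = "csa.expediagroup.com/"


def startup_resources(annotations: dict) -> dict:
    """Build a container 'resources' mapping that admits the pod with startup values."""
    startup = {
        "cpu": annotations[PREFIX + "cpu-startup"],
        "memory": annotations[PREFIX + "memory-startup"],
    }
    # Startup values apply to both limits and requests.
    return {"limits": dict(startup), "requests": dict(startup)}
```

For the annotations in the manifest above, this yields limits and requests of 500m CPU and 500M memory, matching the resources block shown.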

Container Scaling Considerations

Please consider carefully whether it’s appropriate to scale memory during execution of your container. Memory management differs between runtimes, and it’s not necessarily possible to change any runtime configuration (e.g. limits) set at the point of admission without restarting the container. Some runtimes may also default memory management settings based on available resources, which may no longer be optimal when memory is scaled.

In addition, some languages/frameworks may default configuration of concurrency mechanisms (e.g. thread pools) based on available CPU resources - this should be taken into consideration if applicable.

Best Practices

Tests

Unit

Unit tests can be run by executing make test-run-unit from the root directory.

Integration

Integration tests can be run by executing make test-run-int or make test-run-int-verbose (verbose logging) from the root directory. Please ensure your Go version is at least the version indicated at the top of go.mod.

Integration tests are implemented as Go tests and located in test/integration. During initialization of the tests, a kind cluster is created (with a specific name); CSA is built via Docker and run via Helm. Tools are not bundled with the tests, so you must have the following installed locally (test development versions indicated):

The integration tests use echo-server for containers. Note: the very first execution might take some time to complete.

A number of environment variable-based configuration options are available:

| Name | Default | Description |
|---|---|---|
| MAX_PARALLELISM | 4 | The maximum number of tests that can run in parallel. |
| REUSE_CLUSTER | false | Whether to reuse an existing CSA kind cluster (if it already exists). |
| INSTALL_METRICS_SERVER | false | Whether to install metrics-server. |
| KEEP_CSA | false | Whether to keep the CSA installation after tests finish. |
| KEEP_CLUSTER | false | Whether to keep the CSA kind cluster after tests finish. |
| DELETE_NS_AFTER_TEST | true | Whether to delete namespaces created by tests after they conclude. |

Integration tests are executed in parallel due to their long-running nature. Each test operates within a separate Kube namespace (but using the same single CSA installation). If local resources are limited, reduce MAX_PARALLELISM accordingly and ensure DELETE_NS_AFTER_TEST is true. Each test typically spins up 2 pods, each with 2 containers; see source for resource allocations.

Running Locally

A number of Bash scripts are supplied in the scripts/sandbox directory that allow you to try out CSA using echo-server. The scripts are similar in nature to the setup/teardown work performed in the integration tests and have the same local tool requirements. Please ensure your Go version is at least the version indicated at the top of go.mod. Note: the kind cluster created by the scripts is named differently from the one used by the integration tests so that both can exist in parallel, if desired.

Cluster/CSA Installation

Executing csa-install.sh:

Note: the very first execution might take some time to complete.

Tailing CSA Logs

Executing csa-tail-logs.sh tails logs from the current CSA leader pod.

Getting CSA Metrics

Executing csa-get-metrics.sh gets metrics from the current CSA leader pod.

Watching CSA Status and Enacted Container Resources

Executing echo-watch.sh watches the CSA status for the pod created below along with the target container’s enacted resources.

(Re)installing echo-service

Execute echo-reinstall.sh to (re)install echo-service with a specific probe configuration contained within the echo directory structure:

Admit with post-startup resources (initial upscale required):

Admit with startup resources (initial upscale not required):

To simulate workload startup/readiness, initialDelaySeconds is set as follows in all configurations:

| Configuration | Startup Probe | Readiness Probe |
|---|---|---|
| startup-probe.yaml | 15 | N/A |
| readiness-probe.yaml | N/A | 15 |
| both-probes.yaml | 15 | 30 |

You can also cause a validation failure by executing echo-reinstall.sh echo/validation-failure/cpu-config.yaml. This will yield the "cpu post-startup requests (...) is greater than startup value (...)" status message.

Causing an echo-server Container Restart

Execute echo-cause-container-restart.sh to cause the echo-service container to restart. Note: CrashLoopBackOff may be triggered upon executing this multiple times in succession.

Deleting echo-service

Executing echo-delete.sh deletes the echo-server namespace (including pod).

Cluster/CSA Uninstallation

Executing csa-uninstall.sh uninstalls the CSA kind cluster.

Stuff to Try

First establish a watch on CSA status and enacted container resources and optionally tail CSA logs. You may also want to observe CSA metrics.

  1. Install echo-server with echo/post-startup-resources/startup-probe.yaml and watch as CSA upscales the container for startup, then downscales once the container is started.
  2. Install echo-server with echo/startup-resources/startup-probe.yaml and watch as CSA only downscales once the container is started - note the CSA lastCommanded and lastEnacted status is not populated until downscale.
  3. Repeat 1) and 2) above with a readiness probe only (echo/*/readiness-probe.yaml) and watch as CSA only reacts to the container’s ready status i.e. not started.
  4. Repeat 1) and 2) above with both probes (echo/*/both-probes.yaml) and watch as CSA only reacts to the container’s started status i.e. not ready.
  5. Cause a container restart after post-startup resources are enacted and watch as CSA (re)upscales the container for startup, then downscales once the container is started.
  6. Cause a container restart repeatedly after startup resources are enacted and watch as CSA doesn’t take any action until downscaling after the container is started.
  7. Install echo-server with echo/validation-failure/cpu-config.yaml and observe CSA status when a validation failure occurs.