Description
The chart currently supports only primitive autoscaling for celery workers, using HorizontalPodAutoscalers with memory metrics. This is deeply flawed, as there is not necessarily any link between RAM usage and the number of pending tasks, so your workers may fail to scale up even while tasks are queued.
We can make a task-aware autoscaler that will scale up the number of celery workers when there are not enough task slots, and scale down when there are too many.
In the past, scaling down was dangerous to use with airflow workers, as Kubernetes had no way to influence which Pods were removed, meaning Kubernetes would often remove a busy worker while other workers were sitting idle.
As of Kubernetes 1.22, there is a beta annotation for Pods managed by ReplicaSets called `controller.kubernetes.io/pod-deletion-cost`, which tells Kubernetes how "expensive" killing a particular Pod is when decreasing the replicas count.
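For illustration, here is a minimal sketch of how an autoscaler could set that annotation with the official `kubernetes` Python client (the helper name and namespace handling are hypothetical, not part of the chart):

```python
# Minimal sketch (hypothetical helper, not part of the chart): mark a worker Pod's
# "deletion cost" so the ReplicaSet controller prefers to remove cheaper Pods first.
from kubernetes import client, config

config.load_incluster_config()  # assumes the autoscaler runs inside the cluster
core_v1 = client.CoreV1Api()

def set_deletion_cost(pod_name: str, namespace: str, cost: int) -> None:
    """Patch the Pod with controller.kubernetes.io/pod-deletion-cost (lower = deleted first)."""
    core_v1.patch_namespaced_pod(
        name=pod_name,
        namespace=namespace,
        body={
            "metadata": {
                "annotations": {"controller.kubernetes.io/pod-deletion-cost": str(cost)}
            }
        },
    )
```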
NOTE: Previously we considered using KEDA (#103) to manage autoscaling, but this will not work with `controller.kubernetes.io/pod-deletion-cost`, as the HorizontalPodAutoscaler created by KEDA cannot patch the required annotations BEFORE scaling down.
Our Celery Worker Autoscaler can perform the following loop:

- Cleanup from any past loops:
  - Remove any `controller.kubernetes.io/pod-deletion-cost` annotations
    - NOTE: there will only be dangling annotations if Kubernetes did not remove our "chosen" Pods, or if the autoscaler crashed halfway through a loop
    - NOTE: we need to attempt to prevent multiple instances of our autoscaler running at a time
  - Send each worker Pod that we removed an annotation from an `app.control.add_consumer()` command, so it resumes picking up new airflow tasks
- Calculate the ideal number of worker `replicas` for the current task load (see the first sketch after this list):
  - if the `load factor` of workers is above `A` for `B` time --> increase `replicas` to meet the target `load factor`
  - if the `load factor` of workers is below `X` for `Y` time --> decrease `replicas` to meet the target `load factor`
  - NOTE: the `load factor` is the number of available task slots which are consumed
  - NOTE: we should put some limit on the number of scaling decisions per `A` seconds (to prevent a yo-yo effect), perhaps with separate limits for down and up to allow faster upscaling
  - NOTE: we should have a "scaling algorithm" config, even if we only start with 1
  - NOTE: we should have `minimum` and `maximum` replicas configs
  - NOTE: if using CeleryKubernetesExecutor, we must exclude tasks that are in the `AIRFLOW__CELERY_KUBERNETES_EXECUTOR__KUBERNETES_QUEUE`
- If `replicas` are going to be decreased by `N` (see the second sketch after this list):
  - Sort the worker pods by their `pod-deletion-cost` in ascending order
    - NOTE: the `pod-deletion-cost` is the `number of running tasks`, weighted by the `total running time` of each task (so long-running tasks are not needlessly evicted); specifically, we want a small number of long-running tasks to be weighted higher than a larger number of short-running tasks
    - NOTE: add a DAG/Task label which will prevent any worker running it from being killed (or allow a "weighting" per Task)
  - Annotate the `N` worker Pods with the lowest cost with the `controller.kubernetes.io/pod-deletion-cost` annotation
    - NOTE: if there are pods in a Pending/Unready state, we can reduce `N` by this number, as Kubernetes will remove these pods first
  - Send each worker Pod that was annotated an `app.control.cancel_consumer(...)` command, so it does not pick up new airflow tasks after being "marked" for deletion
  - Patch the `replicas` down by `N`
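A first sketch of the scaling calculation, purely illustrative: `WORKER_CONCURRENCY`, `TARGET_LOAD_FACTOR`, and `MIN_REPLICAS`/`MAX_REPLICAS` are hypothetical configs, and consumed task slots are counted via Celery's inspection API:

```python
# Illustrative sketch only -- all config names here are hypothetical, not existing chart values.
import math
from celery import Celery

app = Celery(broker="redis://airflow-redis:6379/0")  # broker URL is an assumption

WORKER_CONCURRENCY = 16     # task slots per worker (AIRFLOW__CELERY__WORKER_CONCURRENCY)
TARGET_LOAD_FACTOR = 0.75   # target fraction of task slots consumed
MIN_REPLICAS, MAX_REPLICAS = 1, 16

def consumed_slots() -> int:
    """Count task slots currently consumed across all workers (running + prefetched)."""
    inspect = app.control.inspect()
    active = inspect.active() or {}      # {worker_name: [task, ...]}
    reserved = inspect.reserved() or {}  # tasks prefetched but not yet started
    # NOTE: with CeleryKubernetesExecutor, tasks in the kubernetes queue would need excluding here
    return sum(len(t) for t in active.values()) + sum(len(t) for t in reserved.values())

def ideal_replicas() -> int:
    """Number of workers needed to bring the load factor back to the target."""
    wanted = math.ceil(consumed_slots() / (WORKER_CONCURRENCY * TARGET_LOAD_FACTOR))
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))
```

The `A`/`B` and `X`/`Y` hold-times and the per-interval limit on scaling decisions would wrap around this calculation in the main loop; they are omitted here for brevity.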
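And a sketch of the scale-down path, assuming worker Pods can be matched to their Celery hostname (`celery@<pod-name>`) and that chosen Pods are marked with a negative deletion cost so they rank below unannotated Pods (which default to `0`); the weighting function, namespace, Deployment, and queue names are all assumptions:

```python
# Illustrative sketch only -- weighting, naming, and the negative-cost convention are assumptions.
import time
from celery import Celery
from kubernetes import client, config

app = Celery(broker="redis://airflow-redis:6379/0")  # assumption
config.load_incluster_config()
core_v1 = client.CoreV1Api()
apps_v1 = client.AppsV1Api()

NAMESPACE, DEPLOYMENT, QUEUE = "airflow", "airflow-worker", "default"  # assumptions

def deletion_cost(tasks: list) -> int:
    """Weight running tasks by their running time, so long-running tasks make a Pod 'expensive'."""
    now = time.time()
    return int(sum(1 + (now - (task.get("time_start") or now)) / 60 for task in tasks))

def scale_down(n: int) -> None:
    inspect = app.control.inspect()
    active = inspect.active() or {}  # {"celery@<pod-name>": [task, ...]}
    # pair each worker Pod with its cost, cheapest first
    costed = sorted(
        ((name.split("@", 1)[-1], deletion_cost(tasks)) for name, tasks in active.items()),
        key=lambda pair: pair[1],
    )
    for pod_name, cost in costed[:n]:
        # mark the Pod as cheap to delete ...
        core_v1.patch_namespaced_pod(
            name=pod_name,
            namespace=NAMESPACE,
            body={"metadata": {"annotations": {
                "controller.kubernetes.io/pod-deletion-cost": str(-cost),
            }}},
        )
        # ... and stop it consuming new airflow tasks while it waits to be removed
        app.control.cancel_consumer(QUEUE, destination=[f"celery@{pod_name}"])
    # finally, reduce the Deployment's replicas so Kubernetes removes the marked Pods
    current = apps_v1.read_namespaced_deployment(DEPLOYMENT, NAMESPACE).spec.replicas
    apps_v1.patch_namespaced_deployment(
        name=DEPLOYMENT,
        namespace=NAMESPACE,
        body={"spec": {"replicas": current - n}},
    )
```

The Pending/Unready discount on `N` and the per-Task "do not kill" weighting from the notes above are left out, but would slot into the sorting step.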
Important changes to make this work:
- We will need to use a Deployment for the workers (rather than a StatefulSet), as `controller.kubernetes.io/pod-deletion-cost` is only for Pods in ReplicaSets
- Because `controller.kubernetes.io/pod-deletion-cost` is alpha in `1.21` and beta in `1.22`, for older Kubernetes versions we can let users use the CloneSet from the CNCF project called OpenKruise (instead of a `Deployment`), as they have back-ported the `controller.kubernetes.io/pod-deletion-cost` annotation.