task-aware celery worker autoscaling (+ `pod-deletion-cost`)

The chart currently supports primitive [autoscaling for celery workers](https://github.com/airflow-helm/charts/blob/main/charts/airflow/docs/faq/configuration/autoscaling-celery-workers.md), using HorizontalPodAutoscalers with memory metrics. But this is very flawed, because RAM usage is not necessarily linked to the number of pending tasks, meaning your workers might not scale up despite there being pending tasks.

We can make a task-aware autoscaler that will __scale up__ the number of celery workers when there are not enough task slots, and __scale down__ when there are too many.

In the past, __scale down__ was dangerous to use with airflow workers, as there was no way to influence which Pods Kubernetes removed, meaning Kubernetes would often remove a busy worker while other workers were doing nothing.

As of Kubernetes 1.22, there is a __beta__ annotation for `Pods` managed by `ReplicaSets` called [`controller.kubernetes.io/pod-deletion-cost`](https://kubernetes.io/docs/reference/labels-annotations-taints/#pod-deletion-cost), which tells Kubernetes how "expensive" killing a particular `Pod` is when decreasing the `replicas` count.
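To make the annotation concrete, here is a minimal sketch of the patch body an autoscaler might send to a worker Pod. The helper name `deletion_cost_patch` is hypothetical, and the commented call assumes the official `kubernetes` Python client. Kubernetes prefers to delete Pods with a lower cost first, treats Pods without the annotation as cost `0`, and requires the value to be a string holding a 32-bit integer.

```python
from typing import Optional

DELETION_COST_ANNOTATION = "controller.kubernetes.io/pod-deletion-cost"
MAX_COST = 2147483647  # the annotation value must fit in a 32-bit integer


def deletion_cost_patch(cost: Optional[int]) -> dict:
    """Build a patch body that sets the annotation, or removes it when cost is None.

    NOTE: this helper is illustrative only, it is not part of the chart.
    """
    if cost is not None and not -MAX_COST <= cost <= MAX_COST:
        raise ValueError(f"pod-deletion-cost out of range: {cost}")
    # annotation values are always strings; a null value in a merge patch removes the key
    value = None if cost is None else str(cost)
    return {"metadata": {"annotations": {DELETION_COST_ANNOTATION: value}}}


# Assuming the official `kubernetes` client, the body could be applied like this
# (the Pod name and namespace are placeholders):
#
#   from kubernetes import client, config
#   config.load_incluster_config()
#   client.CoreV1Api().patch_namespaced_pod(
#       name="airflow-worker-abc", namespace="airflow", body=deletion_cost_patch(-100)
#   )
```

Passing `None` gives the body for the "cleanup" case, where a dangling annotation needs to be removed.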

> __NOTE:__ Previously we considered using KEDA (https://github.com/airflow-helm/charts/issues/103) to manage autoscaling, but this will not work with `controller.kubernetes.io/pod-deletion-cost`, as the HorizontalPodAutoscaler created by KEDA cannot patch the required annotations BEFORE scaling down.

---

__Our `Celery Worker Autoscaler` can perform the following loop:__

1. Cleanup from any past loops:
    1. Remove any `controller.kubernetes.io/pod-deletion-cost` annotations
         - _NOTE: there will only be dangling annotations if Kubernetes did not remove our "chosen" Pods, or if the autoscaler crashed halfway through a loop_
         - _NOTE: we should try to prevent multiple instances of our autoscaler from running at the same time_
    2. Send an [`app.control.add_consumer()`](https://docs.celeryq.dev/en/stable/userguide/workers.html#queues-adding-consumers) command to each worker Pod that we removed an annotation from, so it resumes picking up new airflow tasks
2. Calculate the ideal number of worker `replicas` for the current task load:
     - if the `load factor` of workers is above `A` for `B` time --> increase `replicas` to meet the target `load factor`
     - if the `load factor` of workers is below `X` for `Y` time --> decrease `replicas` to meet the target `load factor`
         - _NOTE: the `load factor` is the fraction of the available task slots which are currently consumed_
         - _NOTE: we should limit the number of scaling decisions within a given time window (to prevent a yo-yo effect), perhaps with separate limits for down and up to allow faster upscaling_
         - _NOTE: we should have a "scaling algorithm" config, even if we only start with 1_
         - _NOTE: we should have `minimum` and `maximum` replicas configs_
         - _NOTE: if using CeleryKubernetesExecutor, we must exclude tasks that are in the `AIRFLOW__CELERY_KUBERNETES_EXECUTOR__KUBERNETES_QUEUE`_
3. If `replicas` are going to be decreased by `N`:
    1. Sort the worker pods by their `pod-deletion-cost` in ascending order
         - _NOTE: the `pod-deletion-cost` is the `number of running tasks`, weighted by the `total running time` of each task (so long-running tasks are not needlessly evicted); specifically, we want a small number of long-running tasks to be weighted higher than a large number of short-running tasks_
         - _NOTE: we could add a DAG/Task label which prevents any worker running it from being killed (or allows a "weighting" per Task)_
    2. Annotate the `N` worker Pods that have the lowest cost with the `controller.kubernetes.io/pod-deletion-cost` annotation
         - _NOTE: if there are pods in a Pending/Unready state, we can reduce `N` by this number, as Kubernetes will remove these pods first_
    3. Send an [`app.control.cancel_consumer(...)`](https://docs.celeryq.dev/en/stable/userguide/workers.html#queues-canceling-consumers) command to each worker Pod that was annotated, so it does not pick up new airflow tasks after being "marked" for deletion
    4. Patch the `replicas` down by `N`
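
The calculation in steps 2 and 3 could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `Worker`, `desired_replicas`, `deletion_cost`, and `pods_to_mark` are hypothetical, the cost is simply the summed running time of a worker's tasks (one possible weighting that ranks a few long-running tasks above many short ones), and the "above `A` for `B` time" hysteresis is left out for brevity.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Worker:
    pod_name: str
    # seconds that each currently running task has been running
    task_runtimes: List[float] = field(default_factory=list)
    ready: bool = True


def desired_replicas(used_slots: int, slots_per_worker: int, target_load_factor: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so that used_slots / total_slots is at most the target load factor."""
    wanted = math.ceil(used_slots / (slots_per_worker * target_load_factor))
    return max(min_replicas, min(max_replicas, wanted))


def deletion_cost(worker: Worker) -> int:
    # ASSUMPTION: summed running time, so one 1-hour task outweighs ten 1-minute tasks
    return int(sum(worker.task_runtimes))


def pods_to_mark(workers: List[Worker], n: int) -> List[str]:
    """Pick the Pods to annotate when `replicas` is about to be decreased by n."""
    # Pending/Unready Pods are removed by Kubernetes first, so they reduce n
    not_ready = sum(1 for w in workers if not w.ready)
    n = max(0, n - not_ready)
    ready = sorted((w for w in workers if w.ready), key=deletion_cost)
    return [w.pod_name for w in ready[:n]]


workers = [
    Worker("worker-a", [3600.0]),      # one long-running task
    Worker("worker-b", [60.0] * 10),   # ten short-running tasks
    Worker("worker-c"),                # idle
    Worker("worker-d", ready=False),   # not yet Ready
]
print(desired_replicas(30, 16, 0.75, min_replicas=1, max_replicas=10))
print(pods_to_mark(workers, 3))
```

With 30 consumed slots, 16 slots per worker, and a target `load factor` of 0.75, the sketch asks for 3 replicas. When scaling down by 3, the unready Pod counts towards `N`, so only the idle worker and the worker with many short tasks are marked, and the worker with the long-running task is kept.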

__Important changes to make this work:__
- We will need to use a Deployment for the workers (rather than a StatefulSet), as `controller.kubernetes.io/pod-deletion-cost` is only for Pods in ReplicaSets
- Because `controller.kubernetes.io/pod-deletion-cost` is __alpha__ in `1.21` and __beta__ in `1.22`, for older Kubernetes versions we can let users use the [CloneSet](https://openkruise.io/en-us/docs/cloneset.html#pod-deletion-cost) from the CNCF project called [OpenKruise](https://openkruise.io/en-us/index.html) (instead of `Deployment`), as they have back-ported the `controller.kubernetes.io/pod-deletion-cost` annotation.

