Why does kai-scheduler deploy multiple controllers instead of consolidating them into a single controller? #658

@googs1025

Description

➜  dynamo helm upgrade -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler -n kai-scheduler --create-namespace --version v0.10.0

Release "kai-scheduler" does not exist. Installing it now.
Pulled: ghcr.io/nvidia/kai-scheduler/kai-scheduler:v0.10.0
Digest: sha256:d81ec1236acbe7d6cdb6c9e8f3986ce46f8c08d27cabe6b4e586fe0138d27755
NAME: kai-scheduler
LAST DEPLOYED: Wed Nov 19 19:00:30 2025
NAMESPACE: kai-scheduler
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None

While deploying the kai-scheduler Helm chart, I noticed that it deploys several distinct controller components within the kai-scheduler namespace:

➜  dynamo kubectl get pods -nkai-scheduler
NAME                                     READY   STATUS    RESTARTS   AGE
admission-7786b67c67-bwl8d               1/1     Running   0          19s
binder-665c5f6f7d-xf4t8                  1/1     Running   0          18s
kai-operator-6c7598cd96-5hk6v            1/1     Running   0          27s
kai-scheduler-default-7b9fbfbc97-vsftq   1/1     Running   0          19s
pod-grouper-5db6d945b7-xtpkt             1/1     Running   0          19s
podgroup-controller-5fc6cbc67c-mrtnw     1/1     Running   0          18s
queue-controller-89fd4f965-6q7x6         1/1     Running   0          18s

Question

Given that these components are all part of the same system (kai-scheduler), why are they split into seven separate pods rather than consolidated into a single controller?

Current Concerns:

  • Operational Complexity: Managing multiple controllers increases operational overhead. Each component needs its own monitoring, logging, and debugging setup.
  • Troubleshooting Difficulty: When an issue arises, it's more challenging to pinpoint which controller is at fault. Logs are spread across multiple pods, making it harder to correlate events.
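
To illustrate the log-correlation concern, here is a rough sketch of what gathering one time-ordered view across these components might look like today. The function names `collect_logs` and `sort_by_timestamp` and the 10-minute window are my own placeholders, not part of kai-scheduler. It assumes `kubectl logs --prefix --timestamps` puts the timestamp in the second whitespace-separated field.

```shell
#!/usr/bin/env bash
# Sketch only: merge recent logs from every Deployment in the namespace
# into a single stream ordered by timestamp.
NS="${NS:-kai-scheduler}"

# Order "[pod/name/container] <timestamp> <message>" lines by the timestamp
# in field 2. Lexical order is assumed to match time order for these stamps.
sort_by_timestamp() {
  LC_ALL=C sort -s -k2,2
}

# Print recent logs for each Deployment, prefixed with the pod and container
# name. Note: "kubectl logs deployment/<name>" reads from one pod of that
# Deployment, which is enough here because each runs a single replica.
collect_logs() {
  local d
  for d in $(kubectl get deployments -n "$NS" -o name); do
    kubectl logs -n "$NS" "$d" --all-containers --prefix --timestamps --since=10m
  done
}

# Usage (requires cluster access):
#   collect_logs | sort_by_timestamp | less
```

Even with a helper like this, every investigation starts with stitching seven log streams together, which is the overhead I am asking about.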

Suggestion

Would it be possible to consolidate these controllers into a single controller pod? This could simplify deployment, reduce operational complexity, and make troubleshooting easier.
