@singh1203
Contributor

@singh1203 singh1203 commented Aug 27, 2025

Fixes #135

Added Job Order Plugin Support

  • Add a new joborder plugin to capture and expose a stable job scheduling order.
  • Snapshot global and per-queue job order at session startup using scheduler utilities.
  • Provide an HTTP endpoint /get-job-order that returns the job order snapshot in JSON.
  • Add unit tests for the plugin.

I would appreciate any suggestions and guidance here.
Thank you! 🙇

@singh1203 singh1203 changed the title Add job order reflection plugin with HTTP endpoint [WIP] Add job order reflection plugin with HTTP endpoint Aug 27, 2025
Collaborator

@enoodle enoodle left a comment


Looks good, I have two comments:

@github-actions

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins | 0.00% (ø) | |
| github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/joborder | 38.10% (+38.10%) | 🌟 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/factory.go | 0.00% (ø) | 21 (+1) | 0 | 21 (+1) | |
| github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/joborder/job_order.go | 38.10% (+38.10%) | 21 (+21) | 8 (+8) | 13 (+13) | 🌟 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/joborder/job_order_test.go

@itsomri
Collaborator

itsomri commented Aug 28, 2025

Hi Singh,
Thank you very much for your contribution! We appreciate it!
I think the implementation looks good - we added a few comments, mainly about naming.
I tested the plugin in a cluster and it seems to be working as expected.

@singh1203
Contributor Author

> Hi Singh, Thank you very much for your contribution! We appreciate it! I think the implementation looks good - we added a few comments, mainly about naming. I tested the plugin in a cluster and it seems to be working as expected.

Thank you both, @enoodle and @itsomri, for taking the time to review this PR 🙏
I’ll update the PR with the suggested changes (including the naming updates).

Also, @itsomri — could you please share the steps you used to test the plugin in a cluster? I’d like to try it out locally as well to verify the behaviour.

itsomri previously approved these changes Aug 29, 2025
@itsomri
Collaborator

itsomri commented Aug 29, 2025

Hi,

In order to test the plugin:

  1. Install kai-scheduler on a k8s cluster (a kind cluster is enough in this case)
  2. Build the scheduler image (make build from repo root)
  3. Tag and push the scheduler image to any docker registry (if using a local kind environment, you can load the image directly)
  4. Edit the kai-scheduler deployment to use the new image
  5. Edit the scheduler configmap to load the new plugin, since it's not used by default. Make sure to add it as the last plugin in the list:
apiVersion: v1
data:
  config.yaml: |
    actions: allocate, consolidation, reclaim, preempt, stalegangeviction
    tiers:
      - plugins:
          - name: predicates
          - name: proportion
          - name: priority
          - name: nodeavailability
          - name: resourcetype
          - name: podaffinity
          - name: elastic
          - name: kubeflow
          - name: ray
          - name: subgrouporder
          - name: taskorder
          - name: nominatednode
          - name: dynamicresources
          - name: gpupack
          - name: gpusharingorder
          - name: snapshot
          - name: nodeplacement
            arguments:
              cpu: binpack
              gpu: binpack
          - name: minruntime
          - name: topology
          - name: joborder
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: kai-scheduler
    meta.helm.sh/release-namespace: kai-scheduler
  labels:
    app: scheduler
    app.kubernetes.io/managed-by: Helm
  name: scheduler-config
  namespace: kai-scheduler
  6. Restart the scheduler deployment
  7. Run some jobs in different queues
  8. Port forward to the scheduler:
kubectl port-forward -nkai-scheduler deploy/scheduler 8081
  9. Navigate to the web page to see the results: http://localhost:8081/get-jobs
    Example output:
{
  "global_order": [
    {
      "id": "pg-queue-3-batch-job-3-976dx-e0ab34e4-ce31-4517-89dc-770d4efd596a",
      "priority": 50
    },
    {
      "id": "pg-queue-3-batch-job-1-llsxz-26552a7e-f07a-4fc6-a190-61ff7401067c",
      "priority": 50
    },
    {
      "id": "pg-job2-team-b-raycluster-autoscaler-34cf2ca9-8d45-47dd-858d-db716a6e53ba",
      "priority": 50
    },
    {
      "id": "pg-queue-2-batch-job-3-7rkck-2151ce18-74d7-4dc6-8a04-bdaad28931fd",
      "priority": 50
    },
    {
      "id": "pg-queue-2-batch-job-4-29xvl-bc5c56ee-9b50-4175-a852-3b17d3fb2863",
      "priority": 50
    },
    {
      "id": "pg-queue-2-batch-job-1-jzwvq-d2863624-e1ce-4223-b239-a05792377f84",
      "priority": 50
    },
    {
      "id": "pg-queue-2-batch-job-2-pjgbd-9d1ac135-2645-47bb-af7d-de455a4d4559",
      "priority": 50
    },
    {
      "id": "pg-queue-1-batch-job-2-xrgl4-076bd691-a116-4836-bf15-a9db9c222233",
      "priority": 50
    },
    {
      "id": "pg-queue-1-batch-job-3-thb2v-3c8653a5-14eb-4686-8f32-61e80f0449bb",
      "priority": 50
    },
    {
      "id": "pg-queue-1-batch-job-4-6gkxh-c7e55805-1518-45a2-9029-ed0d574a525f",
      "priority": 50
    },
    {
      "id": "pg-queue-1-batch-job-1-zqjld-e98a352b-ca69-4e44-a8e5-a17c934aa946",
      "priority": 50
    }
  ],
  "queue_order": {
    "queue-1": [
      {
        "id": "pg-queue-1-batch-job-2-xrgl4-076bd691-a116-4836-bf15-a9db9c222233",
        "priority": 50
      },
      {
        "id": "pg-queue-1-batch-job-3-thb2v-3c8653a5-14eb-4686-8f32-61e80f0449bb",
        "priority": 50
      },
      {
        "id": "pg-queue-1-batch-job-4-6gkxh-c7e55805-1518-45a2-9029-ed0d574a525f",
        "priority": 50
      },
      {
        "id": "pg-queue-1-batch-job-1-zqjld-e98a352b-ca69-4e44-a8e5-a17c934aa946",
        "priority": 50
      }
    ],
    "queue-2": [
      {
        "id": "pg-job2-team-b-raycluster-autoscaler-34cf2ca9-8d45-47dd-858d-db716a6e53ba",
        "priority": 50
      },
      {
        "id": "pg-queue-2-batch-job-3-7rkck-2151ce18-74d7-4dc6-8a04-bdaad28931fd",
        "priority": 50
      },
      {
        "id": "pg-queue-2-batch-job-4-29xvl-bc5c56ee-9b50-4175-a852-3b17d3fb2863",
        "priority": 50
      },
      {
        "id": "pg-queue-2-batch-job-1-jzwvq-d2863624-e1ce-4223-b239-a05792377f84",
        "priority": 50
      },
      {
        "id": "pg-queue-2-batch-job-2-pjgbd-9d1ac135-2645-47bb-af7d-de455a4d4559",
        "priority": 50
      }
    ],
    "queue-3": [
      {
        "id": "pg-queue-3-batch-job-3-976dx-e0ab34e4-ce31-4517-89dc-770d4efd596a",
        "priority": 50
      },
      {
        "id": "pg-queue-3-batch-job-1-llsxz-26552a7e-f07a-4fc6-a190-61ff7401067c",
        "priority": 50
      }
    ]
  }
}
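
A small sketch of how a client might consume output shaped like the example above. The type and function names (`jobEntry`, `jobOrder`, `parseJobOrder`, `queuesFollowGlobal`) are hypothetical. The check that each per-queue list keeps the same relative order as the global list is an assumption read off this one example, not a documented guarantee of the plugin.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobEntry and jobOrder mirror the field names in the example output.
type jobEntry struct {
	ID       string `json:"id"`
	Priority int    `json:"priority"`
}

type jobOrder struct {
	GlobalOrder []jobEntry            `json:"global_order"`
	QueueOrder  map[string][]jobEntry `json:"queue_order"`
}

// parseJobOrder decodes a response body shaped like the example output.
func parseJobOrder(body []byte) (jobOrder, error) {
	var jo jobOrder
	err := json.Unmarshal(body, &jo)
	return jo, err
}

// queuesFollowGlobal reports whether every per-queue list appears in the
// same relative order as in the global list. A job that is missing from
// the global list also makes the check fail.
func queuesFollowGlobal(jo jobOrder) bool {
	pos := make(map[string]int, len(jo.GlobalOrder))
	for i, j := range jo.GlobalOrder {
		pos[j.ID] = i
	}
	for _, jobs := range jo.QueueOrder {
		last := -1
		for _, j := range jobs {
			p, ok := pos[j.ID]
			if !ok || p <= last {
				return false
			}
			last = p
		}
	}
	return true
}

func main() {
	// Shortened, made-up IDs; in practice the body would come from the
	// port-forwarded scheduler endpoint.
	body := []byte(`{
	  "global_order": [
	    {"id": "a", "priority": 50},
	    {"id": "b", "priority": 50},
	    {"id": "c", "priority": 50}
	  ],
	  "queue_order": {
	    "queue-1": [{"id": "a", "priority": 50}, {"id": "c", "priority": 50}],
	    "queue-2": [{"id": "b", "priority": 50}]
	  }
	}`)

	jo, err := parseJobOrder(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("jobs:", len(jo.GlobalOrder), "queues consistent:", queuesFollowGlobal(jo))
}
```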

@itsomri
Collaborator

itsomri commented Aug 29, 2025

Looks like the order of imports in pkg/scheduler/plugins/factory.go needs to be fixed because of the plugin rename

@singh1203
Contributor Author

singh1203 commented Aug 29, 2025

> Looks like the order of imports in pkg/scheduler/plugins/factory.go needs to be fixed because of the plugin rename

Thank you so much @itsomri for the testing guide steps, and sure, I'm fixing import ordering as well. 👍
Also, please re-run the workflow. 🙏

@singh1203 singh1203 changed the title [WIP] Add job order reflection plugin with HTTP endpoint Add job order reflection plugin with HTTP endpoint Aug 29, 2025
@singh1203 singh1203 requested review from enoodle and itsomri August 29, 2025 11:20
@github-actions

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins | 0.00% (ø) | |
| github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/reflectjoborder | 71.43% (+71.43%) | 🌟 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/factory.go | 0.00% (ø) | 21 (+1) | 0 | 21 (+1) | |
| github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/reflectjoborder/reflect_job_order.go | 71.43% (+71.43%) | 21 (+21) | 15 (+15) | 6 (+6) | 🌟 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/scheduler/plugins/reflectjoborder/reflect_job_order_test.go

@itsomri
Collaborator

itsomri commented Aug 31, 2025

@singh1203 I took the liberty of rebasing and re-running the workflow - feel free to merge the PR when you're ready.

@itsomri itsomri merged commit 97583b7 into NVIDIA:main Aug 31, 2025
4 checks passed
@itsomri
Collaborator

itsomri commented Aug 31, 2025

Merged

@singh1203
Contributor Author

@itsomri Thank you for merging the PR. I did not have permission to merge it myself, so I could not do so.

@singh1203 singh1203 deleted the reflectJobPlugin branch September 5, 2025 14:05