
Conversation

@owenowenisme owenowenisme commented Nov 22, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Part of #6588
Special notes for your reviewer:

1. InterpretComponent

Command:

karmadactl interpret -f customizations.yaml --operation InterpretComponent \
  --observed-file testdata/observed-rayjob.yaml

Result:

components:
- name: ray-head
  replicaRequirements:
    resourceRequest:
      cpu: "2"
      memory: 4Gi
  replicas: 1
- name: small-workers
  replicaRequirements:
    resourceRequest:
      cpu: "1"
      memory: 2Gi
  replicas: 2
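The Lua hook itself is in customizations.yaml and not shown in this thread; as a rough illustration only, here is a Python sketch of the extraction InterpretComponent performs. The function name and dict shapes are hypothetical; field names follow the RayJob manifest used in this PR.

```python
# Illustration only: walk spec.rayClusterSpec and emit one component per
# head/worker group, mirroring the output shown above. Not the actual
# Lua implementation.

def interpret_components(rayjob: dict) -> list[dict]:
    spec = rayjob["spec"]["rayClusterSpec"]
    components = []

    # Head group: a single replica by definition.
    head = spec["headGroupSpec"]
    head_requests = head["template"]["spec"]["containers"][0]["resources"]["requests"]
    components.append({
        "name": "ray-head",  # assumed component name, matching the output above
        "replicas": 1,
        "replicaRequirements": {"resourceRequest": dict(head_requests)},
    })

    # Worker groups: one component per group, named after groupName.
    for group in spec.get("workerGroupSpecs", []):
        requests = group["template"]["spec"]["containers"][0]["resources"]["requests"]
        components.append({
            "name": group["groupName"],
            "replicas": group.get("replicas", 0),
            "replicaRequirements": {"resourceRequest": dict(requests)},
        })
    return components
```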

2. InterpretHealth

Command:

karmadactl interpret -f customizations.yaml --operation InterpretHealth \
  --observed-file testdata/observed-rayjob.yaml

Result:

healthy: true
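As described later in this thread, the health logic treats Running and Complete/SUCCEEDED as healthy and Failed, ValidationFailed, or STOPPED as unhealthy, with other states considered transitional. A rough Python sketch of those rules (the actual hook is Lua; names and exact precedence here are assumptions):

```python
# Illustration only: health rules as described in this PR, not the
# actual Lua implementation.

HEALTHY_DEPLOYMENT_STATES = {"Running", "Complete"}
UNHEALTHY_DEPLOYMENT_STATES = {"Failed", "ValidationFailed"}

def interpret_health(status: dict) -> bool:
    deployment = status.get("jobDeploymentStatus", "")
    job = status.get("jobStatus", "")
    if deployment in UNHEALTHY_DEPLOYMENT_STATES or job == "STOPPED":
        return False
    if deployment in HEALTHY_DEPLOYMENT_STATES or job == "SUCCEEDED":
        return True
    return False  # transitional state: not yet healthy
```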

3. AggregateStatus (Multi-RayJob)

Command:

karmadactl interpret -f customizations.yaml --operation AggregateStatus \
        --desired-file testdata/desired-rayjob.yaml \
        --status-file testdata/status-file.yaml

Result:

---
# [1/1] aggregatedStatus:
apiVersion: ray.io/v1
kind: RayJob
metadata:
    name: sample-rayjob
    namespace: default
spec:
    entrypoint: python /home/ray/samples/sample_code.py
    rayClusterSpec:
        headGroupSpec:
            template:
                spec:
                    containers:
                        - image: rayproject/ray:2.46.0
                          name: ray-head
                          ports:
                            - containerPort: 6379
                              name: gcs-server
                            - containerPort: 8265
                              name: dashboard
                            - containerPort: 10001
                              name: client
                          resources:
                            limits:
                                cpu: "2"
                                memory: 4Gi
                            requests:
                                cpu: "2"
                                memory: 4Gi
        rayVersion: 2.46.0
        workerGroupSpecs:
            - groupName: small-workers
              maxReplicas: 5
              minReplicas: 1
              replicas: 2
              template:
                spec:
                    containers:
                        - image: rayproject/ray:2.46.0
                          name: ray-worker
                          resources:
                            limits:
                                cpu: "1"
                                memory: 2Gi
                            requests:
                                cpu: "1"
                                memory: 2Gi
    shutdownAfterJobFinishes: true
    ttlSecondsAfterFinished: 60
status:
    failed: 0
    jobDeploymentStatus: Running
    jobStatus: RUNNING
    rayClusterStatus:
        availableWorkerReplicas: 5
        conditions:
            - clusterName: member1
              lastTransitionTime: "2025-11-22T10:29:30Z"
              message: ""
              reason: HeadPodRunningAndReady
              status: "True"
              type: HeadPodReady
            - clusterName: member1
              lastTransitionTime: "2025-11-22T10:29:45Z"
              message: All Ray Pods are ready for the first time
              reason: AllPodRunningAndReadyFirstTime
              status: "True"
              type: RayClusterProvisioned
            - clusterName: member2
              lastTransitionTime: "2025-11-22T10:31:10Z"
              message: ""
              reason: HeadPodRunningAndReady
              status: "True"
              type: HeadPodReady
            - clusterName: member2
              lastTransitionTime: "2025-11-22T10:31:25Z"
              message: All Ray Pods are ready for the first time
              reason: AllPodRunningAndReadyFirstTime
              status: "True"
              type: RayClusterProvisioned
            - clusterName: member2
              lastTransitionTime: "2025-11-22T10:30:50Z"
              message: ""
              reason: RayClusterSuspended
              status: "False"
              type: RayClusterSuspended
        desiredCPU: "10"
        desiredGPU: "1"
        desiredMemory: 20Gi
        desiredTPU: "0"
        desiredWorkerReplicas: 5
        maxWorkerReplicas: 15
        minWorkerReplicas: 3
        readyWorkerReplicas: 5
        state: ready
    startTime: "2025-11-22T10:30:00Z"
    succeeded: 0
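The aggregation strategies applied above are described later in this thread: worst state wins for jobDeploymentStatus, first non-nil for identifiers such as jobId, summing for succeeded/failed, and earliest for startTime. A Python sketch of those rules (illustration only; the severity ranking is an assumption, and the real hook is Lua):

```python
# Illustration only: status aggregation rules as described in this PR.
# The severity ordering below is an assumed ranking, worst last.

DEPLOYMENT_SEVERITY = ["Complete", "Running", "Initializing", "Failed"]

def aggregate_status(statuses: list[dict]) -> dict:
    out: dict = {}
    # Worst state wins for jobDeploymentStatus.
    states = [s["jobDeploymentStatus"] for s in statuses if s.get("jobDeploymentStatus")]
    if states:
        out["jobDeploymentStatus"] = max(
            states,
            key=lambda st: DEPLOYMENT_SEVERITY.index(st) if st in DEPLOYMENT_SEVERITY else 0,
        )
    # First non-nil wins for identifiers such as jobId.
    for s in statuses:
        if s.get("jobId"):
            out["jobId"] = s["jobId"]
            break
    # Counters are summed across clusters.
    out["succeeded"] = sum(s.get("succeeded", 0) for s in statuses)
    out["failed"] = sum(s.get("failed", 0) for s in statuses)
    # Earliest startTime wins (RFC 3339 strings sort chronologically).
    times = [s["startTime"] for s in statuses if s.get("startTime")]
    if times:
        out["startTime"] = min(times)
    return out
```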

Does this PR introduce a user-facing change?:


Copilot AI review requested due to automatic review settings November 22, 2025 17:04
@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 22, 2025
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yike21 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 22, 2025
@gemini-code-assist

Summary of Changes

Hello @owenowenisme, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a custom interpreter for RayJob resources, significantly enhancing Karmada's ability to manage and observe distributed Ray applications across multiple clusters. It provides robust mechanisms for understanding the resource footprint of Ray components, accurately assessing the health of Ray jobs, and aggregating their status from various deployment locations into a unified view.

Highlights

  • RayJob Custom Interpreter Added: A new ResourceInterpreterCustomization has been added for ray.io/v1/RayJob resources, enabling Karmada to understand and manage Ray workloads.
  • Component Resource Interpretation: Implemented InterpretComponent using Lua to accurately extract resource requirements (CPU, memory, replicas) for both head and worker groups defined within a RayJob's rayClusterSpec.
  • Health Interpretation Logic: Implemented InterpretHealth using Lua to determine the health status of a RayJob. This logic considers various jobDeploymentStatus and jobStatus values, marking Running or Complete/SUCCEEDED as healthy, and Failed, ValidationFailed, or STOPPED as unhealthy, with other states considered transitional.
  • Status Aggregation Strategy: Implemented AggregateStatus using Lua to combine status from multiple RayJob instances across different clusters. It applies specific aggregation strategies, such as 'worst state wins' for jobDeploymentStatus, 'first non-nil' for identifiers like jobId, summing for succeeded/failed counts, and taking the earliest for startTime.
  • Comprehensive Test Data and Definitions: Included new test data and test definitions (customizations_tests.yaml, desired-rayjob.yaml, observed-rayjob.yaml, status-file.yaml) to thoroughly validate the functionality of the new RayJob interpreter across all defined operations.

Copilot AI left a comment

Pull request overview

This PR adds a custom resource interpreter for RayJob (ray.io/v1) to enable Karmada to understand and manage Ray distributed computing jobs across multiple clusters. The implementation provides component resource extraction, health checking, and status aggregation capabilities for RayJob workloads.

Key changes:

  • Custom interpreter implementation with Lua scripts for component, health, and status operations
  • Test data files demonstrating RayJob structure and multi-cluster scenarios
  • Test configuration to validate the interpreter operations

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

customizations.yaml: Implements Lua scripts for component extraction, health interpretation, and status aggregation for RayJob resources
customizations_tests.yaml: Defines test cases for the three interpreter operations
observed-rayjob.yaml: Test data representing an observed RayJob instance with status
desired-rayjob.yaml: Test data representing the desired RayJob specification
status-file.yaml: Test data with status information from multiple clusters for aggregation testing


@gemini-code-assist bot left a comment

Code Review

This pull request introduces a custom resource interpreter for RayJob resources, including Lua scripts for component discovery, health assessment, and status aggregation. The overall implementation is solid and well-tested. I have two main suggestions for improvement. First, the health interpretation logic can be simplified for better readability and conciseness. Second, and more critically, the status aggregation for rayClusterStatus should sum up resource metrics from all member clusters instead of just taking the values from the first one, which would provide a more accurate representation of the total resource state. I've provided specific suggestions for these changes.

@codecov-commenter

codecov-commenter commented Nov 22, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.46%. Comparing base (1d5f925) to head (0230887).
⚠️ Report is 4 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6947      +/-   ##
==========================================
+ Coverage   46.45%   46.46%   +0.01%     
==========================================
  Files         698      698              
  Lines       47809    47824      +15     
==========================================
+ Hits        22208    22222      +14     
- Misses      23930    23932       +2     
+ Partials     1671     1670       -1     
Flag Coverage Δ
unittests 46.46% <ø> (+0.01%) ⬆️


@owenowenisme owenowenisme force-pushed the lfx/add-interpret-for-ray-job branch from ed2fcb2 to 4a9fa77 Compare November 22, 2025 17:47
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
@owenowenisme owenowenisme force-pushed the lfx/add-interpret-for-ray-job branch from 4a9fa77 to 0230887 Compare November 23, 2025 05:58