
Conversation

@owenowenisme owenowenisme commented Nov 22, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Part of #6588
Special notes for your reviewer:

1. InterpretComponent

Command:

karmadactl interpret -f customizations.yaml --operation InterpretComponent \
  --observed-file testdata/observed-rayjob.yaml

Result:

components:
- name: ray-head
  replicaRequirements:
    resourceRequest:
      cpu: "2"
      memory: 4Gi
  replicas: 1
- name: small-workers
  replicaRequirements:
    resourceRequest:
      cpu: "1"
      memory: 2Gi
  replicas: 2
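The Lua hook itself is in customizations.yaml and not shown in this thread; as a rough illustration only, here is a Python sketch of the extraction InterpretComponent performs. The function name and dict shapes are hypothetical; field names follow the RayJob manifest used in this PR.

```python
# Illustration only: walk spec.rayClusterSpec and emit one component per
# head/worker group, mirroring the output shown above. Not the actual
# Lua implementation.

def interpret_components(rayjob: dict) -> list[dict]:
    spec = rayjob["spec"]["rayClusterSpec"]
    components = []

    # Head group: a single replica by definition.
    head = spec["headGroupSpec"]
    head_requests = head["template"]["spec"]["containers"][0]["resources"]["requests"]
    components.append({
        "name": "ray-head",  # assumed component name, matching the output above
        "replicas": 1,
        "replicaRequirements": {"resourceRequest": dict(head_requests)},
    })

    # Worker groups: one component per group, named after groupName.
    for group in spec.get("workerGroupSpecs", []):
        requests = group["template"]["spec"]["containers"][0]["resources"]["requests"]
        components.append({
            "name": group["groupName"],
            "replicas": group.get("replicas", 0),
            "replicaRequirements": {"resourceRequest": dict(requests)},
        })
    return components
```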

2. InterpretHealth

Command:

karmadactl interpret -f customizations.yaml --operation InterpretHealth \
  --observed-file testdata/observed-rayjob.yaml

Result:

healthy: true
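As described later in this thread, the health logic treats Running and Complete/SUCCEEDED as healthy and Failed, ValidationFailed, or STOPPED as unhealthy, with other states considered transitional. A rough Python sketch of those rules (the actual hook is Lua; names and exact precedence here are assumptions):

```python
# Illustration only: health rules as described in this PR, not the
# actual Lua implementation.

HEALTHY_DEPLOYMENT_STATES = {"Running", "Complete"}
UNHEALTHY_DEPLOYMENT_STATES = {"Failed", "ValidationFailed"}

def interpret_health(status: dict) -> bool:
    deployment = status.get("jobDeploymentStatus", "")
    job = status.get("jobStatus", "")
    if deployment in UNHEALTHY_DEPLOYMENT_STATES or job == "STOPPED":
        return False
    if deployment in HEALTHY_DEPLOYMENT_STATES or job == "SUCCEEDED":
        return True
    return False  # transitional state: not yet healthy
```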

3. AggregateStatus (Multi-RayJob)

Command:

karmadactl interpret -f customizations.yaml --operation AggregateStatus \
        --desired-file testdata/desired-rayjob.yaml \
        --status-file testdata/status-file.yaml

Result:

---
# [1/1] aggregatedStatus:
apiVersion: ray.io/v1
kind: RayJob
metadata:
    name: sample-rayjob
    namespace: default
spec:
    entrypoint: python /home/ray/samples/sample_code.py
    rayClusterSpec:
        headGroupSpec:
            template:
                spec:
                    containers:
                        - image: rayproject/ray:2.46.0
                          name: ray-head
                          ports:
                            - containerPort: 6379
                              name: gcs-server
                            - containerPort: 8265
                              name: dashboard
                            - containerPort: 10001
                              name: client
                          resources:
                            limits:
                                cpu: "2"
                                memory: 4Gi
                            requests:
                                cpu: "2"
                                memory: 4Gi
        rayVersion: 2.46.0
        workerGroupSpecs:
            - groupName: small-workers
              maxReplicas: 5
              minReplicas: 1
              replicas: 2
              template:
                spec:
                    containers:
                        - image: rayproject/ray:2.46.0
                          name: ray-worker
                          resources:
                            limits:
                                cpu: "1"
                                memory: 2Gi
                            requests:
                                cpu: "1"
                                memory: 2Gi
    shutdownAfterJobFinishes: true
    ttlSecondsAfterFinished: 60
status:
    failed: 0
    jobDeploymentStatus: Running
    jobStatus: RUNNING
    rayClusterStatus:
        availableWorkerReplicas: 5
        conditions:
            - clusterName: member1
              lastTransitionTime: "2025-11-22T10:29:30Z"
              message: ""
              reason: HeadPodRunningAndReady
              status: "True"
              type: HeadPodReady
            - clusterName: member1
              lastTransitionTime: "2025-11-22T10:29:45Z"
              message: All Ray Pods are ready for the first time
              reason: AllPodRunningAndReadyFirstTime
              status: "True"
              type: RayClusterProvisioned
            - clusterName: member2
              lastTransitionTime: "2025-11-22T10:31:10Z"
              message: ""
              reason: HeadPodRunningAndReady
              status: "True"
              type: HeadPodReady
            - clusterName: member2
              lastTransitionTime: "2025-11-22T10:31:25Z"
              message: All Ray Pods are ready for the first time
              reason: AllPodRunningAndReadyFirstTime
              status: "True"
              type: RayClusterProvisioned
            - clusterName: member2
              lastTransitionTime: "2025-11-22T10:30:50Z"
              message: ""
              reason: RayClusterSuspended
              status: "False"
              type: RayClusterSuspended
        desiredCPU: "10"
        desiredGPU: "1"
        desiredMemory: 20Gi
        desiredTPU: "0"
        desiredWorkerReplicas: 5
        maxWorkerReplicas: 15
        minWorkerReplicas: 3
        readyWorkerReplicas: 5
        state: ready
    startTime: "2025-11-22T10:30:00Z"
    succeeded: 0
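The aggregation strategies applied above are described later in this thread: worst state wins for jobDeploymentStatus, first non-nil for identifiers such as jobId, summing for succeeded/failed, and earliest for startTime. A Python sketch of those rules (illustration only; the severity ranking is an assumption, and the real hook is Lua):

```python
# Illustration only: status aggregation rules as described in this PR.
# The severity ordering below is an assumed ranking, worst last.

DEPLOYMENT_SEVERITY = ["Complete", "Running", "Initializing", "Failed"]

def aggregate_status(statuses: list[dict]) -> dict:
    out: dict = {}
    # Worst state wins for jobDeploymentStatus.
    states = [s["jobDeploymentStatus"] for s in statuses if s.get("jobDeploymentStatus")]
    if states:
        out["jobDeploymentStatus"] = max(
            states,
            key=lambda st: DEPLOYMENT_SEVERITY.index(st) if st in DEPLOYMENT_SEVERITY else 0,
        )
    # First non-nil wins for identifiers such as jobId.
    for s in statuses:
        if s.get("jobId"):
            out["jobId"] = s["jobId"]
            break
    # Counters are summed across clusters.
    out["succeeded"] = sum(s.get("succeeded", 0) for s in statuses)
    out["failed"] = sum(s.get("failed", 0) for s in statuses)
    # Earliest startTime wins (RFC 3339 strings sort chronologically).
    times = [s["startTime"] for s in statuses if s.get("startTime")]
    if times:
        out["startTime"] = min(times)
    return out
```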

Does this PR introduce a user-facing change?:


Copilot AI review requested due to automatic review settings November 22, 2025 17:04
@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 22, 2025
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yike21 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 22, 2025
@gemini-code-assist

Summary of Changes

Hello @owenowenisme, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a custom interpreter for RayJob resources, significantly enhancing Karmada's ability to manage and observe distributed Ray applications across multiple clusters. It provides robust mechanisms for understanding the resource footprint of Ray components, accurately assessing the health of Ray jobs, and aggregating their status from various deployment locations into a unified view.

Highlights

  • RayJob Custom Interpreter Added: A new ResourceInterpreterCustomization has been added for ray.io/v1/RayJob resources, enabling Karmada to understand and manage Ray workloads.
  • Component Resource Interpretation: Implemented InterpretComponent using Lua to accurately extract resource requirements (CPU, memory, replicas) for both head and worker groups defined within a RayJob's rayClusterSpec.
  • Health Interpretation Logic: Implemented InterpretHealth using Lua to determine the health status of a RayJob. This logic considers various jobDeploymentStatus and jobStatus values, marking Running or Complete/SUCCEEDED as healthy, and Failed, ValidationFailed, or STOPPED as unhealthy, with other states considered transitional.
  • Status Aggregation Strategy: Implemented AggregateStatus using Lua to combine status from multiple RayJob instances across different clusters. It applies specific aggregation strategies, such as 'worst state wins' for jobDeploymentStatus, 'first non-nil' for identifiers like jobId, summing for succeeded/failed counts, and taking the earliest for startTime.
  • Comprehensive Test Data and Definitions: Included new test data and test definitions (customizations_tests.yaml, desired-rayjob.yaml, observed-rayjob.yaml, status-file.yaml) to thoroughly validate the functionality of the new RayJob interpreter across all defined operations.

Copilot AI left a comment

Pull request overview

This PR adds a custom resource interpreter for RayJob (ray.io/v1) to enable Karmada to understand and manage Ray distributed computing jobs across multiple clusters. The implementation provides component resource extraction, health checking, and status aggregation capabilities for RayJob workloads.

Key changes:

  • Custom interpreter implementation with Lua scripts for component, health, and status operations
  • Test data files demonstrating RayJob structure and multi-cluster scenarios
  • Test configuration to validate the interpreter operations

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

customizations.yaml: Implements Lua scripts for component extraction, health interpretation, and status aggregation for RayJob resources
customizations_tests.yaml: Defines test cases for the three interpreter operations
observed-rayjob.yaml: Test data representing an observed RayJob instance with status
desired-rayjob.yaml: Test data representing the desired RayJob specification
status-file.yaml: Test data with status information from multiple clusters for aggregation testing


@gemini-code-assist bot left a comment

Code Review

This pull request introduces a custom resource interpreter for RayJob resources, including Lua scripts for component discovery, health assessment, and status aggregation. The overall implementation is solid and well-tested. I have two main suggestions for improvement. First, the health interpretation logic can be simplified for better readability and conciseness. Second, and more critically, the status aggregation for rayClusterStatus should sum up resource metrics from all member clusters instead of just taking the values from the first one, which would provide a more accurate representation of the total resource state. I've provided specific suggestions for these changes.

@codecov-commenter

codecov-commenter commented Nov 22, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.46%. Comparing base (1d5f925) to head (0230887).
⚠️ Report is 4 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6947      +/-   ##
==========================================
+ Coverage   46.45%   46.46%   +0.01%     
==========================================
  Files         698      698              
  Lines       47809    47824      +15     
==========================================
+ Hits        22208    22222      +14     
- Misses      23930    23932       +2     
+ Partials     1671     1670       -1     
Flag Coverage Δ
unittests 46.46% <ø> (+0.01%) ⬆️


@owenowenisme owenowenisme force-pushed the lfx/add-interpret-for-ray-job branch from ed2fcb2 to 4a9fa77 Compare November 22, 2025 17:47
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
@owenowenisme owenowenisme force-pushed the lfx/add-interpret-for-ray-job branch from 4a9fa77 to 0230887 Compare November 23, 2025 05:58