Conversation

@pokerfaceSad
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:
Add resource interpreter for PytorchJob
Which issue(s) this PR fixes:

Part of #6586

Does this PR introduce a user-facing change?:

`karmada-controller-manager`: Introduced built-in interpreter for PyTorchJob.

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 13, 2025
@karmada-bot
Collaborator

Welcome @pokerfaceSad! It looks like this is your first PR to karmada-io/karmada 🎉

@gemini-code-assist

Summary of Changes

Hello @pokerfaceSad, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Karmada's capabilities by integrating a dedicated resource interpreter for PyTorchJob objects. This interpreter, implemented via Lua scripts, enables Karmada to accurately extract resource requirements, reflect individual cluster statuses, aggregate overall job status across multiple clusters, and determine the health of distributed PyTorch training jobs. This is a crucial step towards robust management of AI/ML workloads in a multi-cluster environment.

Highlights

  • PyTorchJob Interpreter Added: Introduces a new resource interpreter for kubeflow.org/v1/PyTorchJob resources, enabling Karmada to understand and manage these objects.
  • Lua Scripted Customizations: Defines ResourceInterpreterCustomization with Lua scripts for componentResource, statusReflection, statusAggregation, and healthInterpretation.
  • Component Resource Extraction: The GetComponents Lua script extracts resource requirements (replicas, CPU/memory) for both Master and Worker replicas of a PyTorchJob.
  • Status Reflection and Aggregation: The ReflectStatus script copies essential status fields, while AggregateStatus combines status from multiple member clusters, including time fields, replica statuses, and consolidated conditions (e.g., Succeeded/Failed).
  • Health Interpretation: The InterpretHealth script determines the health of a PyTorchJob by checking its conditions, specifically if a 'Failed' condition is present.
  • Comprehensive Test Data: Includes new test YAML files (customizations_tests.yaml, desired-pytorchjob.yaml, observed-pytorchjob.yaml, status-file.yaml) to validate the interpreter's functionality across various operations.

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 13, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a resource interpreter for PyTorchJob, which is a great addition. My review focuses on improving the correctness and robustness of the Lua scripts in customizations.yaml. I've identified a few issues, including incorrect logic for handling optional components, a bug in status aggregation that could lead to data loss, and some unused variables that can be cleaned up. Applying these suggestions will make the interpreter more reliable.

@codecov-commenter

codecov-commenter commented Oct 13, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.46%. Comparing base (9f3ca5c) to head (b7c8fad).
⚠️ Report is 145 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6826      +/-   ##
==========================================
+ Coverage   45.88%   46.46%   +0.58%     
==========================================
  Files         690      698       +8     
  Lines       57392    47824    -9568     
==========================================
- Hits        26333    22222    -4111     
+ Misses      29423    23931    -5492     
- Partials     1636     1671      +35     
Flag: unittests — Coverage Δ: 46.46% <ø> (+0.58%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@pokerfaceSad pokerfaceSad force-pushed the pytorchjobs_interpreter branch 4 times, most recently from cd0c284 to 744957a Compare October 14, 2025 07:45
Member

@RainbowMango RainbowMango left a comment

/assign

Sorry for letting this sit. Working on it.

@RainbowMango
Member

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a resource interpreter for PytorchJob, which is a great addition. The implementation covers component extraction, status reflection, status aggregation, and health interpretation using Lua scripts. The overall structure and logic are well-implemented. However, I've identified a significant issue in the statusAggregation logic where numeric fields with a value of 0 are incorrectly omitted from the aggregated status. This could lead to incomplete status information, particularly for replica counts. I have provided a specific comment with a suggested fix for this issue. The rest of the implementation, including the test files, appears to be correct and well-structured.
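The zero-value pitfall flagged here is worth illustrating: an aggregation that only writes a replica count when it is greater than zero silently drops legitimate 0 values. A minimal Lua sketch of an aggregation helper that preserves zeros — the function and field names are illustrative, not the PR's exact code:

```lua
-- Sum a numeric status field across member clusters.
-- Write the field whenever the member reported it, even when it is 0;
-- only skip it when the member status omitted the field entirely (nil).
local function sumField(total, value)
  if value == nil then
    return total
  end
  return (total or 0) + value
end

-- Aggregate replicaStatuses from a list of per-cluster status items.
local function aggregateReplicaStatuses(statusItems)
  local agg = {}
  for _, item in ipairs(statusItems) do
    for role, rs in pairs(item.replicaStatuses or {}) do
      agg[role] = agg[role] or {}
      agg[role].active = sumField(agg[role].active, rs.active)
      agg[role].succeeded = sumField(agg[role].succeeded, rs.succeeded)
      agg[role].failed = sumField(agg[role].failed, rs.failed)
    end
  end
  return agg
end
```

Testing against `nil` rather than truthiness-with-a-threshold is the key point: a cluster reporting `active: 0` still contributes that field to the aggregated status.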

Member

@RainbowMango RainbowMango left a comment

@pokerfaceSad I just realized that the pytorch-operator now has been archived and the effort has been merged into https://github.com/kubeflow/training-operator.

So there is no PytorchJob anymore, is there? Can you help confirm this?

@pokerfaceSad
Contributor Author

pokerfaceSad commented Nov 16, 2025

@pokerfaceSad I just realized that the pytorch-operator now has been archived and the effort has been merged into https://github.com/kubeflow/training-operator.

So there is no PytorchJob anymore, is there? Can you help confirm this?

@RainbowMango
Sorry for the late reply.

Kubeflow does not use separate CRDs for each framework in v2, so there is no PytorchJob in Kubeflow v2.
Instead, it implements all functionality within a single TrainJob CRD, and AI practitioners should use the Kubeflow Python SDK to convert training code into a TrainJob in Kubeflow v2.

However, the latest PytorchJob definition can be found in legacy Kubeflow v1.9.3.

@XiShanYongYe-Chang
Member

/assign

@RainbowMango
Member

@pokerfaceSad
Thanks for your confirmation.
After going through the current state of PyTorchJob in the Kubeflow ecosystem, we recognize that while PyTorchJob is indeed in a semi-deprecated state (maintained only in the release-1.9 branch without active development), it remains a mature workload that's still widely adopted in the industry. Many users continue to rely on PyTorchJob for their ML workloads.

Given this reality, we believe it would be valuable to provide default support in Karmada to help these users transition from single-cluster to multi-cluster environments seamlessly.

However, to ensure compatibility and maintainability, we need to base our PyTorchJob support on the official API definition from the release-1.9 branch of the kubeflow/trainer repository. This will ensure we're aligned with the stable version that users are actually deploying.

Could you please confirm that your implementation is based on the API spec from the release-1.9 branch? If any adjustments are needed to align with that specific version, we'd appreciate you making those updates.

@pokerfaceSad
Contributor Author

@pokerfaceSad Thanks for your confirmation. [...] Could you please confirm that your implementation is based on the API spec from the release-1.9 branch? If any adjustments are needed to align with that specific version, we'd appreciate you making those updates.

OK, I will confirm it :)

@@ -0,0 +1,33 @@
apiVersion: "kubeflow.org/v1"
Member

There may be no need for the double quotation marks; I checked other resources and found none use them.

Contributor Author

Done.

restartPolicy: OnFailure
template:
spec:
containers:
Member

Can we also define ResourceRequirements in the container? This way, when testing InterpretComponent, we can also cover the testing of ReplicaRequirements.
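Once ResourceRequirements are defined in the test container, the interpreter can derive per-replica demand from the pod template. A hedged Lua sketch of what such a helper might look like — the function name, and the simplification of reading only the first container, are assumptions for illustration, not the PR's actual script:

```lua
-- Hypothetical helper: derive per-replica resource requirements from the
-- first container of a replica's pod template.
local function getReplicaRequirements(replicaSpec)
  local containers = replicaSpec.template
      and replicaSpec.template.spec
      and replicaSpec.template.spec.containers
  if containers == nil or #containers == 0 then
    return nil
  end
  local resources = containers[1].resources or {}
  -- Prefer limits, fall back to requests, mirroring the test data's shape.
  return resources.limits or resources.requests
end
```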

Contributor Author

OK, I will add it.

Contributor Author

Done.


-- Master Component
local master_spec = get(observedObj, {"spec", "pytorchReplicaSpecs", "Master"})
if master_spec == nil then
Member

A relatively simple question: if there is no Master defined in the PyTorchJob, would that also mean Workers are not defined? Or, even if they are defined, would it be meaningless?

Contributor Author

Thanks for your reminder.

It is valid even when no Master is defined in a PyTorchJob. I will modify the logic here.

Contributor Author

Done.
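The fixed behavior under discussion — emitting a component only for roles that are actually present, so a worker-only job is still handled — can be sketched roughly as follows. The component shape and names are illustrative assumptions, not the merged script verbatim:

```lua
-- Illustrative GetComponents-style sketch: iterate over the known roles
-- and skip any that are absent, so a PyTorchJob without a Master is valid.
local function getComponents(observedObj)
  local components = {}
  local specs = (observedObj.spec and observedObj.spec.pytorchReplicaSpecs) or {}
  for _, role in ipairs({"Master", "Worker"}) do
    local spec = specs[role]
    if spec ~= nil then
      table.insert(components, {
        name = string.lower(role),
        replicas = spec.replicas or 1,  -- replicas defaults to 1 when unset
      })
    end
  end
  return components
end
```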

Comment on lines 91 to 97
-- Copy basic PyTorchJob status fields
status.conditions = observedObj.status.conditions
status.replicaStatuses = observedObj.status.replicaStatuses
status.startTime = observedObj.status.startTime
status.completionTime = observedObj.status.completionTime
status.lastReconcileTime = observedObj.status.lastReconcileTime
return status
Member

If we collect all fields in the status, can we skip implementing this hook point, since our default behavior already does that? WDYT

Contributor Author

You are right; I will remove this hook point implementation.

Contributor Author

Done.

@@ -0,0 +1,29 @@
status:
Member

We may need to add multiple statuses in the current file to test the status aggregation feature.

Contributor Author

Done.

Comment on lines +255 to +238
if condition.type == "Failed" and condition.status == "True" then
return false
Member

If there is no condition with type Failed, the final interpretation result is also true, right?

Contributor Author

Yes, so we will return true in L259

Member

Have you considered the situation where the condition has not yet been filled?

Contributor Author

Have you considered the situation where the condition has not yet been filled?

Will observedObj.status.conditions be nil in this situation?

We will return false in L249.
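The health logic being discussed can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's verbatim script; per the exchange above, a status whose conditions have not yet been filled is treated as not healthy:

```lua
-- Illustrative InterpretHealth-style sketch for a PyTorchJob:
-- unhealthy if any Failed condition is True, and not yet healthy
-- while the conditions list is still unset.
local function interpretHealth(observedObj)
  local conditions = observedObj.status and observedObj.status.conditions
  if conditions == nil then
    -- Conditions not filled yet: the job cannot be considered healthy.
    return false
  end
  for _, condition in ipairs(conditions) do
    if condition.type == "Failed" and condition.status == "True" then
      return false
    end
  end
  return true
end
```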

Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Thanks a lot~
In addition, I would like to ask whether it is necessary to handle the dependencyInterpretation hook point for the current resource.

resources:
limits:
cpu: 1
memory: 512Mi
\ No newline at end of file
Member

Can you help add an empty line at the end of the file?

Contributor Author

Done.

@pokerfaceSad
Contributor Author

Thanks a lot~ In addition, I would like to ask whether it is necessary to handle the dependencyInterpretation hook point for the current resource.

I'm not sure if the default behavior already handles it.

I've added the implementation for dependencyInterpretation. Please take a look.
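As a hedged illustration of what a dependencyInterpretation hook typically does — walking each replica's pod template and collecting referenced ConfigMaps and Secrets — here is a sketch; the traversal and names are assumptions for illustration, not the PR's actual implementation:

```lua
-- Illustrative GetDependencies-style sketch: collect ConfigMap/Secret
-- references from the volumes of every replica's pod template.
local function getDependencies(desiredObj)
  local deps = {}
  local specs = (desiredObj.spec and desiredObj.spec.pytorchReplicaSpecs) or {}
  for _, replicaSpec in pairs(specs) do
    local podSpec = replicaSpec.template and replicaSpec.template.spec
    if podSpec ~= nil then
      for _, volume in ipairs(podSpec.volumes or {}) do
        if volume.configMap ~= nil then
          table.insert(deps, {apiVersion = "v1", kind = "ConfigMap",
                              name = volume.configMap.name,
                              namespace = desiredObj.metadata.namespace})
        elseif volume.secret ~= nil then
          table.insert(deps, {apiVersion = "v1", kind = "Secret",
                              name = volume.secret.secretName,
                              namespace = desiredObj.metadata.namespace})
        end
      end
    end
  end
  return deps
end
```

A fuller version would also cover envFrom, env valueFrom references, imagePullSecrets, and serviceAccountName, which is why the default behavior may or may not suffice.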

Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Thanks, generally looks good to me, can you help squash the commits into one?

Signed-off-by: Xinyuan Lyu <[email protected]>

remove unused code

Signed-off-by: Xinyuan Lyu <[email protected]>

remove unused func

Signed-off-by: Xinyuan Lyu <[email protected]>

fmt code

Signed-off-by: Xinyuan Lyu <[email protected]>

Handle cases where Master is undefined, and remove the statusReflection hook point.

Signed-off-by: Xinyuan Lyu <[email protected]>

Add multiple statuses and ResourceRequirements in test files.

Signed-off-by: Xinyuan Lyu <[email protected]>

Remove InterpretStatus in test files.

Signed-off-by: Xinyuan Lyu <[email protected]>

Add dependencyInterpretation and test case.

Signed-off-by: Xinyuan Lyu <[email protected]>
@pokerfaceSad pokerfaceSad force-pushed the pytorchjobs_interpreter branch from 9dd6dce to b7c8fad Compare November 24, 2025 02:57
@pokerfaceSad
Contributor Author

Thanks, generally looks good to me, can you help squash the commits into one?

Done.

Thanks !

@XiShanYongYe-Chang
Member

/lgtm
Would you mind taking another look? @RainbowMango

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 24, 2025
@XiShanYongYe-Chang
Member

/retest

Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Let me merge it first.
/approve

@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: XiShanYongYe-Chang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 24, 2025
@karmada-bot karmada-bot merged commit a0dbc48 into karmada-io:master Nov 24, 2025
26 of 27 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
