Conversation

@pokerfaceSad
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:
Add resource interpreter for PytorchJob
Which issue(s) this PR fixes:

Part of #6586

Does this PR introduce a user-facing change?:

`karmada-controller-manager`: Introduced built-in interpreter for PyTorchJob.

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 13, 2025
@karmada-bot
Collaborator

Welcome @pokerfaceSad! It looks like this is your first PR to karmada-io/karmada 🎉

@gemini-code-assist

Summary of Changes

Hello @pokerfaceSad, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Karmada's capabilities by integrating a dedicated resource interpreter for PyTorchJob objects. This interpreter, implemented via Lua scripts, enables Karmada to accurately extract resource requirements, reflect individual cluster statuses, aggregate overall job status across multiple clusters, and determine the health of distributed PyTorch training jobs. This is a crucial step towards robust management of AI/ML workloads in a multi-cluster environment.

Highlights

  • PyTorchJob Interpreter Added: Introduces a new resource interpreter for kubeflow.org/v1/PyTorchJob resources, enabling Karmada to understand and manage these objects.
  • Lua Scripted Customizations: Defines ResourceInterpreterCustomization with Lua scripts for componentResource, statusReflection, statusAggregation, and healthInterpretation.
  • Component Resource Extraction: The GetComponents Lua script extracts resource requirements (replicas, CPU/memory) for both Master and Worker replicas of a PyTorchJob.
  • Status Reflection and Aggregation: The ReflectStatus script copies essential status fields, while AggregateStatus combines status from multiple member clusters, including time fields, replica statuses, and consolidated conditions (e.g., Succeeded/Failed).
  • Health Interpretation: The InterpretHealth script determines the health of a PyTorchJob by checking its conditions, specifically if a 'Failed' condition is present.
  • Comprehensive Test Data: Includes new test YAML files (customizations_tests.yaml, desired-pytorchjob.yaml, observed-pytorchjob.yaml, status-file.yaml) to validate the interpreter's functionality across various operations.

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 13, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a resource interpreter for PyTorchJob, which is a great addition. My review focuses on improving the correctness and robustness of the Lua scripts in customizations.yaml. I've identified a few issues, including incorrect logic for handling optional components, a bug in status aggregation that could lead to data loss, and some unused variables that can be cleaned up. Applying these suggestions will make the interpreter more reliable.

@codecov-commenter

codecov-commenter commented Oct 13, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.46%. Comparing base (9f3ca5c) to head (b7c8fad).
⚠️ Report is 145 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6826      +/-   ##
==========================================
+ Coverage   45.88%   46.46%   +0.58%     
==========================================
  Files         690      698       +8     
  Lines       57392    47824    -9568     
==========================================
- Hits        26333    22222    -4111     
+ Misses      29423    23931    -5492     
- Partials     1636     1671      +35     
Flag: unittests — Coverage Δ: 46.46% <ø> (+0.58%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@pokerfaceSad pokerfaceSad force-pushed the pytorchjobs_interpreter branch 4 times, most recently from cd0c284 to 744957a Compare October 14, 2025 07:45
Member

@RainbowMango RainbowMango left a comment

/assign

Sorry for letting this sit. Working on it.

@RainbowMango
Member

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a resource interpreter for PytorchJob, which is a great addition. The implementation covers component extraction, status reflection, status aggregation, and health interpretation using Lua scripts. The overall structure and logic are well-implemented. However, I've identified a significant issue in the statusAggregation logic where numeric fields with a value of 0 are incorrectly omitted from the aggregated status. This could lead to incomplete status information, particularly for replica counts. I have provided a specific comment with a suggested fix for this issue. The rest of the implementation, including the test files, appears to be correct and well-structured.
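The zero-value pitfall flagged here is worth illustrating: an aggregation that only writes a replica count when it is greater than zero silently drops legitimate 0 values. A minimal Lua sketch of an aggregation helper that preserves zeros — the function and field names are illustrative, not the PR's exact code:

```lua
-- Sum a numeric status field across member clusters.
-- Write the field whenever the member reported it, even when it is 0;
-- only skip it when the member status omitted the field entirely (nil).
local function sumField(total, value)
  if value == nil then
    return total
  end
  return (total or 0) + value
end

-- Aggregate replicaStatuses from a list of per-cluster status items.
local function aggregateReplicaStatuses(statusItems)
  local agg = {}
  for _, item in ipairs(statusItems) do
    for role, rs in pairs(item.replicaStatuses or {}) do
      agg[role] = agg[role] or {}
      agg[role].active = sumField(agg[role].active, rs.active)
      agg[role].succeeded = sumField(agg[role].succeeded, rs.succeeded)
      agg[role].failed = sumField(agg[role].failed, rs.failed)
    end
  end
  return agg
end
```

Testing against `nil` rather than truthiness-with-a-threshold is the key point: a cluster reporting `active: 0` still contributes that field to the aggregated status.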

Member

@RainbowMango RainbowMango left a comment

@pokerfaceSad I just realized that the pytorch-operator now has been archived and the effort has been merged into https://github.com/kubeflow/training-operator.

So there is no PytorchJob anymore, is there? Can you help confirm this?

@pokerfaceSad
Contributor Author

pokerfaceSad commented Nov 16, 2025

@pokerfaceSad I just realized that the pytorch-operator now has been archived and the effort has been merged into https://github.com/kubeflow/training-operator.

So there is no PytorchJob anymore, is there? Can you help confirm this?

@RainbowMango
Sorry for the late reply.

Kubeflow does not use separate CRDs for each framework in v2, so there is no PytorchJob in Kubeflow v2.
Instead, it implements all functionality within a single TrainJob CRD, and AI practitioners should use the Kubeflow Python SDK to convert training code into a TrainJob in Kubeflow v2.

However, the latest PytorchJob definition can be found in legacy Kubeflow v1.9.3.

@XiShanYongYe-Chang
Member

/assign

@RainbowMango
Member

@pokerfaceSad
Thanks for your confirmation.
After going through the current state of PyTorchJob in the Kubeflow ecosystem, we recognize that while PyTorchJob is indeed in a semi-deprecated state (maintained only in the release-1.9 branch without active development), it remains a mature workload that's still widely adopted in the industry. Many users continue to rely on PyTorchJob for their ML workloads.

Given this reality, we believe it would be valuable to provide default support in Karmada to help these users transition from single-cluster to multi-cluster environments seamlessly.

However, to ensure compatibility and maintainability, we need to base our PyTorchJob support on the official API definition from the release-1.9 branch of the kubeflow/trainer repository. This will ensure we're aligned with the stable version that users are actually deploying.

Could you please confirm that your implementation is based on the API spec from the release-1.9 branch? If any adjustments are needed to align with that specific version, we'd appreciate you making those updates.

@pokerfaceSad
Contributor Author

@pokerfaceSad Thanks for your confirmation. [...] Could you please confirm that your implementation is based on the API spec from the release-1.9 branch? If any adjustments are needed to align with that specific version, we'd appreciate you making those updates.

OK, I will confirm it :)

@@ -0,0 +1,33 @@
apiVersion: "kubeflow.org/v1"
Member

There may be no need for the double quotation marks; I checked other resources and found none use them.

Contributor Author

Done.

restartPolicy: OnFailure
template:
spec:
containers:
Member

Can we also define ResourceRequirements in the container? This way, when testing InterpretComponent, we can also cover the testing of ReplicaRequirements.
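Once ResourceRequirements are defined in the test container, the interpreter can derive per-replica demand from the pod template. A hedged Lua sketch of what such a helper might look like — the function name, and the simplification of reading only the first container, are assumptions for illustration, not the PR's actual script:

```lua
-- Hypothetical helper: derive per-replica resource requirements from the
-- first container of a replica's pod template.
local function getReplicaRequirements(replicaSpec)
  local containers = replicaSpec.template
      and replicaSpec.template.spec
      and replicaSpec.template.spec.containers
  if containers == nil or #containers == 0 then
    return nil
  end
  local resources = containers[1].resources or {}
  -- Prefer limits, fall back to requests, mirroring the test data's shape.
  return resources.limits or resources.requests
end
```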

Contributor Author

OK, I will add it.

Contributor Author

Done.


-- Master Component
local master_spec = get(observedObj, {"spec", "pytorchReplicaSpecs", "Master"})
if master_spec == nil then
Member

A relatively simple question: if there is no Master defined in the PyTorchJob, would that also mean Workers are not defined? Or, even if they are defined, would it be meaningless?

Contributor Author

Thanks for your reminder.

It is valid even when no Master is defined in a PyTorchJob. I will modify the logic here.

Contributor Author

Done.
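The fixed behavior under discussion — emitting a component only for roles that are actually present, so a worker-only job is still handled — can be sketched roughly as follows. The component shape and names are illustrative assumptions, not the merged script verbatim:

```lua
-- Illustrative GetComponents-style sketch: iterate over the known roles
-- and skip any that are absent, so a PyTorchJob without a Master is valid.
local function getComponents(observedObj)
  local components = {}
  local specs = (observedObj.spec and observedObj.spec.pytorchReplicaSpecs) or {}
  for _, role in ipairs({"Master", "Worker"}) do
    local spec = specs[role]
    if spec ~= nil then
      table.insert(components, {
        name = string.lower(role),
        replicas = spec.replicas or 1,  -- replicas defaults to 1 when unset
      })
    end
  end
  return components
end
```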

Comment on lines 91 to 97
-- Copy basic PyTorchJob status fields
status.conditions = observedObj.status.conditions
status.replicaStatuses = observedObj.status.replicaStatuses
status.startTime = observedObj.status.startTime
status.completionTime = observedObj.status.completionTime
status.lastReconcileTime = observedObj.status.lastReconcileTime
return status
Member

If we collect all fields in the status, can we skip implementing this hook point, since our default behavior already does that? WDYT

Contributor Author

You are right; I will remove this hook point implementation.

Contributor Author

Done.

@@ -0,0 +1,29 @@
status:
Member

We may need to add multiple statuses in the current file to test the status aggregation feature.

Contributor Author

Done.

Comment on lines +255 to +238
if condition.type == "Failed" and condition.status == "True" then
return false
Member

If there is no condition with type Failed, the final interpretation result is also true, right?

Contributor Author

Yes, so we will return true in L259

Member

Have you considered the situation where the condition has not yet been filled?

Contributor Author

Have you considered the situation where the condition has not yet been filled?

Will observedObj.status.conditions be nil in this situation?

We will return false in L249.
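The health logic being discussed can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's verbatim script; per the exchange above, a status whose conditions have not yet been filled is treated as not healthy:

```lua
-- Illustrative InterpretHealth-style sketch for a PyTorchJob:
-- unhealthy if any Failed condition is True, and not yet healthy
-- while the conditions list is still unset.
local function interpretHealth(observedObj)
  local conditions = observedObj.status and observedObj.status.conditions
  if conditions == nil then
    -- Conditions not filled yet: the job cannot be considered healthy.
    return false
  end
  for _, condition in ipairs(conditions) do
    if condition.type == "Failed" and condition.status == "True" then
      return false
    end
  end
  return true
end
```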

Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Thanks a lot~
In addition, I would like to ask whether it is necessary to handle the dependencyInterpretation hook point for the current resource.

resources:
limits:
cpu: 1
memory: 512Mi
\ No newline at end of file
Member

Can you help add an empty line at the end of the file?

Contributor Author

Done.

@pokerfaceSad
Contributor Author

Thanks a lot~ In addition, I would like to ask whether it is necessary to handle the dependencyInterpretation hook point for the current resource.

I'm not sure if the default behavior already handles it.

I've added the implementation for dependencyInterpretation. Please take a look.
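As a hedged illustration of what a dependencyInterpretation hook typically does — walking each replica's pod template and collecting referenced ConfigMaps and Secrets — here is a sketch; the traversal and names are assumptions for illustration, not the PR's actual implementation:

```lua
-- Illustrative GetDependencies-style sketch: collect ConfigMap/Secret
-- references from the volumes of every replica's pod template.
local function getDependencies(desiredObj)
  local deps = {}
  local specs = (desiredObj.spec and desiredObj.spec.pytorchReplicaSpecs) or {}
  for _, replicaSpec in pairs(specs) do
    local podSpec = replicaSpec.template and replicaSpec.template.spec
    if podSpec ~= nil then
      for _, volume in ipairs(podSpec.volumes or {}) do
        if volume.configMap ~= nil then
          table.insert(deps, {apiVersion = "v1", kind = "ConfigMap",
                              name = volume.configMap.name,
                              namespace = desiredObj.metadata.namespace})
        elseif volume.secret ~= nil then
          table.insert(deps, {apiVersion = "v1", kind = "Secret",
                              name = volume.secret.secretName,
                              namespace = desiredObj.metadata.namespace})
        end
      end
    end
  end
  return deps
end
```

A fuller version would also cover envFrom, env valueFrom references, imagePullSecrets, and serviceAccountName, which is why the default behavior may or may not suffice.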

Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Thanks, generally looks good to me, can you help squash the commits into one?

Signed-off-by: Xinyuan Lyu <[email protected]>

remove unused code

Signed-off-by: Xinyuan Lyu <[email protected]>

remove unused func

Signed-off-by: Xinyuan Lyu <[email protected]>

fmt code

Signed-off-by: Xinyuan Lyu <[email protected]>

Handle cases where Master is undefined, and remove the statusReflection hook point.

Signed-off-by: Xinyuan Lyu <[email protected]>

Add multiple statuses and ResourceRequirements in test files.

Signed-off-by: Xinyuan Lyu <[email protected]>

Remove InterpretStatus in test files.

Signed-off-by: Xinyuan Lyu <[email protected]>

Add dependencyInterpretation and test case.

Signed-off-by: Xinyuan Lyu <[email protected]>
@pokerfaceSad pokerfaceSad force-pushed the pytorchjobs_interpreter branch from 9dd6dce to b7c8fad Compare November 24, 2025 02:57
@pokerfaceSad
Contributor Author

Thanks, generally looks good to me, can you help squash the commits into one?

Done.

Thanks !

@XiShanYongYe-Chang
Member

/lgtm
Would you mind taking another look? @RainbowMango

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 24, 2025
@XiShanYongYe-Chang
Member

/retest

Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Let me merge it first.
/approve

@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: XiShanYongYe-Chang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 24, 2025
@karmada-bot karmada-bot merged commit a0dbc48 into karmada-io:master Nov 24, 2025
26 of 27 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
