-
Notifications
You must be signed in to change notification settings - Fork 1k
Add resource interpreter for PytorchJob #6826
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add resource interpreter for PytorchJob #6826
Conversation
|
Welcome @pokerfaceSad! It looks like this is your first PR to karmada-io/karmada 🎉 |
Summary of ChangesHello @pokerfaceSad, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances Karmada's capabilities by integrating a dedicated resource interpreter for Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a resource interpreter for PyTorchJob, which is a great addition. My review focuses on improving the correctness and robustness of the Lua scripts in customizations.yaml. I've identified a few issues, including incorrect logic for handling optional components, a bug in status aggregation that could lead to data loss, and some unused variables that can be cleaned up. Applying these suggestions will make the interpreter more reliable.
...ter/default/thirdparty/resourcecustomizations/kubeflow.org/v1/PyTorchJob/customizations.yaml
Outdated
Show resolved
Hide resolved
...ter/default/thirdparty/resourcecustomizations/kubeflow.org/v1/PyTorchJob/customizations.yaml
Show resolved
Hide resolved
...ter/default/thirdparty/resourcecustomizations/kubeflow.org/v1/PyTorchJob/customizations.yaml
Outdated
Show resolved
Hide resolved
...ter/default/thirdparty/resourcecustomizations/kubeflow.org/v1/PyTorchJob/customizations.yaml
Outdated
Show resolved
Hide resolved
...ter/default/thirdparty/resourcecustomizations/kubeflow.org/v1/PyTorchJob/customizations.yaml
Outdated
Show resolved
Hide resolved
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #6826 +/- ##
==========================================
+ Coverage 45.88% 46.46% +0.58%
==========================================
Files 690 698 +8
Lines 57392 47824 -9568
==========================================
- Hits 26333 22222 -4111
+ Misses 29423 23931 -5492
- Partials 1636 1671 +35
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
cd0c284 to
744957a
Compare
RainbowMango
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/assign
Sorry for letting this sit. Working on it.
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a resource interpreter for PytorchJob, which is a great addition. The implementation covers component extraction, status reflection, status aggregation, and health interpretation using Lua scripts. The overall structure and logic are well-implemented. However, I've identified a significant issue in the statusAggregation logic where numeric fields with a value of 0 are incorrectly omitted from the aggregated status. This could lead to incomplete status information, particularly for replica counts. I have provided a specific comment with a suggested fix for this issue. The rest of the implementation, including the test files, appears to be correct and well-structured.
...ter/default/thirdparty/resourcecustomizations/kubeflow.org/v1/PyTorchJob/customizations.yaml
Show resolved
Hide resolved
RainbowMango
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pokerfaceSad I just realized that the pytorch-operator now has been archived and the effort has been merged into https://github.com/kubeflow/training-operator.
So there is no PytorchJob anymore, isn't it? Can you help confirm it?
...hirdparty/resourcecustomizations/kubeflow.org/v1/PyTorchJob/testdata/desired-pytorchjob.yaml
Outdated
Show resolved
Hide resolved
...fault/thirdparty/resourcecustomizations/kubeflow.org/v1/PyTorchJob/customizations_tests.yaml
Outdated
Show resolved
Hide resolved
@RainbowMango Kubeflow does not use separate CRDs for each framework in v2, so there is no But the lastest PytorchJob definition can be find in legacy kubeflow v1.9.3. |
|
/assign |
|
@pokerfaceSad Given this reality, we believe it would be valuable to provide default support in Karmada to help these users transition from single-cluster to multi-cluster environments seamlessly. However, to ensure compatibility and maintainability, we need to base our PyTorchJob support on the official API definition from the release-1.9 branch of the kubeflow/trainer repository. This will ensure we're aligned with the stable version that users are actually deploying. Could you please confirm that your implementation is based on the API spec from the release-1.9 branch? If any adjustments are needed to align with that specific version, we'd appreciate you making those updates. |
OK, I will confirm it :) |
| @@ -0,0 +1,33 @@ | |||
| apiVersion: "kubeflow.org/v1" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may not need for the double quotation marks, I checked other resources and found none.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| restartPolicy: OnFailure | ||
| template: | ||
| spec: | ||
| containers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also define ResourceRequirements in the container? This way, when testing InterpretComponent, we can also cover the testing of ReplicaRequirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
| -- Master Component | ||
| local master_spec = get(observedObj, {"spec", "pytorchReplicaSpecs", "Master"}) | ||
| if master_spec == nil then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ask a relatively simple question: If there is no Master defined in PyTorchJob, would it also mean that Workers are not defined? Or, even if they are defined, would it be meaningless?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your reminder.
It is valid even no Master defined in PyTorchJob. I will modify the logic here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| -- Copy basic PyTorchJob status fields | ||
| status.conditions = observedObj.status.conditions | ||
| status.replicaStatuses = observedObj.status.replicaStatuses | ||
| status.startTime = observedObj.status.startTime | ||
| status.completionTime = observedObj.status.completionTime | ||
| status.lastReconcileTime = observedObj.status.lastReconcileTime | ||
| return status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we collect all fields in the state, can we skip implementing this hook point since our default behavior is already like that? WDYT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I will remove this hook point implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| @@ -0,0 +1,29 @@ | |||
| status: | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need to add multiple statuses in the current file to test the status aggregation feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| if condition.type == "Failed" and condition.status == "True" then | ||
| return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no condition with Type as Failed, the final interpretation result is also true, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, so we will return true in L259
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered the situation where the condition has not yet been filled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered the situation where the condition has not yet been filled?
Will observedObj.status.conditions be nil in this situation?
We will return false in L249.
XiShanYongYe-Chang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot~
In addition, I would like to ask whether it is necessary to process the current resources regarding the dependencyInterpretation hook point.
| resources: | ||
| limits: | ||
| cpu: 1 | ||
| memory: 512Mi No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help add an empty line in the end of the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I'm not sure if the default behavior already handles it. I've added the implementation for dependencyInterpretation, Please take a look. |
XiShanYongYe-Chang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, generally looks good to me, can you help squash the commits into one?
Signed-off-by: Xinyuan Lyu <[email protected]> remove unused code Signed-off-by: Xinyuan Lyu <[email protected]> remove unused func Signed-off-by: Xinyuan Lyu <[email protected]> fmt code Signed-off-by: Xinyuan Lyu <[email protected]> Handle cases where Master is undefined, and remove the statusReflection hook point. Signed-off-by: Xinyuan Lyu <[email protected]> Add multiple statuses and ResourceRequirements in test files. Signed-off-by: Xinyuan Lyu <[email protected]> Remove InterpretStatus in test files. Signed-off-by: Xinyuan Lyu <[email protected]> Add dependencyInterpretation and test case. Signed-off-by: Xinyuan Lyu <[email protected]>
9dd6dce to
b7c8fad
Compare
Done. Thanks ! |
|
/lgtm |
|
/retest |
XiShanYongYe-Chang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me merge it first.
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: XiShanYongYe-Chang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Add resource interpreter for PytorchJob
Which issue(s) this PR fixes:
Part of #6586
Does this PR introduce a user-facing change?: