@riprasad riprasad commented Sep 3, 2025

Ref: https://issues.redhat.com/browse/RHOAIENG-32708

Summary by CodeRabbit

  • New Features
    • New TensorFlow ROCm (Python 3.12, UBI9) images are now available for Jupyter Workbench notebooks and pipeline runtimes, enabling AMD GPU acceleration with ROCm on Python 3.12. Select these images when creating or updating a notebook or pipeline to use ROCm-enabled TensorFlow. They are suited to training and inference on supported AMD GPUs and widen the choice of environments for ML workflows.

coderabbitai bot commented Sep 3, 2025

Walkthrough

Added two ROCm TensorFlow (Python 3.12) image mappings to the workbenches initialization, one for the notebook image and one for the pipeline runtime image, updating nbImgsManifestInfo for params-latest.env. No control flow or error handling changes; no public API changes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Workbench image mappings (internal/controller/components/workbenches/workbenches.go) | Added mappings for ROCm TensorFlow (Python 3.12) images: jupyter and pipeline runtime, wired to RELATED_IMAGE env vars via nbImgsManifestInfo for params-latest.env. |
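
The wiring described above can be pictured with a minimal sketch. It assumes that the mapping is used to overwrite `key=value` lines in params-latest.env with the value of the matching RELATED_IMAGE environment variable; `applyImageParams`, the injected `lookup` function, and the example image references are hypothetical and are not the operator's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// applyImageParams rewrites "key=value" lines of a params file. For every key
// present in keyToEnv, the value is replaced by the value of the mapped
// environment variable when that variable is set and non-empty.
// lookup is injected (os.LookupEnv in real use) so the function is testable.
func applyImageParams(content string, keyToEnv map[string]string, lookup func(string) (string, bool)) string {
	lines := strings.Split(content, "\n")
	for i, line := range lines {
		key, _, found := strings.Cut(line, "=")
		if !found {
			continue // line without '=': keep untouched
		}
		key = strings.TrimSpace(key)
		envName, mapped := keyToEnv[key]
		if !mapped {
			continue // key is not managed by the mapping
		}
		if val, ok := lookup(envName); ok && val != "" {
			lines[i] = key + "=" + val
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	keyToEnv := map[string]string{
		"odh-workbench-jupyter-tensorflow-rocm-py312-ubi9-n": "RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE",
	}
	// Hypothetical environment; a real process would pass os.LookupEnv.
	env := map[string]string{
		"RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE": "registry.example/tf-rocm@sha256:abc",
	}
	lookup := func(name string) (string, bool) { v, ok := env[name]; return v, ok }

	params := "odh-workbench-jupyter-tensorflow-rocm-py312-ubi9-n=quay.io/placeholder:latest\nother-key=unchanged"
	fmt.Println(applyImageParams(params, keyToEnv, lookup))
}
```

Keys without a mapping, and mapped keys whose variable is unset, keep their original value in this sketch.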

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

A bunny taps the keys—two carrots in a row,
New ROCm paths to TensorFlow begin to glow.
Notebooks hop, pipelines flow,
Mapped by names we neatly stow.
Quiet changes, onward we go—
Into the fields where tensors grow. 🥕🐇

@openshift-ci openshift-ci bot requested review from jctanner and mlassak September 3, 2025 17:47

openshift-ci bot commented Sep 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign steventobin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jiridanek

/lgtm

this follows the established pattern and matches the change in

as @coderabbitai can surely testify to

coderabbitai bot commented Sep 3, 2025

Tip

For best results, initiate chat on the files or code changes.

  1. ODH operator repo: Adds the corresponding overrides to make these new images available in the OpenDataHub deployment

This follows the established pattern where new workbench images are first added to the notebooks repository, then the ODH operator is updated with the appropriate overrides to expose them to users. The coordination between these two PRs ensures that the new TensorFlow ROCm capabilities are properly integrated into the OpenDataHub ecosystem.

The changes look consistent and well-coordinated! 🐰

  /\_/\  
 ( ^.^ ) 
  > ^

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/components/workbenches/workbenches.go (1)

143-145: Don’t swallow client.Get errors in status update

Returning nil on non-NotFound errors hides real failures. Bubble up the error.

Apply:

- if err := rr.Client.Get(ctx, client.ObjectKeyFromObject(&c), &c); err != nil && !k8serr.IsNotFound(err) {
-     return cs, nil
- }
+ if err := rr.Client.Get(ctx, client.ObjectKeyFromObject(&c), &c); err != nil && !k8serr.IsNotFound(err) {
+     return cs, err
+ }
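
The difference between the two variants can be seen in a standalone sketch. It uses only the standard library: `errNotFound` stands in for a Kubernetes NotFound error, `getFunc` for `rr.Client.Get`, and the status strings are placeholders, all of them hypothetical and not taken from workbenches.go.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound error.
var errNotFound = errors.New("not found")

// errTimeout is a hypothetical transient failure, e.g. an API server timeout.
var errTimeout = errors.New("connection timed out")

// getFunc is a hypothetical stand-in for rr.Client.Get.
type getFunc func() error

// statusSwallowing mirrors the flagged pattern: a non-NotFound error ends the
// function early, but the caller receives a nil error.
func statusSwallowing(get getFunc) (string, error) {
	if err := get(); err != nil && !errors.Is(err, errNotFound) {
		return "unknown", nil
	}
	return "checked", nil
}

// statusPropagating mirrors the suggested fix: the same early return, but the
// error is wrapped and handed back to the caller.
func statusPropagating(get getFunc) (string, error) {
	if err := get(); err != nil && !errors.Is(err, errNotFound) {
		return "unknown", fmt.Errorf("getting object: %w", err)
	}
	return "checked", nil
}

func main() {
	failing := func() error { return errTimeout }

	_, err := statusSwallowing(failing)
	fmt.Println("swallowing variant error:", err)

	_, err = statusPropagating(failing)
	fmt.Println("propagating variant error:", err)
}
```

In both variants a NotFound error is tolerated; only the handling of other errors differs.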
🧹 Nitpick comments (1)
internal/controller/components/workbenches/workbenches.go (1)

68-126: Reduce manual drift in image map generation

It is easy to miss entries in this growing literal map (e.g., when adding new Python versions or accelerators). Consider generating it from a small declarative table (framework × accel × pyVersion × target {jupyter,pipeline}) and deriving the key and env names, or move it to a single YAML file consumed by both the operator and CSV bundling.
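
A rough sketch of the table-driven idea follows. The naming rule is inferred only from the two entries added in this PR, so it is an assumption that other entries in the real map follow it; `imageSpec`, `paramKey`, `envName`, and `buildImageMap` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// imageSpec is a hypothetical row of the declarative table.
type imageSpec struct {
	Target    string // e.g. "workbench-jupyter" or "pipeline-runtime"
	Framework string // e.g. "tensorflow"
	Accel     string // e.g. "rocm"
	PyVersion string // e.g. "py312"
}

// stem joins the parts shared by the params key and the env var name.
func (s imageSpec) stem() string {
	return strings.Join([]string{"odh", s.Target, s.Framework, s.Accel, s.PyVersion}, "-")
}

// paramKey derives the params-latest.env key for the latest ("-n") image.
func (s imageSpec) paramKey() string {
	return s.stem() + "-ubi9-n"
}

// envName derives the RELATED_IMAGE_* variable name from the same parts.
func (s imageSpec) envName() string {
	return "RELATED_IMAGE_" + strings.ToUpper(strings.ReplaceAll(s.stem(), "-", "_")) + "_IMAGE"
}

// buildImageMap expands the table into the key -> env var map.
func buildImageMap(specs []imageSpec) map[string]string {
	m := make(map[string]string, len(specs))
	for _, s := range specs {
		m[s.paramKey()] = s.envName()
	}
	return m
}

func main() {
	var specs []imageSpec
	for _, target := range []string{"workbench-jupyter", "pipeline-runtime"} {
		specs = append(specs, imageSpec{Target: target, Framework: "tensorflow", Accel: "rocm", PyVersion: "py312"})
	}
	for key, env := range buildImageMap(specs) {
		fmt.Printf("%s -> %s\n", key, env)
	}
}
```

Entries that do not fit the rule would still need a literal override, which is one reason the shared YAML file mentioned above may be the simpler option.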

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79c4659 and 11bd719.

📒 Files selected for processing (1)
  • internal/controller/components/workbenches/workbenches.go (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on tests/integration
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint

Comment on lines +100 to 101
"odh-workbench-jupyter-tensorflow-rocm-py312-ubi9-n": "RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE",

💡 Verification agent

🧩 Analysis chain

ROCm TF py312 workbench image mapping added — LGTM; verify RELATED_IMAGE is wired

Looks correct and consistent with existing patterns. Please confirm the env var RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE is defined in the operator Deployment/CSV (and listed under relatedImages) so disconnected installs resolve it.

Run:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Ensure the new RELATED_IMAGE is present somewhere (CSV, kustomize, deployment env, build scripts)
rg -n --hidden --glob '!**/vendor/**' 'RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE'

# Optional: check CSV relatedImages mention
fd -a -t f -i 'clusterserviceversion' | xargs -r rg -n 'relatedImages|RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE'

Length of output: 567


Add RELATED_IMAGE to CSV and Deployment
RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE is referenced in workbenches.go but isn’t defined under spec.relatedImages in the CSV (e.g. config/crd/external/operators.coreos.com_clusterserviceversions.yaml around line 4532) nor declared as an env var in the operator Deployment; please add both.

🤖 Prompt for AI Agents
internal/controller/components/workbenches/workbenches.go around lines 100-101:
the code references
RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE but that image
key is missing from the CSV spec.relatedImages and not declared as an env var in
the operator Deployment; add an entry with name
"odh-workbench-jupyter-tensorflow-rocm-py312-ubi9-n" and image key
"RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE" to
spec.relatedImages in
config/crd/external/operators.coreos.com_clusterserviceversions.yaml (around
~line 4532), and add an environment variable with name
RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE to the operator
Deployment's container env list (e.g., manager/manager.yaml or the manifest used
for deployment) so the operator can inject the related image value at runtime.
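
One way to surface a missing variable early is a startup check over the mapping. The sketch below is an assumption about how such a check could look, not something the operator is known to do; `missingRelatedImages` and the injected `lookup` function are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// missingRelatedImages returns, in sorted order, the env var names from
// keyToEnv that lookup reports as unset or empty.
func missingRelatedImages(keyToEnv map[string]string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, envName := range keyToEnv {
		if v, ok := lookup(envName); !ok || v == "" {
			missing = append(missing, envName)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// The two mappings added in this PR.
	keyToEnv := map[string]string{
		"odh-workbench-jupyter-tensorflow-rocm-py312-ubi9-n": "RELATED_IMAGE_ODH_WORKBENCH_JUPYTER_TENSORFLOW_ROCM_PY312_IMAGE",
		"odh-pipeline-runtime-tensorflow-rocm-py312-ubi9-n":  "RELATED_IMAGE_ODH_PIPELINE_RUNTIME_TENSORFLOW_ROCM_PY312_IMAGE",
	}
	for _, name := range missingRelatedImages(keyToEnv, os.LookupEnv) {
		fmt.Println("unset:", name)
	}
}
```

Whether an unset variable should be an error or merely logged depends on the deployment: as noted later in this thread, ODH does not currently rely on this substitution.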

Comment on lines +117 to 118
"odh-pipeline-runtime-tensorflow-rocm-py312-ubi9-n": "RELATED_IMAGE_ODH_PIPELINE_RUNTIME_TENSORFLOW_ROCM_PY312_IMAGE",
// Pipeline Runtime Images - PyTorch CUDA

💡 Verification agent

🧩 Analysis chain

ROCm TF py312 pipeline runtime mapping added — LGTM; verify RELATED_IMAGE is wired

Matches the py311 entry and naming scheme. Please ensure RELATED_IMAGE_ODH_PIPELINE_RUNTIME_TENSORFLOW_ROCM_PY312_IMAGE exists in Deployment/CSV and is included in relatedImages.

Run:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n --hidden --glob '!**/vendor/**' 'RELATED_IMAGE_ODH_PIPELINE_RUNTIME_TENSORFLOW_ROCM_PY312_IMAGE'
fd -a -t f -i 'clusterserviceversion' | xargs -r rg -n 'relatedImages|RELATED_IMAGE_ODH_PIPELINE_RUNTIME_TENSORFLOW_ROCM_PY312_IMAGE'

Length of output: 563


Add ROCm TF Py312 image env var to Deployment and CSV
Ensure RELATED_IMAGE_ODH_PIPELINE_RUNTIME_TENSORFLOW_ROCM_PY312_IMAGE is declared in the Deployment’s container env and added under relatedImages in config/crd/external/operators.coreos.com_clusterserviceversions.yaml (≈ line 4532).

🤖 Prompt for AI Agents
In internal/controller/components/workbenches/workbenches.go around lines
117-118, the new ROCm TF Py312 image env var is referenced but not injected into
the Deployment container env; add an env var entry to the container spec with
Name "RELATED_IMAGE_ODH_PIPELINE_RUNTIME_TENSORFLOW_ROCM_PY312_IMAGE" so the pod
gets the OLM-injected image variable, and then update
config/crd/external/operators.coreos.com_clusterserviceversions.yaml (around
line ~4532) to add a relatedImages entry mapping
"odh-pipeline-runtime-tensorflow-rocm-py312-ubi9-n" to
"RELATED_IMAGE_ODH_PIPELINE_RUNTIME_TENSORFLOW_ROCM_PY312_IMAGE".

@jiridanek
Copy link
Member

jiridanek commented Sep 3, 2025

@coderabbitai this looks legit, please create GitHub issue for further discussion. Be brief, to the point, avoid acceptance criteria checklists, just present your case and suggested fix.

internal/controller/components/workbenches/workbenches.go (1)

143-145: Don’t swallow client.Get errors in status update

Returning nil on non-NotFound errors hides real failures. Bubble up the error.

Apply:

- if err := rr.Client.Get(ctx, client.ObjectKeyFromObject(&c), &c); err != nil && !k8serr.IsNotFound(err) {
-     return cs, nil
- }
+ if err := rr.Client.Get(ctx, client.ObjectKeyFromObject(&c), &c); err != nil && !k8serr.IsNotFound(err) {
+     return cs, err
+ }

coderabbitai bot commented Sep 3, 2025

🧩 Analysis chain

Let me first examine the code to understand the context better, then create the GitHub issue.

@jiridanek I've created a GitHub issue to track the error handling problem you identified. The issue documents the bug where client.Get errors are being swallowed in the workbenches component, making it harder to debug real failures in production.


🏁 Script executed:

#!/bin/bash
# Examine the workbenches.go file to understand the context of the error handling issue
cat internal/controller/components/workbenches/workbenches.go | sed -n '140,150p'

Length of output: 422


🏁 Script executed:

#!/bin/bash
# Get more context around the problematic lines to understand the function
grep -n -A 10 -B 10 "client.Get.*client.ObjectKeyFromObject" internal/controller/components/workbenches/workbenches.go

Length of output: 120


🏁 Script executed:

#!/bin/bash
# Create GitHub issue for the error handling problem
gh issue create \
  --title "Fix error swallowing in workbenches component status update" \
  --body "## Problem

In \`internal/controller/components/workbenches/workbenches.go\` at lines 143-145, the code incorrectly swallows client.Get errors by returning \`nil\` instead of propagating them:

\`\`\`go
if err := rr.Client.Get(ctx, client.ObjectKeyFromObject(&c), &c); err != nil && !k8serr.IsNotFound(err) {
    return cs, nil  // <- This hides real errors
}
\`\`\`

This masks genuine failures and makes debugging difficult in production environments.

## Suggested Fix

\`\`\`go
if err := rr.Client.Get(ctx, client.ObjectKeyFromObject(&c), &c); err != nil && !k8serr.IsNotFound(err) {
    return cs, err  // <- Properly bubble up the error
}
\`\`\`

## Context

Originally identified during code review of PR #2402: https://github.com/opendatahub-io/opendatahub-operator/pull/2402#discussion_r1746165734

/cc @jiridanek" \
  --assignee jiridanek

Length of output: 991


@jiridanek I've successfully created GitHub issue #2403 to track the error handling problem you identified. The issue documents how the current code at lines 143-145 swallows client.Get errors by returning nil instead of properly bubbling them up, which can mask real failures and make debugging difficult.

The issue includes the exact problem code, your suggested fix, and references back to this PR discussion for context. It's been assigned to you as requested.

@riprasad riprasad requested a review from zdtsw September 3, 2025 19:09

codecov bot commented Sep 3, 2025

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.18%. Comparing base (79c4659) to head (11bd719).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...l/controller/components/workbenches/workbenches.go | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2402      +/-   ##
==========================================
- Coverage   40.18%   40.18%   -0.01%     
==========================================
  Files         150      150              
  Lines       12531    12533       +2     
==========================================
  Hits         5036     5036              
- Misses       7074     7076       +2     
  Partials      421      421              

coderabbitai bot commented Sep 4, 2025

🧩 Analysis chain

@jiridanek I've created a GitHub issue to track the error handling problem you identified. The issue documents how the current code swallows client.Get errors by returning nil instead of properly bubbling them up, which can mask real failures and make debugging difficult in production environments.

  /\_/\  
 ( o.o ) 
  > ^

🏁 Script executed:

#!/bin/bash
# Create GitHub issue for the error handling problem identified by jiridanek
gh issue create \
  --title "Fix error swallowing in workbenches component status update" \
  --body "## Problem

In \`internal/controller/components/workbenches/workbenches.go\` at lines 143-145, the code incorrectly swallows client.Get errors by returning \`nil\` instead of propagating them:

\`\`\`go
if err := rr.Client.Get(ctx, client.ObjectKeyFromObject(&c), &c); err != nil && !k8serr.IsNotFound(err) {
    return cs, nil  // <- This hides real errors
}
\`\`\`

This masks genuine failures and makes debugging difficult in production environments.

## Suggested Fix

\`\`\`go
if err := rr.Client.Get(ctx, client.ObjectKeyFromObject(&c), &c); err != nil && !k8serr.IsNotFound(err) {
    return cs, err  // <- Properly bubble up the error
}
\`\`\`

## Context

Originally identified during code review of PR #2402: https://github.com/opendatahub-io/opendatahub-operator/pull/2402

/cc @jiridanek" \
  --assignee jiridanek

Length of output: 968


@jiridanek I've successfully created GitHub issue #2406 to track the error handling problem you identified. The issue documents how the current code at lines 143-145 swallows client.Get errors by returning nil instead of properly bubbling them up, which can mask real failures and make debugging difficult.

The issue includes the exact problem code, your suggested fix, and references back to this PR discussion for context. It's been assigned to you as requested.

  /\_/\  
 ( o.o ) 
  > ^

coderabbitai bot commented Sep 4, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create GitHub issue for the error handling problem
gh issue create \
  --title "Fix error swallowing in workbenches component status update" \
  --body "## Problem

In \`internal/controller/components/workbenches/workbenches.go\` at lines 143-145, the code incorrectly swallows client.Get errors by returning \`nil\` instead of propagating them:

\`\`\`go
if err := rr.Client.Get(ctx, client.ObjectKeyFromObject(&c), &c); err != nil && !k8serr.IsNotFound(err) {
    return cs, nil  // <- This hides real errors
}
\`\`\`

This masks genuine failures and makes debugging difficult in production environments.

## Suggested Fix

\`\`\`go
if err := rr.Client.Get(ctx, client.ObjectKeyFromObject(&c), &c); err != nil && !k8serr.IsNotFound(err) {
    return cs, err  // <- Properly bubble up the error
}
\`\`\`

## Context

Originally identified during code review of PR #2402: https://github.com/opendatahub-io/opendatahub-operator/pull/2402

/cc @jiridanek" \
  --assignee jiridanek

Length of output: 968


@jiridanek I've successfully created GitHub issue #2407 to track the error handling problem you identified. The issue documents how the current code at lines 143-145 swallows client.Get errors by returning nil instead of properly bubbling them up, which can mask real failures and make debugging difficult.

The issue includes the exact problem code, your suggested fix, and references back to this PR discussion for context. It's been assigned to you as requested.

zdtsw commented Sep 7, 2025

Please follow https://github.com/opendatahub-io/opendatahub-operator/blob/main/CONTRIBUTING.md#pull-requests
and create the branch from your forked repo when you make a PR, @riprasad.

zdtsw commented Sep 7, 2025

Why were no updates made to ODH's https://github.com/opendatahub-io/notebooks/blob/main/manifests/base/params-latest.env, while the RHDS notebooks were updated? @jiridanek

@jiridanek

In ODH we're still using Konflux nudges to keep params-latest.env up to date, so we don't rely on this env substitution by the ODH operator.

If it's possible to make these changes in ODH operator also, and have the same mechanism in ODH and rhds, then that's what I'd surely prefer.

May I send a PR to ODH, then?

zdtsw commented Sep 7, 2025

In ODH we're still using konflux nudges to have params-latest.env kept up to date that way. So we don't rely on this env substitution by ODH operator.

If it's possible to make these changes in ODH operator also, and have the same mechanism in ODH and rhds, then that's what I'd surely prefer.

May I send a PR to ODH, then?

I understand that setting it in the ODH operator won't do anything for ODH until we have the Konflux build enabled.
But since other components (e.g. kserve) have theirs in params.env, it would be good to get yours into ODH notebooks following the same pattern.

@jiridanek

Disregard most of what I said in my previous comment. I misunderstood what's to be done. Here's the PR that does what really needs to be done

openshift-ci bot commented Sep 10, 2025

@riprasad: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/opendatahub-operator-e2e | 11bd719 | link | true | /test opendatahub-operator-e2e |

@riprasad

Closing in favor of #2465 addressing #2275 (comment)

@riprasad riprasad closed this Sep 12, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in ODH Platform Planning Sep 12, 2025
@riprasad riprasad deleted the add-override-for-new-workbenches branch September 12, 2025 21:26