[rhoai-2.22] Update N-1 image references with the new builds and update manifest package information #1259

atheo89 · 2025-07-24T14:19:04Z

Related to: https://issues.redhat.com/browse/RHOAIENG-30247

This PR backports updates to the RHOAI 2.22 release branch.

Included changes:

Updates N-1 references to point to the new builds
Updates the N-1 commit hash indicator
Bumps version and updates package information in the imagestream bases

atheo89 · 2025-07-24T14:25:33Z

Validation complains about the image size increase: both images grew by more than the 100 MB threshold.
What increase is acceptable?

Image name retrieved: 'odh-notebook-jupyter-pytorch-ubi9-python-3.11'
Image created: '2025-07-24T10:37:00.348902007Z'
Image size: 8710 MB
Image size changed by 139 MB (expected: 8571 MB; actual: 8710 MB; treshold: 100 MB).
ERROR: Image definition for 'odh-workbench-jupyter-pytorch-cuda-py311-ubi9-n-1' isn't okay!
----
Image name retrieved: 'odh-notebook-jupyter-trustyai-ubi9-python-3.11'
Image created: '2025-07-24T09:53:43.136688777Z'
Image size: 4483 MB
Image size changed by 286 MB (expected: 4197 MB; actual: 4483 MB; treshold: 100 MB).
ERROR: Image definition for 'odh-workbench-jupyter-trustyai-cpu-py311-ubi9-n-1' isn't okay!
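
The log above compares each image's actual size with an expected size and flags changes beyond a threshold. A minimal Python sketch of that comparison follows. The function name `check_image_size` is hypothetical, and treating a change of exactly the threshold as acceptable is an assumption; the real validation script may differ.

```python
def check_image_size(expected_mb: int, actual_mb: int, threshold_mb: int = 100) -> tuple[bool, int]:
    """Return (ok, delta_mb), where delta_mb is the absolute size change in MB.

    Assumption: a change equal to the threshold still passes.
    """
    delta_mb = abs(actual_mb - expected_mb)
    return delta_mb <= threshold_mb, delta_mb


# Expected and actual sizes copied from the validation log in this thread.
images = [
    ("odh-workbench-jupyter-pytorch-cuda-py311-ubi9-n-1", 8571, 8710),
    ("odh-workbench-jupyter-trustyai-cpu-py311-ubi9-n-1", 4197, 4483),
]

for name, expected, actual in images:
    ok, delta = check_image_size(expected, actual)
    status = "ok" if ok else "size change exceeds threshold"
    print(f"{name}: changed by {delta} MB ({status})")
```

With the numbers from the log, both images exceed the 100 MB threshold (139 MB and 286 MB), which matches the two reported errors.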

jiridanek · 2025-07-24T15:28:39Z

I'll investigate the trustyai size increase, and then possibly also the other image

jiridanek · 2025-07-24T16:04:18Z

For trustyai, the site-packages python directory got bigger

✔  Shell diff <(du -ah old | sort -hr | head -n 20) <(du -ah new | sort -hr | head -n 20) (Compare the top 20 largest files and directories between the 'old' and 'new' extracted filesystems to identify …

    1,20c1,20
    < 8.1G  old
    < 6.7G  old/opt/app-root
    < 6.7G  old/opt
    < 6.5G  old/opt/app-root/lib/python3.11/site-packages
    < 6.5G  old/opt/app-root/lib/python3.11
    < 6.5G  old/opt/app-root/lib
    < 2.8G  old/opt/app-root/lib/python3.11/site-packages/nvidia
    < 1.5G  old/opt/app-root/lib/python3.11/site-packages/torch
    < 1.4G  old/opt/app-root/lib/python3.11/site-packages/torch/lib
    < 1.3G  old/usr
    < 1.1G  old/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn/lib
    < 1.1G  old/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn
    < 815M  old/opt/app-root/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
    < 618M  old/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
    < 595M  old/opt/app-root/lib/python3.11/site-packages/nvidia/cublas
    < 594M  old/opt/app-root/lib/python3.11/site-packages/nvidia/cublas/lib
    < 491M  old/opt/app-root/lib/python3.11/site-packages/nvidia/cublas/lib/libcublasLt.so.12
    < 485M  old/usr/lib64
    < 453M  old/opt/app-root/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
    < 419M  old/opt/app-root/lib/python3.11/site-packages/triton
    ---
    > 8.5G  new
    > 7.1G  new/opt/app-root
    > 7.1G  new/opt
    > 6.9G  new/opt/app-root/lib/python3.11/site-packages
    > 6.9G  new/opt/app-root/lib/python3.11
    > 6.9G  new/opt/app-root/lib
    > 2.7G  new/opt/app-root/lib/python3.11/site-packages/nvidia
    > 1.5G  new/opt/app-root/lib/python3.11/site-packages/torch
    > 1.4G  new/opt/app-root/lib/python3.11/site-packages/torch/lib
    > 1.3G  new/usr
    > 976M  new/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn/lib
    > 976M  new/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn
    > 861M  new/opt/app-root/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
    > 685M  new/opt/app-root/lib/python3.11/site-packages/triton
    > 543M  new/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn/lib/libcudnn_engines_precompiled.so.9
    > 527M  new/opt/app-root/lib/python3.11/site-packages/triton/_C
    > 527M  new/opt/app-root/lib/python3.11/site-packages/nvidia/cublas/lib
    > 527M  new/opt/app-root/lib/python3.11/site-packages/nvidia/cublas
    > 512M  new/usr/lib64
    > 510M  new/opt/app-root/lib/python3.11/site-packages/triton/_C/libtriton.so

jiridanek · 2025-07-24T16:06:22Z

/lgtm

I believe this is not too concerning

First, I'll get the sizes of these packages in the old image.

✔  Shell du -sh old/opt/app-root/lib/python3.11/site-packages/{nvidia,torch,triton} (Get the size of the 'nvidia', 'torch', and 'triton' packages in the 'old' image.)

    2.8G  old/opt/app-root/lib/python3.11/site-packages/nvidia
    1.5G  old/opt/app-root/lib/python3.11/site-packages/torch
    419M  old/opt/app-root/lib/python3.11/site-packages/triton

✦ Now, I'll get the sizes of the same packages in the new image.

✔  Shell du -sh new/opt/app-root/lib/python3.11/site-packages/{nvidia,torch,triton} (Get the size of the 'nvidia', 'torch', and 'triton' packages in the 'new' image.)

    2.7G  new/opt/app-root/lib/python3.11/site-packages/nvidia
    1.5G  new/opt/app-root/lib/python3.11/site-packages/torch
    685M  new/opt/app-root/lib/python3.11/site-packages/triton

✦ Based on the output of the du commands, here is the comparison of the package sizes:

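| Package | Old image size | New image size | Difference |
|---|---|---|---|
| nvidia | 2.8G | 2.7G | about -100M |
| torch | 1.5G | 1.5G | about 0 |
| triton | 419M | 685M | +266M |

The analysis reveals the following:

- triton: This package is the primary source of the image size increase, growing by 266 MB.
- nvidia: This package shrank by roughly 100 MB (`du -sh` rounds to 0.1G, so the exact figure is uncertain).
- torch: No change is visible at the 0.1G resolution that `du -sh` reports.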
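
The per-package differences can be recomputed from the `du -sh` figures with a short Python sketch. The helper names `du_to_mib` and `size_deltas` are hypothetical, and the conversion assumes `du -h` reports powers of 1024, as GNU `du` does by default. Values printed in G carry only one decimal place, so the deltas derived from them are approximate.

```python
# Multipliers that convert a du -h suffix to MiB (assumes powers of 1024).
UNITS_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024}


def du_to_mib(text: str) -> float:
    """Convert a human-readable du size such as '2.8G' or '419M' to MiB."""
    return float(text[:-1]) * UNITS_TO_MIB[text[-1]]


def size_deltas(old: dict[str, str], new: dict[str, str]) -> dict[str, float]:
    """Return new minus old, in MiB, for every package present in `old`."""
    return {pkg: du_to_mib(new[pkg]) - du_to_mib(old[pkg]) for pkg in old}


# Sizes copied from the du -sh output in this comment.
old_sizes = {"nvidia": "2.8G", "torch": "1.5G", "triton": "419M"}
new_sizes = {"nvidia": "2.7G", "torch": "1.5G", "triton": "685M"}

for pkg, delta in size_deltas(old_sizes, new_sizes).items():
    print(f"{pkg}: {delta:+.0f} MiB")
```

Only the triton figure is exact to the megabyte here, because both of its sizes were printed in M.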

jiridanek · 2025-07-24T16:37:42Z

work out a detailed analysis plan for the other image,

- odh-workbench-jupyter-pytorch-cuda-py311-ubi9-n-1=quay.io/modh/odh-pytorch-notebook@sha256:20f7ab8e7954106ea5e22f3ee0ba8bc7b03975e5735049a765e021aa7eb06861
+ odh-workbench-jupyter-pytorch-cuda-py311-ubi9-n-1=quay.io/modh/odh-pytorch-notebook@sha256:93f7c49aceb840662686bb13a0d33d7c3f4a38ed3926e332c28d87fa4ba6eae9

Excellent. This corrected analysis is precise and provides the clear evidence needed. You've successfully navigated the investigation to its conclusion.

Here is the final summary of the findings.

## Final Analysis: Package Size Breakdown 📊

The detailed comparison of uncompressed package sizes pinpoints the exact contributors to the change.

Primary Drivers of Increase: The significant growth comes from just a few packages related to CUDA and machine learning execution environments.
- cusparselt: Grew by $405.6 \text{ MB}$
- triton: Grew by $260.7 \text{ MB}$
- nvidia: Grew by $64.1 \text{ MB}$
Offsetting Decrease: Interestingly, a major package shrank, which helped mitigate the overall size increase.
- torch: Shrank by $133.0 \text{ MB}$

The net change from just these four key packages is an increase of $597.4 \text{ MB}$ in uncompressed size on disk.
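
The net figure is plain arithmetic over the four per-package deltas quoted above. A small Python sketch to check it (the variable names are illustrative only):

```python
# Per-package uncompressed size changes in MB, as quoted in the summary.
deltas_mb = {
    "cusparselt": 405.6,
    "triton": 260.7,
    "nvidia": 64.1,
    "torch": -133.0,
}

net_mb = sum(deltas_mb.values())
print(f"net change: {net_mb:+.1f} MB")
```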

## Uncompressed vs. Compressed Size

You correctly noted that the final image size only changed by $139 \text{ MB}$. This is because container image layers are stored in a compressed format, so the large increase in uncompressed library files ($+597.4 \text{ MB}$) corresponds to a much smaller change in the final image ($+139 \text{ MB}$). This is consistent with the analysis and points to the identified packages as the cause.

jiridanek · 2025-07-24T16:38:06Z

what is cusparselt

✦ cuSPARSELt is a high-performance library developed by NVIDIA for accelerating sparse matrix-matrix multiplication (SpMM) on modern NVIDIA GPUs.

Here's a breakdown:

- cu: Stands for CUDA, indicating it's part of NVIDIA's parallel computing platform.
- SPARSE: Refers to sparse matrices, in which most elements are zero. These are common in scientific computing and arise in deep learning when model weights (for example in Transformers) are pruned.
- Lt: Usually read as "lightweight", following the naming of cuBLASLt: a library with a narrow API dedicated to matrix-matrix operations in which at least one operand is sparse.

In short, it's a specialized tool that allows deep learning frameworks like PyTorch to perform calculations with sparse matrices much faster by using the dedicated Tensor Core hardware on NVIDIA GPUs.
The significant size increase you saw is likely due to a newer version being included that has more features, pre-compiled kernels for different GPU architectures, or expanded hardware support.

jiridanek · 2025-07-24T16:40:22Z

@atheo89 would you update the pr so the checks are passing please? I believe it is ok

manifests/base/jupyter-pytorch-notebook-imagestream.yaml


atheo89 · 2025-07-25T08:58:26Z

At some point the package validation test will fail with this error, which is expected:
2025-07-25 08:28:02 - ERROR - Transformers version check failed. Expected '4.52', found 'Version: 4.49.0'.
The N version image has not been updated yet; see https://redhat-internal.slack.com/archives/C07TF3MBMMW/p1753369526172749 and #1258.
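
The error above compares an expected `major.minor` version with a line that looks like `pip show` output. A minimal Python sketch of such a comparison follows. The function `version_matches` is hypothetical and is not taken from the actual validation test; it also assumes the expected value is a dotted prefix of the installed version.

```python
def version_matches(expected_prefix: str, found_line: str) -> bool:
    """Check whether a 'Version: X.Y.Z' line starts with the expected X.Y.

    Components are compared one by one, so '4.5' does not match '4.52.1'.
    """
    found = found_line.removeprefix("Version:").strip()
    expected_parts = expected_prefix.split(".")
    return found.split(".")[: len(expected_parts)] == expected_parts


# Values copied from the error message in this comment.
print(version_matches("4.52", "Version: 4.49.0"))
```

For the values in the log this returns `False`, matching the reported failure: the image still ships Transformers 4.49.0 while the manifest expects 4.52.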

atheo89 · 2025-07-25T10:17:26Z

/approve

The only failure was the expected one mentioned above.

openshift-ci · 2025-07-25T10:17:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: atheo89

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [atheo89]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

atheo89 added 2 commits July 24, 2025 15:57

Update n-1 references on the params.env

e60fa52

Update n-1 commits

886f24c

openshift-ci bot requested review from andyatmiami and daniellutz July 24, 2025 14:19

atheo89 changed the title ~~Update N-1 image references with the new builds and update manifest package information~~ [rhoai-2.22] Update N-1 image references with the new builds and update manifest package information Jul 24, 2025

atheo89 requested review from jiridanek and removed request for andyatmiami July 24, 2025 14:19

openshift-ci bot assigned jiridanek Jul 24, 2025

openshift-ci bot added the lgtm label Jul 24, 2025

jiridanek removed the lgtm label Jul 24, 2025

jiridanek reviewed Jul 24, 2025

manifests/base/jupyter-pytorch-notebook-imagestream.yaml (resolved)

jiridanek mentioned this pull request Jul 24, 2025

RHOAIENG-30247: bump PyTorch and ROCm-PyTorch, and some other 2024b updated notebook imagestreams opendatahub-io/notebooks#1496

Merged

3 tasks

atheo89 force-pushed the update-manifests-222 branch from 788120c to 866b592 Compare July 25, 2025 07:32

Update imagestream python versions annotations to point to the correct versions

61126f8

atheo89 force-pushed the update-manifests-222 branch from 866b592 to 61126f8 Compare July 25, 2025 08:55

openshift-ci bot added the approved label Jul 25, 2025

atheo89 merged commit 50486ca into red-hat-data-services:rhoai-2.22 Jul 25, 2025
7 of 10 checks passed

jiridanek added a commit to jiridanek/notebooks that referenced this pull request Jul 25, 2025

[2.22] RHOAIENG-30247: update expected image sizes in `check-params-env.sh` for PyTorch and TrustyAI environments

643871a

Discussed in * red-hat-data-services#1259 (comment)

jiridanek mentioned this pull request Jul 25, 2025

[2.22] RHOAIENG-30247: update expected image sizes in check-params-env.sh for PyTorch and TrustyAI environments #1293

Merged

jiridanek added a commit that referenced this pull request Jul 25, 2025

[2.22] RHOAIENG-30247: update expected image sizes in `check-params-env.sh` for PyTorch and TrustyAI environments (#1293)

536f7d7

Discussed in * #1259 (comment)

rhods-devops-app bot pushed a commit that referenced this pull request Aug 21, 2025

RHOAIENG-27434: create ROCm Tensorflow Python 3.12 Image (#1259)

406690c
