@atheo89
Member

@atheo89 atheo89 commented Jul 24, 2025

Related to: https://issues.redhat.com/browse/RHOAIENG-30247

This PR backports updates to the RHOAI 2.22 release branch.

Included changes:

  • Updates N-1 references to point to the new builds
  • Updates the N-1 commit hash indicator
  • Bumps version and updates package information in the imagestream bases

@openshift-ci openshift-ci bot requested review from andyatmiami and daniellutz July 24, 2025 14:19
@atheo89 atheo89 changed the title from "Update N-1 image references with the new builds and update manifest package information" to "[rhoai-2.22] Update N-1 image references with the new builds and update manifest package information" Jul 24, 2025
@atheo89 atheo89 requested review from jiridanek and removed request for andyatmiami July 24, 2025 14:19
@atheo89
Member Author

atheo89 commented Jul 24, 2025

Validation complains about image size increases. Both images grew by more than 100 MB, which is the acceptance threshold:

Image name retrieved: 'odh-notebook-jupyter-pytorch-ubi9-python-3.11'
Image created: '2025-07-24T10:37:00.348902007Z'
Image size: 8710 MB
Image size changed by 139 MB (expected: 8571 MB; actual: 8710 MB; treshold: 100 MB).
ERROR: Image definition for 'odh-workbench-jupyter-pytorch-cuda-py311-ubi9-n-1' isn't okay!
----
Image name retrieved: 'odh-notebook-jupyter-trustyai-ubi9-python-3.11'
Image created: '2025-07-24T09:53:43.136688777Z'
Image size: 4483 MB
Image size changed by 286 MB (expected: 4197 MB; actual: 4483 MB; treshold: 100 MB).
ERROR: Image definition for 'odh-workbench-jupyter-trustyai-cpu-py311-ubi9-n-1' isn't okay!
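
For reference, the arithmetic behind the two errors can be reproduced with a minimal sketch. The validation script itself is not shown in this thread, so the function names and the symmetric, strictly-greater-than comparison below are assumptions made for illustration; only the MB figures come from the output above.

```python
def size_change_mb(expected_mb: int, actual_mb: int) -> int:
    """Signed size change in MB (positive means the image grew)."""
    return actual_mb - expected_mb


def exceeds_threshold(expected_mb: int, actual_mb: int, threshold_mb: int = 100) -> bool:
    """True when the absolute size change is larger than the threshold.

    Assumption: the check is symmetric and strictly greater-than; the
    real validation script may be implemented differently.
    """
    return abs(size_change_mb(expected_mb, actual_mb)) > threshold_mb


# (expected, actual) sizes in MB, copied from the validation output above.
reported = {
    "odh-workbench-jupyter-pytorch-cuda-py311-ubi9-n-1": (8571, 8710),
    "odh-workbench-jupyter-trustyai-cpu-py311-ubi9-n-1": (4197, 4483),
}

for name, (expected, actual) in reported.items():
    delta = size_change_mb(expected, actual)
    status = "FAIL" if exceeds_threshold(expected, actual) else "ok"
    print(f"{name}: {delta:+d} MB -> {status}")
```

Both images land above the 100 MB limit (139 MB and 286 MB), which matches the two ERROR lines.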

@jiridanek
Member

I'll investigate the trustyai size increase, and then possibly also the other image

@jiridanek
Member

For trustyai, the Python site-packages directory got bigger:

    ✔  Shell diff <(du -ah old | sort -hr | head -n 20) <(du -ah new | sort -hr | head -n 20) (Compare the top 20 largest files and directories between the 'old' and 'new' extracted filesystems to identify …

    1,20c1,20
    < 8.1G  old
    < 6.7G  old/opt/app-root
    < 6.7G  old/opt
    < 6.5G  old/opt/app-root/lib/python3.11/site-packages
    < 6.5G  old/opt/app-root/lib/python3.11
    < 6.5G  old/opt/app-root/lib
    < 2.8G  old/opt/app-root/lib/python3.11/site-packages/nvidia
    < 1.5G  old/opt/app-root/lib/python3.11/site-packages/torch
    < 1.4G  old/opt/app-root/lib/python3.11/site-packages/torch/lib
    < 1.3G  old/usr
    < 1.1G  old/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn/lib
    < 1.1G  old/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn
    < 815M  old/opt/app-root/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
    < 618M  old/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn/lib/libcudnn_cnn_infer.so.8
    < 595M  old/opt/app-root/lib/python3.11/site-packages/nvidia/cublas
    < 594M  old/opt/app-root/lib/python3.11/site-packages/nvidia/cublas/lib
    < 491M  old/opt/app-root/lib/python3.11/site-packages/nvidia/cublas/lib/libcublasLt.so.12
    < 485M  old/usr/lib64
    < 453M  old/opt/app-root/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
    < 419M  old/opt/app-root/lib/python3.11/site-packages/triton
    ---
    > 8.5G  new
    > 7.1G  new/opt/app-root
    > 7.1G  new/opt
    > 6.9G  new/opt/app-root/lib/python3.11/site-packages
    > 6.9G  new/opt/app-root/lib/python3.11
    > 6.9G  new/opt/app-root/lib
    > 2.7G  new/opt/app-root/lib/python3.11/site-packages/nvidia
    > 1.5G  new/opt/app-root/lib/python3.11/site-packages/torch
    > 1.4G  new/opt/app-root/lib/python3.11/site-packages/torch/lib
    > 1.3G  new/usr
    > 976M  new/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn/lib
    > 976M  new/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn
    > 861M  new/opt/app-root/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
    > 685M  new/opt/app-root/lib/python3.11/site-packages/triton
    > 543M  new/opt/app-root/lib/python3.11/site-packages/nvidia/cudnn/lib/libcudnn_engines_precompiled.so.9
    > 527M  new/opt/app-root/lib/python3.11/site-packages/triton/_C
    > 527M  new/opt/app-root/lib/python3.11/site-packages/nvidia/cublas/lib
    > 527M  new/opt/app-root/lib/python3.11/site-packages/nvidia/cublas
    > 512M  new/usr/lib64
    > 510M  new/opt/app-root/lib/python3.11/site-packages/triton/_C/libtriton.so

@jiridanek
Member

jiridanek commented Jul 24, 2025

/lgtm

I believe this is not too concerning

First, I'll get the sizes of these packages in the old image.

    ✔  Shell du -sh old/opt/app-root/lib/python3.11/site-packages/{nvidia,torch,triton} (Get the size of the 'nvidia', 'torch', and 'triton' packages in the 'old' image.)

    2.8G  old/opt/app-root/lib/python3.11/site-packages/nvidia
    1.5G  old/opt/app-root/lib/python3.11/site-packages/torch
    419M  old/opt/app-root/lib/python3.11/site-packages/triton

✦ Now, I'll get the sizes of the same packages in the new image.

    ✔  Shell du -sh new/opt/app-root/lib/python3.11/site-packages/{nvidia,torch,triton} (Get the size of the 'nvidia', 'torch', and 'triton' packages in the 'new' image.)

    2.7G  new/opt/app-root/lib/python3.11/site-packages/nvidia
    1.5G  new/opt/app-root/lib/python3.11/site-packages/torch
    685M  new/opt/app-root/lib/python3.11/site-packages/triton

✦ Based on the output of the du commands, here is the comparison of the package sizes:

| Package | Old Image Size | New Image Size | Difference |
|---------|----------------|----------------|------------|
| nvidia  | 2.8G           | 2.7G           | -100M      |
| torch   | 1.5G           | 1.5G           | 0          |
| triton  | 419M           | 685M           | +266M      |

The analysis reveals the following:

  • triton: This package is the primary source of the image size increase, growing by 266MB.
  • nvidia: This package decreased in size by roughly 100MB (the du -h figures are rounded to 0.1G, so this is approximate).
  • torch: This package remained the same size.
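
The differences can be recomputed from the `du -sh` figures with a short sketch. The helper name `du_to_mib` and the unit table are illustrative assumptions; the only inputs are the six sizes quoted in this comment. Because `du -h` rounds gigabyte values to one decimal place, the nvidia result is approximate (about -102 MiB when 1G is taken as 1024M, which the comparison rounds to -100M).

```python
# Multipliers to MiB for the unit suffixes that `du -h` prints.
# Assumption: only K, M and G appear, as in the output quoted above.
UNITS_TO_MIB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def du_to_mib(text: str) -> float:
    """Convert a `du -h` style value such as '2.8G' or '419M' to MiB."""
    return float(text[:-1]) * UNITS_TO_MIB[text[-1]]


# Sizes copied from the two `du -sh` runs above.
old = {"nvidia": "2.8G", "torch": "1.5G", "triton": "419M"}
new = {"nvidia": "2.7G", "torch": "1.5G", "triton": "685M"}

deltas = {name: du_to_mib(new[name]) - du_to_mib(old[name]) for name in old}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.0f} MiB")
```

Only the triton figure is exact here, since both of its inputs are reported in megabytes.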

@jiridanek
Member

Work out a detailed analysis plan for the other image:

- odh-workbench-jupyter-pytorch-cuda-py311-ubi9-n-1=quay.io/modh/odh-pytorch-notebook@sha256:20f7ab8e7954106ea5e22f3ee0ba8bc7b03975e5735049a765e021aa7eb06861
+ odh-workbench-jupyter-pytorch-cuda-py311-ubi9-n-1=quay.io/modh/odh-pytorch-notebook@sha256:93f7c49aceb840662686bb13a0d33d7c3f4a38ed3926e332c28d87fa4ba6eae9

Excellent. This corrected analysis is precise and provides the clear evidence needed. You've successfully navigated the investigation to its conclusion.

Here is the final summary of the findings.

## Final Analysis: Package Size Breakdown 📊

The detailed comparison of uncompressed package sizes pinpoints the exact contributors to the change.

  • Primary Drivers of Increase: The significant growth comes from just a few packages related to CUDA and machine learning execution environments.
    • cusparselt: Grew by $405.6 \text{ MB}$
    • triton: Grew by $260.7 \text{ MB}$
    • nvidia: Grew by $64.1 \text{ MB}$
  • Offsetting Decrease: Interestingly, a major package shrank, which helped mitigate the overall size increase.
    • torch: Shrank by $133.0 \text{ MB}$

The net change from just these four key packages is an increase of $597.4 \text{ MB}$ in uncompressed size on disk.

## Uncompressed vs. Compressed Size

You correctly noted that the final image size only changed by $139 \text{ MB}$. This is because container image layers are stored in a compressed format: the large increase in uncompressed library files ($+597.4 \text{ MB}$) compresses down to a much smaller change in the reported image size ($+139 \text{ MB}$). The figures are therefore consistent with the identified packages being the cause of the increase.
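
The net figure and its relationship to the reported image change can be checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this comment; the variable names are illustrative, and the ratio is a rough indicator rather than a measured compression ratio, since other files in the image changed as well.

```python
# Uncompressed per-package changes in MB, copied from the summary above.
changes_mb = {
    "cusparselt": 405.6,
    "triton": 260.7,
    "nvidia": 64.1,
    "torch": -133.0,
}

# Net change on disk across the four packages.
net_mb = sum(changes_mb.values())

# Size change reported by the validation check for the PyTorch image.
reported_image_change_mb = 139

# Share of the uncompressed growth that shows up in the reported image size.
ratio = reported_image_change_mb / net_mb

print(f"net uncompressed change: {net_mb:+.1f} MB")
print(f"reported / uncompressed: {ratio:.2f}")
```

The sum comes to +597.4 MB, and the reported change is a little under a quarter of that, which is plausible for compressed shared libraries but does not by itself prove the attribution.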

@jiridanek
Member

what is cusparselt

✦ cuSPARSELt is a high-performance CUDA library developed by NVIDIA for matrix-matrix multiplication in which at least one operand is a structured sparse matrix, accelerated on modern NVIDIA GPUs.

Here's a breakdown:

  • cu: Stands for CUDA, indicating it's part of NVIDIA's parallel computing platform.
  • SPARSE: Refers to sparse matrices, which are matrices where most of the elements are zero. These are very common in deep learning models (like Transformers) and scientific computing.
  • Lt: Commonly read as "lightweight", as with cuBLASLt, signifying a small, specialized API focused on one class of operation.

In short, it's a specialized tool that allows deep learning frameworks like PyTorch to perform calculations with sparse matrices much faster by using the dedicated Tensor Core hardware on NVIDIA GPUs.
The significant size increase you saw is likely due to a newer version being included that has more features, pre-compiled kernels for different GPU architectures, or expanded hardware support.

@jiridanek jiridanek removed the lgtm label Jul 24, 2025
@jiridanek
Member

@atheo89 would you update the PR so that the checks pass, please? I believe it is OK.

@atheo89 atheo89 force-pushed the update-manifests-222 branch from 866b592 to 61126f8 Compare July 25, 2025 08:55
@atheo89
Member Author

atheo89 commented Jul 25, 2025

At some point the package validation test will fail with this error:
2025-07-25 08:28:02 - ERROR - Transformers version check failed. Expected '4.52', found 'Version: 4.49.0'.
This is expected, as the N version image has not been updated yet; see https://redhat-internal.slack.com/archives/C07TF3MBMMW/p1753369526172749 and #1258.
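
To illustrate why 4.49.0 is rejected when 4.52 is expected, here is a minimal sketch of a major.minor prefix comparison. The actual package validation test is not shown in this thread, so the function name and the matching rule are assumptions; the "Version: ..." input format is taken from the error message.

```python
def version_matches(expected_prefix: str, version_line: str) -> bool:
    """Check that the found version starts with the expected components.

    Assumption: `version_line` looks like 'Version: 4.49.0' and the
    expected value is a dotted prefix such as '4.52'. Components are
    compared as whole strings, so '4.5' does not match '4.52.0'.
    """
    found = version_line.split(":", 1)[-1].strip()
    expected_parts = expected_prefix.split(".")
    found_parts = found.split(".")
    return found_parts[: len(expected_parts)] == expected_parts


# The case from the error message: the image still ships 4.49.0.
print(version_matches("4.52", "Version: 4.49.0"))

# A hypothetical image that already carries a 4.52.x release.
print(version_matches("4.52", "Version: 4.52.3"))
```

Under these assumptions the first call returns False and the second True, which is the behaviour described: the check keeps failing until the N version image carries a 4.52.x release.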

@atheo89
Member Author

atheo89 commented Jul 25, 2025

/approve

The only failure was the expected one mentioned above.

@openshift-ci

openshift-ci bot commented Jul 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: atheo89

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atheo89 atheo89 merged commit 50486ca into red-hat-data-services:rhoai-2.22 Jul 25, 2025
7 of 10 checks passed
jiridanek added a commit to jiridanek/notebooks that referenced this pull request Jul 25, 2025
jiridanek added a commit that referenced this pull request Jul 25, 2025
…nv.sh` for PyTorch and TrustyAI environments (#1293)

Discussed in
* #1259 (comment)