
Added fallback to preload cudnn dlls from nvidia cudnn venv package or torch venv package #1135

Merged
hthadicherla merged 5 commits into main from hthadicherla/add-cudnn-fix
Apr 1, 2026

Conversation

@hthadicherla
Contributor

@hthadicherla hthadicherla commented Mar 30, 2026

What does this PR do?

Type of change: Bug fix

A QA team testing the modelopt 0.43 release pointed out that we can install the nvidia-cudnn PyPI packages and use ort.preload_dlls() to load the DLLs from the Python venv instead of searching only the system path.

Here is the info about the onnxruntime.preload_dlls() function (screenshot not reproduced here).

So I added a fallback to the system-path cuDNN search that preloads the DLLs; if that also fails, an exception is raised.

Testing

Tested quantization by installing the nvidia-cudnn-cu12 package and removing the cuDNN DLLs from the system path. Working as expected.

Summary by CodeRabbit

  • Bug Fixes
    • Improved startup handling when CUDA/cuDNN libraries are missing: the app now attempts a conditional preload from installed Python packages (when supported), logs captured preload output for diagnostics, warns on preload errors, and only raises an error if preload ultimately fails.
  • Documentation
    • Error messages now better explain missing-library issues, note platform/version considerations, and recommend installing a cuDNN pip package (e.g., nvidia-cudnn-cu12) or setting the appropriate environment variable.

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
@hthadicherla hthadicherla requested a review from a team as a code owner March 30, 2026 05:25
@hthadicherla hthadicherla requested a review from gcunhase March 30, 2026 05:25
@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4e714f5a-e617-4b8c-85b2-8fecbd61798a

📥 Commits

Reviewing files that changed from the base of the PR and between bf0674b and 2430961.

📒 Files selected for processing (1)
  • modelopt/onnx/quantization/ort_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/onnx/quantization/ort_utils.py

📝 Walkthrough

Walkthrough

Updated _check_for_libcudnn() to attempt a conditional fallback via onnxruntime.preload_dlls() when cuDNN libraries are not found in the system loader path; captures and logs preload stdout/stderr, logs warnings on preload exceptions/failures, and expands the final FileNotFoundError message to include site-packages and cuDNN pip package installation guidance.

Changes

Cohort / File(s) Summary
Quantization utils
modelopt/onnx/quantization/ort_utils.py
Updated _check_for_libcudnn() failure path: when no cuDNN library pattern found in the system loader env, log a warning, conditionally call onnxruntime.preload_dlls() (if available and not running Python 3.10), capture stdout/stderr and log any captured output at debug level, treat a successful preload as a found libcudnn (return True), log a warning if preload raises an exception, and raise a FileNotFoundError with an expanded message that mentions missing libs in both env and site-packages and suggests installing cuDNN pip packages (e.g., nvidia-cudnn-cu12 for Python >=3.11) or setting loader env vars. Added sys import for Python-version check.
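The fallback flow described above can be sketched as follows. This is a minimal, hypothetical helper, not the actual modelopt code: the onnxruntime module is passed in as a parameter (and the Python version made overridable) purely so the sketch is self-contained and testable without onnxruntime installed.

```python
import io
import sys
from contextlib import redirect_stderr, redirect_stdout


def try_preload_cudnn(ort, py_version=sys.version_info[:2]) -> bool:
    """Sketch of the preload fallback: return True if preload_dlls() succeeds."""
    # preload_dlls() only exists in newer onnxruntime builds, and the PR
    # skips the fallback when running Python 3.10.
    if py_version == (3, 10) or not hasattr(ort, "preload_dlls"):
        return False
    out, err = io.StringIO(), io.StringIO()
    try:
        # Capture stdout/stderr so any preload chatter can be surfaced at
        # debug level instead of leaking to the console.
        with redirect_stdout(out), redirect_stderr(err):
            ort.preload_dlls()
    except Exception as exc:
        print(f"warning: onnxruntime.preload_dlls() failed: {exc}")
        return False
    captured = (out.getvalue() + err.getvalue()).strip()
    if captured:
        print(f"debug: preload output: {captured}")
    return True
```

In the real module the caller would pass the imported onnxruntime and raise the expanded FileNotFoundError when this returns False.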

Sequence Diagram(s)

sequenceDiagram
    participant Check as _check_for_libcudnn()
    participant Env as System Loader Env (PATH/LD_LIBRARY_PATH)
    participant ORT as onnxruntime.preload_dlls()

    Check->>Env: search for cuDNN library pattern
    alt found in Env
        Env-->>Check: pattern match -> return True
    else not found
        Env-->>Check: no match -> log warning
        Check->>ORT: if available and py != 3.10, call preload_dlls()
        alt preload succeeds
            ORT-->>Check: success -> return True
        else preload raises exception or unavailable
            ORT-->>Check: failure -> log warning (capture stdout/stderr -> debug)
            Check-->>Check: raise FileNotFoundError with expanded message
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding a fallback mechanism to preload cuDNN DLLs from Python virtual environment packages (nvidia-cudnn or Torch), which matches the core functionality implemented in the code changes.
Security Anti-Patterns ✅ Passed The pull request does not introduce security anti-patterns; code safely uses exception handling, output capture, and onnxruntime library functions without dangerous patterns.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions bot commented Mar 30, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-01 10:17 UTC

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/ort_utils.py (1)

73-79: ⚠️ Potential issue | 🟠 Major

Fallback behavior is documented but not implemented.

At Line 73, the new comment says we “try preloading from Python site-packages,” but the code still immediately logs and raises when PATH/LD_LIBRARY_PATH lookup fails. That means the PR’s stated fallback flow is not actually present.

Please implement the preload attempt in this branch (and only raise if both preload and system-path checks fail), or remove/update the comment to match actual behavior.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9bcd9873-304d-42fa-9fae-0db342773460

📥 Commits

Reviewing files that changed from the base of the PR and between a3f5c46 and ec3acd7.

📒 Files selected for processing (1)
  • modelopt/onnx/quantization/ort_utils.py

@codecov

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.21%. Comparing base (a3f5c46) to head (2430961).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/onnx/quantization/ort_utils.py 78.94% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1135      +/-   ##
==========================================
+ Coverage   70.14%   70.21%   +0.06%     
==========================================
  Files         230      230              
  Lines       26053    26098      +45     
==========================================
+ Hits        18276    18325      +49     
+ Misses       7777     7773       -4     

☔ View full report in Codecov by Sentry.

…a-cudnn-cu12 package incase the dlls don't exist in system path

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
modelopt/onnx/quantization/ort_utils.py (1)

73-87: Add targeted tests for the new fallback branch.

Please add unit tests that cover: (1) env-path miss + preload_dlls success, (2) env-path miss + preload_dlls raises, and (3) env-path miss + no preload_dlls attribute. This path is now key for CUDA/TRT EP enablement behavior.

As per coding guidelines "Maintain minimum 70% code coverage on modelopt/* modules (enforced via coverage configuration)".
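Those three cases can be sketched with unittest.mock. The fallback function below is a hypothetical stand-in that mirrors the shape of the diff, not the real _check_for_libcudnn(), so the tests run without modelopt installed:

```python
from unittest import mock


def fallback(ort) -> bool:
    # Stand-in for the fallback branch: attempt preload_dlls() only when it
    # exists on the module, and swallow any failure.
    if hasattr(ort, "preload_dlls"):
        try:
            ort.preload_dlls()
            return True
        except Exception:
            pass
    return False


def test_preload_success():
    ort = mock.Mock(spec=["preload_dlls"])
    assert fallback(ort) is True
    ort.preload_dlls.assert_called_once()


def test_preload_raises():
    ort = mock.Mock(spec=["preload_dlls"])
    ort.preload_dlls.side_effect = RuntimeError("load error")
    assert fallback(ort) is False


def test_preload_unavailable():
    # spec=[] makes hasattr(ort, "preload_dlls") return False
    assert fallback(mock.Mock(spec=[])) is False
```

Real tests would additionally patch the env-path lookup and assert on the logger calls.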

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/ort_utils.py` around lines 73 - 87, Add unit tests
for the fallback branch in modelopt/onnx/quantization/ort_utils.py that simulate
"cuDNN not found in env_variable" and exercise the three behaviors of
ort.preload_dlls: (1) mock env lookup to fail and mock ort to have preload_dlls
that returns successfully — assert the function returns True and that
logger.info is called with the preload success message, (2) mock env lookup to
fail and mock ort.preload_dlls to raise an Exception — assert the function does
not return True and that logger.warning contains the raised exception text and
the final logger.error is emitted, and (3) mock env lookup to fail and remove
preload_dlls from ort — assert the final logger.error is emitted. Use the
env_variable name, ort.preload_dlls attribute, and the logger methods
(logger.warning/info/error) from the diff to locate the code path and apply
appropriate mocks/patches in your tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 10492761-0f58-4216-9fd4-612047e410e7

📥 Commits

Reviewing files that changed from the base of the PR and between ec3acd7 and e6917aa.

📒 Files selected for processing (1)
  • modelopt/onnx/quantization/ort_utils.py

Comment on lines 75 to 94
        if hasattr(ort, "preload_dlls"):
            try:
                ort.preload_dlls()
                logger.info(
                    "onnxruntime.preload_dlls() succeeded; CUDA/cuDNN DLLs preloaded from site-packages."
                    " Please check that this is the correct version needed for your ORT version at"
                    " https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
                )
                return True
            except Exception as e:
                logger.warning(f"onnxruntime.preload_dlls() also failed: {e}")

        logger.error(f"cuDNN library not found in {env_variable} or site-packages")
        raise FileNotFoundError(
-           f"{lib_pattern} is not accessible in {env_variable}! Please make sure that the path to that library"
-           f" is in the env var to use the CUDA or TensorRT EP and ensure that the correct version is available."
-           f" Versioning compatibility can be checked at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
+           f"{lib_pattern} is not accessible in {env_variable} and onnxruntime.preload_dlls()"
+           f" could not locate it either. Please make sure that the path to that library is in the"
+           f" env var, or install the cuDNN pip package (e.g. nvidia-cudnn-cu12) to use the CUDA or"
+           f" TensorRT EP. Versioning compatibility can be checked at"
+           f" https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
        )
Contributor


⚠️ Potential issue | 🟡 Minor

Improve failure diagnostics for the preload fallback path.

At Line 89, the raised message claims onnxruntime.preload_dlls() “could not locate it either” even when preload_dlls is unavailable (Line 75 check fails). Also, at Lines 84-85 the original preload exception is dropped, which makes root-cause debugging harder.

🔧 Proposed fix
-        if hasattr(ort, "preload_dlls"):
+        preload_err = None
+        attempted_preload = hasattr(ort, "preload_dlls")
+        if attempted_preload:
             try:
                 ort.preload_dlls()
                 logger.info(
                     "onnxruntime.preload_dlls() succeeded; CUDA/cuDNN DLLs preloaded from site-packages."
                     " Please check that this is the correct version needed for your ORT version at"
                     " https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
                 )
                 return True
             except Exception as e:
+                preload_err = e
                 logger.warning(f"onnxruntime.preload_dlls() also failed: {e}")
 
         logger.error(f"cuDNN library not found in {env_variable} or site-packages")
-        raise FileNotFoundError(
-            f"{lib_pattern} is not accessible in {env_variable} and onnxruntime.preload_dlls()"
-            f" could not locate it either. Please make sure that the path to that library is in the"
-            f" env var, or install the cuDNN pip package (e.g. nvidia-cudnn-cu12) to use the CUDA or"
-            f" TensorRT EP. Versioning compatibility can be checked at"
-            f" https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
-        )
+        if attempted_preload:
+            raise FileNotFoundError(
+                f"{lib_pattern} is not accessible in {env_variable}, and onnxruntime.preload_dlls()"
+                f" could not locate it either. Please make sure that the path to that library is in the"
+                f" env var, or install the cuDNN pip package (e.g. nvidia-cudnn-cu12) to use the CUDA or"
+                f" TensorRT EP. Versioning compatibility can be checked at"
+                f" https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
+            ) from preload_err
+        raise FileNotFoundError(
+            f"{lib_pattern} is not accessible in {env_variable}, and onnxruntime.preload_dlls() is not"
+            f" available in this onnxruntime build. Please make sure that the path to that library is in"
+            f" the env var, or install the cuDNN pip package (e.g. nvidia-cudnn-cu12). Versioning compatibility"
+            f" can be checked at"
+            f" https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/ort_utils.py` around lines 75 - 94, The error
message incorrectly claims onnxruntime.preload_dlls() “could not locate it
either” even when preload_dlls isn't present, and the original preload exception
is dropped; modify the logic in the block around ort.preload_dlls so you capture
the preload exception (e.g., store it as preload_exc) when hasattr(ort,
"preload_dlls") is true and include that exception detail in the logger.error /
FileNotFoundError message, and when preload_dlls is absent, change the message
to state preload wasn't available rather than that it failed; update the raised
FileNotFoundError text (and/or logger.warning) to conditionally include the
preload_exc and the correct explanation, referencing ort.preload_dlls,
logger.warning, logger.error, env_variable, lib_pattern and the raised
FileNotFoundError.

…thon

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
@gcunhase
Contributor

@hthadicherla do we need to add nvidia-cudnn to pyproject.toml? @kevalmorabia97 for viz

@kevalmorabia97
Collaborator

I think it's not needed, as the DLLs are probably coming from the CUDA installation already.

@hthadicherla
Contributor Author

hthadicherla commented Mar 31, 2026

@hthadicherla do we need to add nvidia-cudnn to pyproject.toml? @kevalmorabia97 for viz

So if torch with CUDA enabled is installed, there is no need to install nvidia-cudnn or nvidia-cuda-runtime; the DLLs are present in the torch lib folder. In case torch CPU is installed, then yes, nvidia-cudnn needs to be installed; afaik modelopt doesn't install this automatically.

Since this is supposed to be a fallback in case cuDNN doesn't exist in the system path, I'm not sure whether we should add it to the toml file or let the user install it as needed.

@kevalmorabia97 thoughts?

@kevalmorabia97
Collaborator

Can you catch it if it's not found and print a message asking the user to install it? There are two packages, nvidia-cudnn-cu12 and nvidia-cudnn-cu13, depending on the installed CUDA version.

@hthadicherla
Contributor Author

Can you catch it if it's not found and print a message asking the user to install it? There are two packages, nvidia-cudnn-cu12 and nvidia-cudnn-cu13, depending on the installed CUDA version.

Yeah, maybe I can add that.
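A sketch of what that catch-and-prompt might look like, assuming the CUDA major version is already known; the helper name and the use of importlib.metadata here are illustrative, not the actual implementation:

```python
from importlib.metadata import PackageNotFoundError, version


def cudnn_install_hint(cuda_major: int) -> str:
    # Map the CUDA major version to the matching cuDNN wheel,
    # e.g. 12 -> nvidia-cudnn-cu12, 13 -> nvidia-cudnn-cu13.
    pkg = f"nvidia-cudnn-cu{cuda_major}"
    try:
        return f"{pkg} {version(pkg)} is already installed"
    except PackageNotFoundError:
        return f"cuDNN not found in site-packages; try: pip install {pkg}"
```

In the real code path the CUDA major version could come from the installed torch build or the driver, which is why it is left as a parameter here.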

Comment on lines +88 to +94
        logger.error(f"cuDNN library not found in {env_variable} or site-packages")
        raise FileNotFoundError(
            f"{lib_pattern} is not accessible in {env_variable} and onnxruntime.preload_dlls()"
            f" could not locate it either. Please make sure that the path to that library is in the"
            f" env var, or install the cuDNN pip package (e.g. nvidia-cudnn-cu12) to use the CUDA or"
            f" TensorRT EP. Versioning compatibility can be checked at"
            f" https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
Contributor Author


Actually, a similar comment has been added. It is just a big paragraph, though. I will break it up so that it is clear.

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
…tching mechanism

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
@hthadicherla
Contributor Author

Updated the print statement to prompt the user to install the nvidia-cudnn-cu12 or cu13 package if preload_dlls() fails to load the cuDNN .dll or .so files.

@hthadicherla hthadicherla merged commit 4c399af into main Apr 1, 2026
45 checks passed
@hthadicherla hthadicherla deleted the hthadicherla/add-cudnn-fix branch April 1, 2026 10:16