Added fallback to preload cuDNN DLLs from nvidia cudnn venv package or torch venv package #1135
Conversation
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration used: Path: .coderabbit.yaml | Review profile: CHILL | Plan: Pro | Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)

📝 Walkthrough: Updated Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Check as _check_for_libcudnn()
    participant Env as System Loader Env (PATH/LD_LIBRARY_PATH)
    participant ORT as onnxruntime.preload_dlls()
    Check->>Env: search for cuDNN library pattern
    alt found in Env
        Env-->>Check: pattern match -> return True
    else not found
        Env-->>Check: no match -> log warning
        Check->>ORT: if available and py != 3.10, call preload_dlls()
        alt preload succeeds
            ORT-->>Check: success -> return True
        else preload raises exception or unavailable
            ORT-->>Check: failure -> log warning (capture stdout/stderr -> debug)
            Check-->>Check: raise FileNotFoundError with expanded message
        end
    end
```
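The first step in the diagram, scanning the loader env var for a cuDNN file pattern, can be sketched roughly as follows. This is an illustrative stand-in, not the PR's actual code: the function name and the demo env var are made up.

```python
import fnmatch
import os
import tempfile

def find_lib_in_env(env_variable: str, lib_pattern: str):
    """Return the first file matching lib_pattern in any directory listed
    in the given env var, or None if nothing matches (illustrative sketch)."""
    for directory in os.environ.get(env_variable, "").split(os.pathsep):
        if not os.path.isdir(directory):
            continue
        for name in os.listdir(directory):
            if fnmatch.fnmatch(name, lib_pattern):
                return os.path.join(directory, name)
    return None

# Tiny demo: a temp dir with a fake library stands in for a real library path.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "libcudnn.so.9"), "w").close()
    os.environ["DEMO_LIB_PATH"] = d
    print(find_lib_in_env("DEMO_LIB_PATH", "libcudnn.so*"))
```

When this search returns None, the diagram falls through to the `preload_dlls()` branch.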
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (3 passed)
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/ort_utils.py (1)
Lines 73-79: ⚠️ Potential issue | 🟠 Major: Fallback behavior is documented but not implemented.
At Line 73, the new comment says we “try preloading from Python site-packages,” but the code still immediately logs and raises when PATH/LD_LIBRARY_PATH lookup fails. That means the PR’s stated fallback flow is not actually present.
Please implement the preload attempt in this branch (and only raise if both preload and system-path checks fail), or remove/update the comment to match actual behavior.
Codecov Report: ❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1135      +/-   ##
==========================================
+ Coverage   70.14%   70.21%   +0.06%
==========================================
  Files         230      230
  Lines       26053    26098      +45
==========================================
+ Hits        18276    18325      +49
+ Misses       7777     7773       -4
```

☔ View full report in Codecov by Sentry.
…a-cudnn-cu12 package incase the dlls don't exist in system path Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
modelopt/onnx/quantization/ort_utils.py (1)
Lines 73-87: Add targeted tests for the new fallback branch. Please add unit tests that cover: (1) env-path miss + `preload_dlls` success, (2) env-path miss + `preload_dlls` raises, and (3) env-path miss + no `preload_dlls` attribute. This path is now key for CUDA/TRT EP enablement behavior. As per coding guidelines: "Maintain minimum 70% code coverage on modelopt/* modules (enforced via coverage configuration)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/ort_utils.py` around lines 73 - 87, Add unit tests for the fallback branch in modelopt/onnx/quantization/ort_utils.py that simulate "cuDNN not found in env_variable" and exercise the three behaviors of ort.preload_dlls: (1) mock env lookup to fail and mock ort to have preload_dlls that returns successfully — assert the function returns True and that logger.info is called with the preload success message, (2) mock env lookup to fail and mock ort.preload_dlls to raise an Exception — assert the function does not return True and that logger.warning contains the raised exception text and the final logger.error is emitted, and (3) mock env lookup to fail and remove preload_dlls from ort — assert the final logger.error is emitted. Use the env_variable name, ort.preload_dlls attribute, and the logger methods (logger.warning/info/error) from the diff to locate the code path and apply appropriate mocks/patches in your tests.
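The three suggested cases can be exercised without onnxruntime installed by faking the `ort` object. The function below is a simplified, self-contained stand-in for the fallback branch under review; its name, signature, and messages are assumptions based on the diff, not the actual modelopt code:

```python
import logging
import types

logger = logging.getLogger("ort_utils_demo")

def check_for_libcudnn(ort, found_in_env, env_variable="LD_LIBRARY_PATH",
                       lib_pattern="libcudnn.so*"):
    """Simplified stand-in for the fallback logic (hypothetical names)."""
    if found_in_env:
        return True
    if hasattr(ort, "preload_dlls"):
        try:
            ort.preload_dlls()
            return True
        except Exception as e:
            logger.warning("onnxruntime.preload_dlls() also failed: %s", e)
    raise FileNotFoundError(
        f"{lib_pattern} is not accessible in {env_variable} and "
        "onnxruntime.preload_dlls() could not locate it either."
    )

# (1) env-path miss + preload succeeds -> True
assert check_for_libcudnn(types.SimpleNamespace(preload_dlls=lambda: None),
                          found_in_env=False) is True

# (2) env-path miss + preload raises -> FileNotFoundError
def boom():
    raise OSError("wrong cuDNN version")
try:
    check_for_libcudnn(types.SimpleNamespace(preload_dlls=boom), found_in_env=False)
    raised = False
except FileNotFoundError:
    raised = True
assert raised

# (3) env-path miss + no preload_dlls attribute -> FileNotFoundError
try:
    check_for_libcudnn(types.SimpleNamespace(), found_in_env=False)
    raised = False
except FileNotFoundError:
    raised = True
assert raised
```

In the real tests, the same three cases would be driven with `unittest.mock.patch` against the actual `ort_utils` module rather than a stand-in.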
```diff
+if hasattr(ort, "preload_dlls"):
+    try:
+        ort.preload_dlls()
+        logger.info(
+            "onnxruntime.preload_dlls() succeeded; CUDA/cuDNN DLLs preloaded from site-packages."
+            " Please check that this is the correct version needed for your ORT version at"
+            " https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
+        )
+        return True
+    except Exception as e:
+        logger.warning(f"onnxruntime.preload_dlls() also failed: {e}")
+
+logger.error(f"cuDNN library not found in {env_variable} or site-packages")
 raise FileNotFoundError(
-    f"{lib_pattern} is not accessible in {env_variable}! Please make sure that the path to that library"
-    f" is in the env var to use the CUDA or TensorRT EP and ensure that the correct version is available."
-    f" Versioning compatibility can be checked at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
+    f"{lib_pattern} is not accessible in {env_variable} and onnxruntime.preload_dlls()"
+    f" could not locate it either. Please make sure that the path to that library is in the"
+    f" env var, or install the cuDNN pip package (e.g. nvidia-cudnn-cu12) to use the CUDA or"
+    f" TensorRT EP. Versioning compatibility can be checked at"
+    f" https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
 )
```
Improve failure diagnostics for the preload fallback path.
At Line 89, the raised message claims onnxruntime.preload_dlls() “could not locate it either” even when preload_dlls is unavailable (Line 75 check fails). Also, at Lines 84-85 the original preload exception is dropped, which makes root-cause debugging harder.
🔧 Proposed fix

```diff
-        if hasattr(ort, "preload_dlls"):
+        preload_err = None
+        attempted_preload = hasattr(ort, "preload_dlls")
+        if attempted_preload:
             try:
                 ort.preload_dlls()
                 logger.info(
                     "onnxruntime.preload_dlls() succeeded; CUDA/cuDNN DLLs preloaded from site-packages."
                     " Please check that this is the correct version needed for your ORT version at"
                     " https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
                 )
                 return True
             except Exception as e:
+                preload_err = e
                 logger.warning(f"onnxruntime.preload_dlls() also failed: {e}")
         logger.error(f"cuDNN library not found in {env_variable} or site-packages")
-        raise FileNotFoundError(
-            f"{lib_pattern} is not accessible in {env_variable} and onnxruntime.preload_dlls()"
-            f" could not locate it either. Please make sure that the path to that library is in the"
-            f" env var, or install the cuDNN pip package (e.g. nvidia-cudnn-cu12) to use the CUDA or"
-            f" TensorRT EP. Versioning compatibility can be checked at"
-            f" https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
-        )
+        if attempted_preload:
+            raise FileNotFoundError(
+                f"{lib_pattern} is not accessible in {env_variable}, and onnxruntime.preload_dlls()"
+                f" could not locate it either. Please make sure that the path to that library is in the"
+                f" env var, or install the cuDNN pip package (e.g. nvidia-cudnn-cu12) to use the CUDA or"
+                f" TensorRT EP. Versioning compatibility can be checked at"
+                f" https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
+            ) from preload_err
+        raise FileNotFoundError(
+            f"{lib_pattern} is not accessible in {env_variable}, and onnxruntime.preload_dlls() is not"
+            f" available in this onnxruntime build. Please make sure that the path to that library is in"
+            f" the env var, or install the cuDNN pip package (e.g. nvidia-cudnn-cu12). Versioning compatibility"
+            f" can be checked at"
+            f" https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/ort_utils.py` around lines 75 - 94, The error
message incorrectly claims onnxruntime.preload_dlls() “could not locate it
either” even when preload_dlls isn't present, and the original preload exception
is dropped; modify the logic in the block around ort.preload_dlls so you capture
the preload exception (e.g., store it as preload_exc) when hasattr(ort,
"preload_dlls") is true and include that exception detail in the logger.error /
FileNotFoundError message, and when preload_dlls is absent, change the message
to state preload wasn't available rather than that it failed; update the raised
FileNotFoundError text (and/or logger.warning) to conditionally include the
preload_exc and the correct explanation, referencing ort.preload_dlls,
logger.warning, logger.error, env_variable, lib_pattern and the raised
FileNotFoundError.
…thon Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
@hthadicherla do we need to add
I think it's not needed, as the DLLs are probably coming from the CUDA installation already.
So if torch with CUDA enabled is installed, there is no need to install nvidia-cudnn or nvidia-cuda-runtime; the DLLs are present in the torch lib folder. In case torch CPU is installed, then yes, nvidia-cudnn needs to be installed; afaik modelopt doesn't install this automatically. Since this is supposed to be a fallback in case cuDNN doesn't exist in the system path, I'm not sure whether we should add it in the toml file or let the user install it as needed. @kevalmorabia97 thoughts?
Can you catch if it's not found and print a message asking the user to install it? There are 2 packages, nvidia-cudnn-cu12 and nvidia-cudnn-cu13, depending on the installed CUDA.
Yeah, maybe I can add that.
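One possible shape for that check: probe the installed distributions and emit a version-appropriate install hint. This is a sketch, not the PR's code; the function name is hypothetical, and it only inspects installed wheels rather than the actual CUDA toolkit version.

```python
from importlib import metadata

def cudnn_install_hint() -> str:
    """Return an install hint if no nvidia-cudnn wheel is present (sketch)."""
    for pkg in ("nvidia-cudnn-cu12", "nvidia-cudnn-cu13"):
        try:
            metadata.version(pkg)
            return f"{pkg} is already installed"
        except metadata.PackageNotFoundError:
            continue
    return (
        "cuDNN wheel not found; install nvidia-cudnn-cu12 (CUDA 12) or "
        "nvidia-cudnn-cu13 (CUDA 13) to match your installed CUDA toolkit"
    )

print(cudnn_install_hint())
```

In the actual error path this string would be appended to the `FileNotFoundError` message rather than printed.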
```diff
+logger.error(f"cuDNN library not found in {env_variable} or site-packages")
 raise FileNotFoundError(
-    f"{lib_pattern} is not accessible in {env_variable}! Please make sure that the path to that library"
-    f" is in the env var to use the CUDA or TensorRT EP and ensure that the correct version is available."
-    f" Versioning compatibility can be checked at https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
+    f"{lib_pattern} is not accessible in {env_variable} and onnxruntime.preload_dlls()"
+    f" could not locate it either. Please make sure that the path to that library is in the"
+    f" env var, or install the cuDNN pip package (e.g. nvidia-cudnn-cu12) to use the CUDA or"
+    f" TensorRT EP. Versioning compatibility can be checked at"
+    f" https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements."
```
Actually, a similar comment has been added. It is just a big paragraph though; I will deobfuscate it so that it is clear.
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
…tching mechanism Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Updated the print statement to prompt the user to install the nvidia-cudnn-cu12 or -cu13 package if preload_dlls() fails to load the cuDNN .dll or .so files.
What does this PR do?
Type of change: Bug fix
A QA team testing the modelopt 0.43 release pointed out that we could install the nvidia-cudnn PyPI packages and use ort.preload_dlls() to load the DLLs from the Python venv, instead of searching only the system path.
Here is the info about onnxruntime.preload_dlls() function

So a fallback was added to the system-path cuDNN search: preload the DLLs via onnxruntime.preload_dlls(), and if that also fails, raise an exception.
Testing
Tested quantization by installing the nvidia-cudnn-cu12 package and removing the cuDNN DLLs from the system path. Working as expected.