[CI][UR][CUDA] Remove NVML xfails after driver update + NVML version check for NVIDIA (new workflow) #21625
Draft
Remove xfails from 4 CUDA conformance tests that require NVML:
- SuccessThrottleReasons (UR_DEVICE_INFO_CURRENT_CLOCK_THROTTLE_REASONS)
- SuccessFanSpeed (UR_DEVICE_INFO_FAN_SPEED)
- SuccessMaxPowerLimit (UR_DEVICE_INFO_MAX_POWER_LIMIT)
- SuccessMinPowerLimit (UR_DEVICE_INFO_MIN_POWER_LIMIT)

These tests were failing with 'Driver/library version mismatch' due to incompatibility between libnvidia-ml.so in the container (550.144) and the NVIDIA driver on the CI host. After CI infrastructure update to driver version 550.144, these tests should now pass.
Add informative comments to 4 NVML-dependent tests explaining that failures due to driver/library version mismatch require updating the NVIDIA driver on CI host to match container's NVML version.

Tests updated:
- SuccessThrottleReasons
- SuccessFanSpeed
- SuccessMaxPowerLimit
- SuccessMinPowerLimit
3132fb0 to 3861429
3861429 to 27f0aba
- Update NVML test comments to be more concise
- Add early NVML version check in CI workflow
- Check runs before tests, failing fast with clear error message
- Use GitHub Actions annotations for better visibility
- Display both host driver and container library versions

This helps identify driver/library version mismatches immediately rather than waiting for tests to fail with unclear errors.
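The fail-fast reporting described in this commit could look roughly like the following. This is a minimal sketch rather than the workflow step from this PR: the function names are hypothetical, the two version strings are assumed to have been collected already, and the only GitHub-specific part is the `::error` workflow command that turns a log line into an annotation.

```python
import sys


def format_mismatch_annotation(host_driver: str, container_lib: str) -> str:
    """Build a single-line GitHub Actions error annotation naming both versions."""
    return (
        "::error title=NVML version mismatch::"
        f"host driver {host_driver} is incompatible with "
        f"container libnvidia-ml {container_lib}"
    )


def report(host_driver: str, container_lib: str, compatible: bool) -> int:
    """Print both versions and return the exit code the CI step should use."""
    if compatible:
        print(f"NVML OK: host driver {host_driver}, container library {container_lib}")
        return 0
    # A non-zero exit code stops the job before any test is launched.
    print(format_mismatch_annotation(host_driver, container_lib))
    return 1


if __name__ == "__main__":
    # Hypothetical invocation: check.py <host_driver> <container_lib> <ok|fail>
    if len(sys.argv) == 4:
        sys.exit(report(sys.argv[1], sys.argv[2], sys.argv[3] == "ok"))
```

Returning the exit code from `report` instead of calling `sys.exit` inside it keeps the function easy to exercise outside CI.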
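Remove xfails from 4 CUDA conformance tests that require NVML:
- SuccessThrottleReasons (UR_DEVICE_INFO_CURRENT_CLOCK_THROTTLE_REASONS)
- SuccessFanSpeed (UR_DEVICE_INFO_FAN_SPEED)
- SuccessMaxPowerLimit (UR_DEVICE_INFO_MAX_POWER_LIMIT)
- SuccessMinPowerLimit (UR_DEVICE_INFO_MIN_POWER_LIMIT)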
These tests were failing with 'Driver/library version mismatch' due to incompatibility between libnvidia-ml.so in the container (550.144) and the NVIDIA driver on the CI host.
Add Early NVML Version Check in CI
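A new workflow step validates driver/library compatibility before the tests run, failing fast with a clear error message that shows both the host driver and container library versions.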
NVML Version Compatibility Rules
Per NVIDIA NVML API documentation:
✅ Driver 550.90.07 + libNVML 550.90.07 (exact match)
✅ Driver 550.90.07 + libNVML 550.54.15 (older library minor version)
❌ Driver 550.90.07 + libNVML 565.57.01 (different major version)
❌ Driver 550.54.15 + libNVML 550.90.07 (newer library minor version)
The check uses nvidia-smi to validate compatibility, which implements NVIDIA's official version checking logic.
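The four cases listed above can be expressed as a small comparison function. This is an illustrative sketch of the rule as this description states it (same major version, library no newer than the driver), not the logic the workflow runs, since the workflow relies on nvidia-smi. The names `parse_version` and `is_compatible` are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '550.90.07' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def is_compatible(driver: str, library: str) -> bool:
    """Apply the rule stated above: majors must match, library must not be newer."""
    driver_parts = parse_version(driver)
    library_parts = parse_version(library)
    same_major = driver_parts[0] == library_parts[0]
    # Tuples compare component by component, so (54, 15) <= (90, 7) holds.
    library_not_newer = library_parts[1:] <= driver_parts[1:]
    return same_major and library_not_newer


if __name__ == "__main__":
    cases = [
        ("550.90.07", "550.90.07"),
        ("550.90.07", "550.54.15"),
        ("550.90.07", "565.57.01"),
        ("550.54.15", "550.90.07"),
    ]
    for driver, library in cases:
        verdict = "compatible" if is_compatible(driver, library) else "mismatch"
        print(f"Driver {driver} + libNVML {library}: {verdict}")
```

Comparing integer tuples rather than raw strings avoids lexical ordering mistakes, for example treating "90" as smaller than "144".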