[CI][UR][CUDA] Remove NVML xfails after driver update + NVML version check for NVIDIA (new workflow) #21625

Draft
kekaczma wants to merge 3 commits into sycl from fix-cuda-nvml-xfails

@kekaczma kekaczma commented Mar 25, 2026

Remove xfails from 4 CUDA conformance tests that require NVML:

  • SuccessThrottleReasons (UR_DEVICE_INFO_CURRENT_CLOCK_THROTTLE_REASONS)
  • SuccessFanSpeed (UR_DEVICE_INFO_FAN_SPEED)
  • SuccessMaxPowerLimit (UR_DEVICE_INFO_MAX_POWER_LIMIT)
  • SuccessMinPowerLimit (UR_DEVICE_INFO_MIN_POWER_LIMIT)

These tests were failing with 'Driver/library version mismatch' because libnvidia-ml.so in the container (550.144) was incompatible with the NVIDIA driver on the CI host.

Add Early NVML Version Check in CI
A new workflow step validates compatibility before the tests run:

  • Detects the host driver version via nvidia-smi
  • Detects the container NVML library version from libnvidia-ml.so.1
  • Tests compatibility by running nvidia-smi from the container
  • Fails fast with a clear error message if the versions are incompatible
  • Uses GitHub Actions error annotations for high visibility
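
The detection and reporting steps listed above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the workflow step added by this PR: the helper names (`host_driver_version`, `library_version`, `error_annotation`) are hypothetical, the location of libnvidia-ml.so.1 varies between images and has to be passed in, and the sketch assumes the symlink resolves to a file whose name carries the version suffix.

```python
import os
import re
import subprocess


def host_driver_version():
    """Ask nvidia-smi for the driver version reported for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if nvidia-smi itself fails
    )
    return result.stdout.splitlines()[0].strip()


def library_version(path):
    """Read the NVML library version from the file that `path` resolves to.

    Assumes `path` (for example a libnvidia-ml.so.1 symlink) points at a file
    named like libnvidia-ml.so.550.144. Returns None when no such version
    suffix can be recognised.
    """
    real_name = os.path.basename(os.path.realpath(path))
    match = re.fullmatch(r"libnvidia-ml\.so\.(\d+\.\d+(?:\.\d+)*)", real_name)
    return match.group(1) if match else None


def error_annotation(title, message):
    """Format a GitHub Actions workflow command that raises an error annotation."""
    return f"::error title={title}::{message}"
```

A CI step built on these helpers would print the annotation to standard output and exit with a non-zero status when the two versions are judged incompatible, so the job stops before any conformance test runs.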

NVML Version Compatibility Rules
Per NVIDIA NVML API documentation:

  • Major version must match: Driver 550.x requires libNVML 550.x
  • Library version ≤ Driver version: Library cannot be newer than driver
  • Different major versions always fail: Driver 550.x + libNVML 565.x = mismatch
  • Examples:

    ✅ Driver 550.90.07 + libNVML 550.90.07 (exact match)
    ✅ Driver 550.90.07 + libNVML 550.54.15 (older library minor version)
    ❌ Driver 550.90.07 + libNVML 565.57.01 (different major version)
    ❌ Driver 550.54.15 + libNVML 550.90.07 (newer library minor version)

The check validates compatibility by running nvidia-smi, which implements NVIDIA's official version checking logic.
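
The rules as stated in this description can be expressed as a small predicate. This is a minimal sketch of the rules as written here, not the logic that nvidia-smi itself applies; the function names `parse_version` and `nvml_compatible` are hypothetical and do not appear in the PR.

```python
def parse_version(text):
    """Split a dotted version such as '550.90.07' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def nvml_compatible(driver, library):
    """Apply the two rules from the description to a driver/library pair.

    1. The major versions must match.
    2. The library must not be newer than the driver.
    """
    drv = parse_version(driver)
    lib = parse_version(library)
    same_major = drv[0] == lib[0]
    # Tuples compare element by element, so (550, 54, 15) <= (550, 90, 7).
    return same_major and lib <= drv


# The four examples from the description.
examples = [
    ("550.90.07", "550.90.07"),
    ("550.90.07", "550.54.15"),
    ("550.90.07", "565.57.01"),
    ("550.54.15", "550.90.07"),
]
results = [nvml_compatible(driver, library) for driver, library in examples]
```

For the four example pairs, `results` comes out as `[True, True, False, False]`, matching the ✅/❌ markers in the list.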

Remove xfails from 4 CUDA conformance tests that require NVML:
- SuccessThrottleReasons (UR_DEVICE_INFO_CURRENT_CLOCK_THROTTLE_REASONS)
- SuccessFanSpeed (UR_DEVICE_INFO_FAN_SPEED)
- SuccessMaxPowerLimit (UR_DEVICE_INFO_MAX_POWER_LIMIT)
- SuccessMinPowerLimit (UR_DEVICE_INFO_MIN_POWER_LIMIT)

These tests were failing with 'Driver/library version mismatch' due to
incompatibility between libnvidia-ml.so in the container (550.144) and
the NVIDIA driver on the CI host.

After the CI infrastructure update to driver version 550.144, these tests
should now pass.

Add informative comments to 4 NVML-dependent tests explaining that
failures due to a driver/library version mismatch require updating
the NVIDIA driver on the CI host to match the container's NVML version.

Tests updated:
- SuccessThrottleReasons
- SuccessFanSpeed
- SuccessMaxPowerLimit
- SuccessMinPowerLimit
@kekaczma kekaczma changed the title from "[UR][CUDA] Remove NVML xfails after driver update" to "[CI][UR][CUDA] Remove NVML xfails after driver update + NVML version check for NVIDIA (new workflow)" Mar 31, 2026
@kekaczma kekaczma marked this pull request as ready for review March 31, 2026 12:29
@kekaczma kekaczma requested review from a team as code owners March 31, 2026 12:29
@kekaczma kekaczma force-pushed the fix-cuda-nvml-xfails branch 2 times, most recently from 3132fb0 to 3861429 Compare March 31, 2026 13:53
@kekaczma kekaczma marked this pull request as draft March 31, 2026 13:57
@kekaczma kekaczma force-pushed the fix-cuda-nvml-xfails branch from 3861429 to 27f0aba Compare March 31, 2026 15:38
- Update NVML test comments to be more concise
- Add early NVML version check in CI workflow
- Check runs before tests, failing fast with clear error message
- Use GitHub Actions annotations for better visibility
- Display both host driver and container library versions

This helps identify driver/library version mismatches immediately
rather than waiting for tests to fail with unclear errors.