Enhance XPU support for benchmarks, profiling, and verification by gurwinderintel · Pull Request #11 · RightNow-AI/autokernel

gurwinderintel · 2026-05-04T10:40:35Z

This pull request adds support for Intel XPU devices (such as Intel GPUs) to the bench.py benchmark harness, making it more device-agnostic and improving its robustness across different hardware backends. The code now automatically detects and uses XPU if CUDA is unavailable, and adapts device-specific operations, error handling, and profiling accordingly.

Device abstraction and detection:

Introduced device-agnostic helpers (_USE_XPU, _DEFAULT_DEVICE, _sync_device, _empty_cache, _reset_peak_memory_stats, _max_memory_allocated, _event, and _OOM_ERROR) that wrap the CUDA and XPU backends for synchronization, memory management, and event timing. All device-specific calls are now routed through these helpers.
Updated device detection logic in detect_gpu() and related code to read XPU device properties and to fall back when neither CUDA nor XPU is available.

Correctness and tolerance handling:

Added support for XPU-specific numerical tolerances (xpu_tolerances) in the benchmark configuration, with logic to override defaults when running on XPU.
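The override could look roughly like the sketch below. Only the xpu_tolerances key comes from the PR description; the config layout (a "tolerances" dict keyed by dtype name with atol/rtol entries), the resolve_tolerances name, and every numeric value except the 0.1 bfloat16 tolerance mentioned in the commit log are hypothetical.

```python
import copy


def resolve_tolerances(config: dict, use_xpu: bool) -> dict:
    """Return per-dtype tolerances, applying XPU overrides when requested.

    The input config is never mutated; overrides are merged key by key so
    an XPU entry may change atol without restating rtol.
    """
    resolved = copy.deepcopy(config.get("tolerances", {}))
    if use_xpu:
        for dtype, override in config.get("xpu_tolerances", {}).items():
            resolved.setdefault(dtype, {}).update(override)
    return resolved


# Hypothetical benchmark configuration.
CONFIG = {
    "tolerances": {
        "float32": {"atol": 1e-4, "rtol": 1e-4},
        "bfloat16": {"atol": 0.02, "rtol": 0.02},
    },
    "xpu_tolerances": {
        "bfloat16": {"atol": 0.1},
    },
}

cuda_tol = resolve_tolerances(CONFIG, use_xpu=False)
xpu_tol = resolve_tolerances(CONFIG, use_xpu=True)
```

Merging per key rather than replacing the whole dtype entry keeps the CUDA defaults as the single source of truth for anything the XPU section does not mention.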

Profiling and performance measurement:

Modified profiling logic to use XPU-specific profiler activities if available, with a fallback and warning if not, and generalized device selection for profiling and performance runs.
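One way to express that selection is sketched here. The function name and the choice to pass the activity enum in as a parameter are assumptions made so the logic can be exercised without PyTorch installed; in the harness the enum would presumably be torch.profiler.ProfilerActivity, which exposes an XPU member only in newer PyTorch releases.

```python
import warnings


def select_profiler_activities(device: str, activity_enum) -> list:
    """Pick profiler activities for the given device string.

    CPU activity is always recorded. If the device is "xpu" but the
    installed profiler has no XPU activity, warn and profile CPU only.
    """
    activities = [activity_enum.CPU]
    if device == "cuda":
        activities.append(activity_enum.CUDA)
    elif device == "xpu":
        xpu_activity = getattr(activity_enum, "XPU", None)
        if xpu_activity is None:
            warnings.warn(
                "ProfilerActivity.XPU is unavailable; profiling CPU activity only"
            )
        else:
            activities.append(xpu_activity)
    return activities
```

With PyTorch installed, a call would look like `select_profiler_activities("xpu", torch.profiler.ProfilerActivity)`, and the result is passed as the `activities` argument of `torch.profiler.profile`.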

General improvements:

Replaced hardcoded "cuda" device strings with _DEFAULT_DEVICE throughout the codebase for device-agnostic operation.
Updated the top-of-file comment to reflect that the harness is no longer CUDA-only.

- Enable XPU device detection and synchronization across runtime paths
- Update benchmark timing logic for XPU compatibility
- Extend verifier to run on XPU with large-output-safe comparison
- Add/align XPU-compatible kernel entry checks in matmul and softmax
- Keep CUDA behavior intact while adding XPU execution parity
- Touches: bench.py, kernel.py, kernels/matmul.py, kernels/softmax.py, prepare.py, profile.py, verify.py

- Remove dead return in verify softmax replacement path
- Document official Intel XPU support in README requirements and changelog

… range

- Device-agnostic profiler activity selection
- Robust NaN/Inf correctness comparison
- Platform-aware tolerance tuning (fused_mlp bf16)
- Backward compatible with CUDA-only paths
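The NaN/Inf-aware comparison mentioned above is not shown on this page, so the following is only an illustrative sketch of the general technique on plain Python floats rather than tensors. Treating NaNs at matching positions as equal, requiring infinities to match in sign, and the default tolerances are all assumptions; the function name is hypothetical.

```python
import math
from typing import Sequence


def outputs_match(
    actual: Sequence[float],
    expected: Sequence[float],
    atol: float = 1e-5,
    rtol: float = 1e-5,
) -> bool:
    """Elementwise comparison that handles NaN and Inf explicitly.

    - NaN matches only NaN at the same position.
    - Inf matches only Inf of the same sign.
    - Finite values must satisfy |a - e| <= atol + rtol * |e|.
    """
    if len(actual) != len(expected):
        return False
    for a, e in zip(actual, expected):
        if math.isnan(a) or math.isnan(e):
            if not (math.isnan(a) and math.isnan(e)):
                return False
        elif math.isinf(a) or math.isinf(e):
            if a != e:
                return False
        elif abs(a - e) > atol + rtol * abs(e):
            return False
    return True
```

Handling the non-finite cases first matters because a plain `abs(a - e) <= tol` test is always false when either side is NaN and is undefined (NaN) for `inf - inf`, which would wrongly flag outputs that agree.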

gurwinderintel added 7 commits April 14, 2026 01:48

Update prepare.py

5ffcb5a

Polish XPU upstream patch and docs

710636c

exp 3: num_warps=8 for reduce kernel on XPU for more parallelism

f33edfd

fused_mlp baseline: relax bfloat16 tolerance to 0.1 for XPU numerical…

b6aa861

Unified OOM exception handling for CUDA/XPU

51e3147

Revert README to pre-XPU docs version

dc82327

gurwinderintel wants to merge 7 commits into RightNow-AI:main from gurwinderintel:main