
[proof-of-concept] NVFP4 GEMM via BF16 dequant #518

Draft

matthiasdiener wants to merge 124 commits into mdiener/nvfp4-recipe from mdiener/nvfp4-gemm

Conversation

@matthiasdiener (Contributor) commented Apr 2, 2026

Description

Based on #517.

Fixes # (issue)
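The approach named in the title — dequantize NVFP4 operands to a wider dtype, then run a standard GEMM — can be sketched roughly as below. This is a minimal NumPy sketch under stated assumptions: float32 stands in for BF16 (NumPy has no native bfloat16), values are shown unpacked one code per byte, and the 16-element block-scale layout and all function names are illustrative, not Transformer Engine's actual API.

```python
import numpy as np

# FP4 E2M1 code points: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# The 8 representable non-negative magnitudes:
FP4_E2M1_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32
)

def dequantize_nvfp4(codes, block_scales, block_size=16):
    """Dequantize unpacked FP4 codes to float32 (BF16 stand-in).

    codes: (M, K) uint8, low 3 bits = magnitude index, bit 3 = sign.
    block_scales: (M, K // block_size) per-block scale factors.
    """
    mag = FP4_E2M1_VALUES[codes & 0x7]
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    # Broadcast each block's scale over its block_size contiguous elements.
    scales = np.repeat(block_scales, block_size, axis=1)
    return sign * mag * scales

def nvfp4_gemm_via_dequant(a_codes, a_scales, b_codes, b_scales):
    """Compute A @ B.T after dequantizing both quantized operands."""
    a = dequantize_nvfp4(a_codes, a_scales)
    b = dequantize_nvfp4(b_codes, b_scales)
    return a @ b.T
```

The point of the proof of concept is that a dequant-then-BF16-GEMM path gives a numerically transparent reference against which a native NVFP4 GEMM can be validated.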

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 30 commits November 14, 2025 17:05
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* jax quickstart guide first commit

Signed-off-by: tdophung <tdophung@nvidia.com>

* edit the syntax errors and remove unnecessary comments in utils. Add some footnotes in the quick start notebook

Signed-off-by: tdophung <tdophung@nvidia.com>

* Fix greptile's comments on spelling, deepcopy, vjp function signature compatibility with speedometer

Signed-off-by: tdophung <tdophung@nvidia.com>

* Add Copyright to utils and fix some more greptiles complaints

Signed-off-by: tdophung <tdophung@nvidia.com>

* Add comments to alternative of layers

Signed-off-by: tdophung <tdophung@nvidia.com>

* Remove weight sharing between different iterations of the transformerLayer

Signed-off-by: tdophung <tdophung@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: tdophung <tdophung@nvidia.com>

* Add enum for attention implementations. Fix inconsistency between fused and unfused TE impls to achieve the same performance (removing extra dropout layer in fused layers). Also some minor wording changes

Signed-off-by: tdophung <tdophung@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug in TransformerLayer expected input shape being [sequence, batch, ...] instead of [batch, sequence,...]

Signed-off-by: tdophung <tdophung@nvidia.com>

* Change notebook structure to bring FP8 ahead of fusion, allowing fusion to take effect because quantization exists, as suggested. Also bring TransformerLayer perf closer to Fused by setting hidden_dropout=0

Signed-off-by: tdophung <tdophung@nvidia.com>

* add option to choose between different attention implementations in the call of BasicTETransformerLayer and demonstrate the runtime difference between flax's and TE's attention implementations

Signed-off-by: tdophung <tdophung@nvidia.com>

* Fix missing attention_implementation in FuseTETransformerLayer

Signed-off-by: tdophung <tdophung@nvidia.com>

* Removing AttentionWrapper and custom built DPA, using flax and TE's impl only, removing last mention of Pytorch

Signed-off-by: tdophung <tdophung@nvidia.com>

* More changing to markdowns to remove pytorch

Signed-off-by: tdophung <tdophung@nvidia.com>

* cosmetics fixes

Signed-off-by: tdophung <tdophung@nvidia.com>

* changing names of all implementations

Signed-off-by: tdophung <tdophung@nvidia.com>

* change fp8_autocast to autocast, make causal mask, and some wording changes

Signed-off-by: tdophung <tdophung@nvidia.com>

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: tdophung <tdophung@dc2-container-xterm-034.prd.it.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Initial changes to remove pytorch overheads

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Enable reference current scaling recipe

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* minor

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* linter

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Test ref vs native

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* [Common] Deleted unused header (#2324)

Deleted unused header

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [JAX] L1_jax_distributed_test suit with individual executions (#2321)

* L1 rework

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* comment out test_multi_process_grouped_gemm for now

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* rm e5m2 from test norm + MXFP8

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* for branch

Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* clean up and tests

Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* change tests

Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [PyTorch debug] Fixes to debug tests failures (#2268)

* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix:

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [PyTorch Debug] Add max_blockwise_dynamic_range stats (#2137)

* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [JAX] Fix bug with pre scale bias  (#2300)

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [JAX] Try to use pre-downloaded dataset artifacts first (#2345)

* Try to use pre-downloaded dataset artifacts first

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Set HF_HUB_OFFLINE to disable any network calls to HF when the
pre-downloaded dataset is available

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* Fix out of bounds access in the FP4 dequantize kernel (#2346)

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* Make FP8 weights compatible with older MCore version (#2342)

* Make cast_master_weights_to_fp8 compatible with older MCore version

Signed-off-by: kunlunl <kunlunl@nvidia.com>

* Rename keep_columnwise to manual_post_all_gather_processing & Optimize unit test

Signed-off-by: kunlunl <kunlunl@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove redundant _test_mini_optimizer()

Signed-off-by: kunlunl <kunlunl@nvidia.com>

---------

Signed-off-by: kunlunl <kunlunl@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [JAX] Add test to check jaxpr that amax is reused for nvfp4 recipe (#2348)

* Add test to check jaxpr that amax is reused for nvfp4 recipe

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Move test to test_helper.py and rename file

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* Fix sharding of segment position to match id in ring attention. (#2349)

Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* Disable cuDNN attention for known IMA and NaNs (#2344)

* Fix cuDNN backend selection for more cases. Add CG as an option as well

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix logic

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix cuDNN checks

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add more checks

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix cuDNN version

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix error message

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add check for window size

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [JAX] Default to fused attention in JAX DPA (#2363)

* Default to fused attention in JAX DPA

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Consolidate documentation for DPA in JAX

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>

* Correctly update the documentation for defaults in JAX DPA

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* Update cudnn frontend to v1.16.0 (#2362)

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [common] Remove kvpacked and qkvpacked attention functions for every kernel type. (#2287)

* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* deprecated compile-time warning + \warning -> \deprecated

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* Move Triton to common  (#2359)

* move triton to common and change paths

Signed-off-by: tdophung <tdophung@nvidia.com>

* Formatting

Signed-off-by: tdophung <tdophung@nvidia.com>

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [JAX] Fused layers argument default values changed (#2347)

* Change default activations in MLP and TransformerLayer, the dropout rate after FC1 to 0, and return_layernorm_output to False

Signed-off-by: tdophung <tdophung@nvidia.com>

* Fix the failing tests by hard-coding arguments to the previous values instead of relying on newer default values

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* remove comment from gpt

Signed-off-by: Peter Dykas <wdykas@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor changes for num_splits logic

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace None with 1 as default

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix last commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix docstring

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix dtype in pack/unpack when FP8

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add fused_attn_supported constraint for some tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FA3 installation commands

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FA3 installation commands in DPA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* separate fused fp8 and f16 flags in tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* initialize fused_attn_supported_f16

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix FA installation in L3 tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: kunlunl <kunlunl@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: root <root@gpu-h100-0496.cm.cluster>
Co-authored-by: Peter Dykas <wdykas@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
Co-authored-by: Kunlun Li <94586211+kunlunl@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Michael Goldfarb <mgoldfarb@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Teddy Do <tdophung@nvidia.com>
Co-authored-by: wdykas <73254672+wdykas@users.noreply.github.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removed packed versions

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* jax

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix:

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* softmax_fusion -> softmax_fusion_type

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…tation (#2394)

Signed-off-by: tdophung <tdophung@nvidia.com>
* Cache device tensors properly

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix annotation and add test

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* skip nvfp4 test if not supported

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…LP with checkpoint flag (#2311)

* custom tests for selective activation checkpointing for layernorm mlp

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* add selective layernorm mlp to te.pytorch

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* update test and fix SLNMLP bug

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* implement slnmlp

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix tests pointed out by greptile app bot, still pass

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* minor formatting change in tests/pytorch/selective_layernorm_mlp/distributed/run_numerics.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>
Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* remove duplicate import in test/pytorch/selective_layernorm_mlp/test_recipe.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* clean up tests, remove unused imports

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* remove unused paths in test_deffered_init

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix issue with zero_centered_gamma in test_numerics reference implementation

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* clean up tests

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* make comparison.py more extensive, cleaner output

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix small typo in tests/pytorch/selective_layernorm_mlp/compare.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>
Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix typo by grepbot in compare.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* make selective activation checkpointing optional in slnmlp via checkpoint flag

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* add comments to clarify logic

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* add checkpoint param to pytests, change compare.py to compare checkpoint=False vs checkpoint=True, skip cuda graph tests for checkpoint=True

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* refactor tests to call modified LayerNormMLP

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* refactor to implement selective activation checkpointing directly into LayerNormMLP, also fix bug to reach cleanup logic in fwd

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix skip explanation for cuda_graphs.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* make _recompute deal with lists instead of tuples

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix MOST cuda graph failures by initializing identical quantizers during fwd. Float8CurrentScaling with bf16 and fp16 still fail with checkpointing

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix cuda graphs issue, all tests pass now

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix small logic bugs, clean up

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* integrate tests into main testing scripts

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* incorporate rng state tracking in checkpointing

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up tests

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix return type mismatches

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* remove checkpoint test from test_recipe, add separate test in test_numerics

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor typo fix

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clear up assertions in tests/pytorch/layernorm_mlp/test_selective_activation_checkpoint.py

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add license and copyright info

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix lint issues in layernorm_mlp

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* fix cpu_offload_v1 error

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* possibly fix recomputation in cuda graph bug

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* skip cuda graphs test for SLNMLP with SM>=10.0 and using delayed scaling

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo for setting IS_FIRST_FP8_MODULE

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>

---------

Signed-off-by: Jaime Cardenas <jaime@evolutionaryscale.ai>
Signed-off-by: Jaime <102792198+jaimec00@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix test_current_device

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* refactor mxfp8_cast_only kernel

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>

* fix ptx.cuh after format

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>

---------

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>
Co-authored-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Disable Flash attention in Userbuffers tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>
…2397)

* Avoid autogenerating docs for Python files with leading underscore

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Do not exclude __init__.py files from doc generation

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Minor CPU overhead changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Cache per device

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Jack <lityangweiguang@163.com>
* ci: Build and attach bdist wheels to release page

Signed-off-by: oliver könig <okoenig@nvidia.com>

* free up space

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanup

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* c28619d8999a147d5e09c1199f84ff6af6ad5794

Signed-off-by: oliver könig <okoenig@nvidia.com>

* c28619d8999a147d5e09c1199f84ff6af6ad5794

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Reduce months to check from 7 to 5

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update .github/scripts/check_for_ngc_images.sh

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update .github/actions/build-pytorch-wheel/build.sh

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…on (#2103)

Signed-off-by: janbernloehr <jan@bernloehrs.de>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Make BSHD default for Unfused DPA, DPA and MHA in TE JAX

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Remove explicit transpose_batch set for BSHD for DPA in JAX quickstart

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Add warnings in DPA and MHA to warn users of change defaults to BSHD instead of SBHD

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* Minimize the scope of when to trigger warnings for changed defaults for transpose_batch_sequence

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…ns_offsets() (#2201)

* Remove unnecessary SWA calculation from _segment_ids_pos_to_seqlens_offsets

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add support for THD+CP+SWA through A2A comms

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* unblock the `padding`+`THD`+`CP(A2A)` with SWA case in A2A forward

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add proper support for thd

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* enable thd+cp tests as essential

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add cp+thd+a2a test to essential

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix comments from greptile

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add proper skip for flash attention

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix the test to create separate tensors for flash and fused attention backend scenarios

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove redundant compare

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* simplify code

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add note for cu_seqlens_kv and cu_seqlens_kv_padded

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Update tests/pytorch/attention/test_attention_with_cp.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Update transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix docs

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix the argument name

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…(#2401)

Only disable Flash Attention in Userbuffers test on A100

Signed-off-by: Tim Moon <tmoon@nvidia.com>
… work (#2416)

* Change order of arguments to make jax work

Signed-off-by: tdophung <tdophung@nvidia.com>

* make num_experts a tl.constexpr again

Signed-off-by: tdophung <tdophung@nvidia.com>

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Add: NVTE_CUDA_ARCHS to README

Signed-off-by: Shoval Atias <satias@satias-mlt.client.nvidia.com>
Co-authored-by: Shoval Atias <satias@satias-mlt.client.nvidia.com>
* allow dp + fsdp and fixed sr_rng_state partitioning

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup for lint test

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
remove linear redundant check

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* minor fix of torch view dtype

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* multi-tensor RHT amax, compiles

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* setup multi_tensor_quantize_nvfp4_impl

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* wire things up and run without crash

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* numerical test

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* unit test passing

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* finish unit test of split quantize api

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* bump up padding to 64 for nvfp4 grouped quantize

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
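
The padding bump mentioned above amounts to rounding each group's row count up to a multiple of 64. A minimal sketch (the helper name is hypothetical, not TE's implementation):

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the nearest multiple, e.g. per-group row padding."""
    return ((n + multiple - 1) // multiple) * multiple

# A group of 100 rows gets padded to 128; an exact multiple stays unchanged.
pad_to_multiple(100)  # 128
pad_to_multiple(64)   # 64
```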

* fix stochastic rounding

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* lint

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* change error message

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* clean up

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* enable multi-amax without RHT

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* fix col-only quantize mode

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* improve benchmark script

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add NCU example script

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add larger test case

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* add contiguous_data_and_scale check to bulk allocator

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* unified naming and differentiate between group_ and multi_

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* move regular amax into multi_tensor.h

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* Disentangle logic for split-quantize and general multi-tensor quantize

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use size_t for split sections

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Suggestions from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… (#2370)

* fix ci issue

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert back testing changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* remove quantizer copy + fused adam working

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix mxfp8 bug, god knows who created it

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/optimizers/fused_adam.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update comment

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
jberchtold-nvidia and others added 10 commits January 15, 2026 14:18
* install cmake in jax build github action

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update build.yml

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
…2595)

Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
* initial impl, not tested

Signed-off-by: tdophung <tdophung@nvidia.com>

* consolidate different unpermute primitives with with_pad and with_merging_probs booleans. Implement partitioning for all permutation primitives

Signed-off-by: tdophung <tdophung@nvidia.com>

* Add distributed test for non-padding permutation

Signed-off-by: tdophung <tdophung@nvidia.com>

* fix issues in distributed test for padding permutation. Make common kernel zero-initialize output permuted scales, permuted probs, and output tokens

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert zeroing in Triton common kernel as it is a race condition. Instead, add an extra input buffer (aliased with the output) to the inner primitive of permutation on the JAX side, to pass in zero-initialized buffers created with jnp.zeros

Signed-off-by: tdophung <tdophung@nvidia.com>
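
The idea behind the commit above can be illustrated with a NumPy sketch (names and shapes are hypothetical, not the actual Triton/JAX code): the caller zero-initializes the output buffer, so the kernel only scatters the rows it owns and never races between zeroing and writing; destinations that no token maps to simply stay zero.

```python
import numpy as np

def scatter_permute(tokens, dest_idx, out):
    # Kernel-side view: write only owned rows. `out` was pre-zeroed by the
    # caller (and aliased as the kernel output), so untouched rows remain zero.
    for src, dst in enumerate(dest_idx):
        if dst >= 0:  # negative index marks a dropped/padded token
            out[dst] = tokens[src]
    return out

tokens = np.arange(6, dtype=np.float32).reshape(3, 2)
dest = [2, -1, 0]                        # token 1 is dropped
out = np.zeros((4, 2), dtype=np.float32)  # caller-side zero init
scatter_permute(tokens, dest, out)
```

After the call, rows 1 and 3 of `out` are still all zeros without any in-kernel zeroing pass.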

* fix utils to handle input output aliasing in autotuned kernels

Signed-off-by: tdophung <tdophung@nvidia.com>

* Clean up comments, and add more comments explaining input output alias in utils

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint and greptile comment

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix issues that lint fixing introduced

Signed-off-by: tdophung <tdophung@nvidia.com>

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add general C API for setting tensor params

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Implement general accessors for NVTETensor

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Refactor tex swizzling to skip if scales are already swizzled

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add checks for non-swizzled scales in MXFP8 and NVFP4 kernels

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Support pre-swizzled scales in MXFP8Tensor

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add tex function to swizzle MXFP8 scales

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug in inplace swizzle function

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweak comments to use "compact/swizzled format"

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* MXFP8 quantize kernel with pre-swizzled scales

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Expose pre-swizzled scales in modules

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug in multi-swizzle

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Support MXFP8 gated activations with swizzled scales

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add PyTorch infrastructure for pre-swizzled NVFP4 tensors

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Deprecate DSv3-specific quantization logic in C API

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove support for DSv3 compact data from quantizer

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove DSv3 compact data format from core lib

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix bug in FP8 all-gather

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warnings

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update JAX to use new swizzled scale API

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Review suggestion from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestions from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update C++ swizzle test with swizzled scales API

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Return default tensor params when querying params for invalid NVTETensor

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug DSv3 FP8 test failures

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug Userbuffers test failures

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Make sure gated activations populate FP8 transpose if needed

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Review suggestions from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Disable pre-swizzling with debug quantizer

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestion from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix merge conflicts and review suggestions

Update copyright years. Tweak comments. Fix various complaints from @greptile-apps.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use explicitly sized types in config accessors

Miscellaneous review suggestions from @ptrendx.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Make util header for function that computes swizzled scale index

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply suggestions from @greptile-apps

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Update expected error message in FP8 block-scaling test

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestion from @yaox12

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Update Dockerfile to use ROCm TheRock
* Update wheels building script to work with ROCm TheRock and the latest Manylinux image
* Support default ROCm location /opt/rocm/core
* Fix UB code build on TheRock
* Support comma separated list of target GPU architectures
* Guess ROCm build from HIP_PLATFORM
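
The last bullet's detection could be sketched as follows (the `HIP_PLATFORM` variable name comes from the commit title; the function name and exact decision logic are assumptions, not the build script's actual code):

```python
import os

def guess_build_backend(env=None):
    """Infer a ROCm build when HIP_PLATFORM is set to "amd", else assume CUDA."""
    if env is None:
        env = os.environ
    return "rocm" if env.get("HIP_PLATFORM", "").lower() == "amd" else "cuda"

guess_build_backend({"HIP_PLATFORM": "amd"})  # "rocm"
guess_build_backend({})                       # "cuda"
```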
@matthiasdiener matthiasdiener self-assigned this Apr 2, 2026
@matthiasdiener matthiasdiener added the ci-level 1 CI test level 1 label Apr 2, 2026
matthiasdiener and others added 11 commits April 2, 2026 14:01
Resolve all merge conflicts from the NVIDIA TE v2.12 upstream merge
and adapt new features for ROCm/HIP compatibility.

Key changes:
- Resolved conflicts in common, PyTorch, and JAX directories
- Guarded NV-only environment variables and test cases for ROCm
- Updated copyright headers
- Adapted new test cases for ROCm compatibility
- Reverted upstream swizzle changes incompatible with ROCm
- Updated get_cublas_workspace_size_bytes for ROCm path
IFU 2.12 merge (see IFU-specific commits for details)
* MXFP8 Cast Kernel Optimizations

* Benchmarking scripts for MXFP8 HIP Casts
We need a label to run CI.