
Conversation

@santhnm2
Contributor

What does this PR do?

Explicitly zeroes out padding token activations for dynamic inference. This is necessary to ensure that padding tokens do not influence quantization scaling factors.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either [email protected] or [email protected].

Merging your PR

Any member of core-adlr or core-nemo will be able to merge your PR.

@santhnm2 santhnm2 requested review from a team as code owners October 28, 2025 20:38
@copy-pr-bot

copy-pr-bot bot commented Oct 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@santhnm2 santhnm2 self-assigned this Oct 28, 2025
Signed-off-by: Keshav Santhanam <[email protected]>
Signed-off-by: Keshav Santhanam <[email protected]>
@santhnm2 santhnm2 requested a review from a team as a code owner November 3, 2025 23:06
Signed-off-by: Keshav Santhanam <[email protected]>

"0": {
    "input_prompt": "Time travel to 2008, and go to a bar or a club or one of the myriad disco-basements on the Lower East Side that does not quite know which of those it is. Dance awkwardly in a room full of other glittered-up nerds, and wait for something to happen, buoyed on the feeling that this is the big swollen heart of life, that this is New York like the movies.",
    "generated_text": " And that this is the place where you can be yourself, and be yourself in the most beautiful way possible. And that this is the place where you",
Contributor

Why is this changing?

Contributor Author

Sorry, I missed this. The outputs were previously incorrect due to the influence of the padding tokens.

Contributor

@deepakn94 deepakn94 left a comment

Seems ok to me, but why are we zeroing out outputs in three specific places (decoder_input in GPTModel and MambaModel, and core_attn_out in Attention)?

@santhnm2
Contributor Author

santhnm2 commented Nov 4, 2025

Seems ok to me, but why are we zeroing out outputs in three specific places (decoder_input in GPTModel and MambaModel, and core_attn_out in Attention)?

These are the places where the hidden states for padding tokens enter as zero but exit as nonzero. This is problematic because these non-zero padding values can corrupt amax calculations.

Signed-off-by: Keshav Santhanam <[email protected]>
@santhnm2
Contributor Author

/ok to test 39794d9

@ko3n1g ko3n1g added this to the Core 0.16 milestone Nov 10, 2025
Contributor

@JRD971000 JRD971000 left a comment

Discussed over Slack, LGTM, thanks!

