Add Megatron-LM cross-entropy integration #1207
Open
PrathyushaPolepalli wants to merge 1 commit into
Conversation
Mecoli1219 (Collaborator) requested changes on May 13, 2026
Overall looks great! Excited to support Megatron with Liger. Left some comments to address.
Comment on lines +28 to +32:

```python
if tp_size > 1:
    raise RuntimeError(
        f"apply_liger_kernel_to_megatron currently requires tensor_model_parallel_size=1, "
        f"got {tp_size}. Vocab-parallel cross-entropy support is planned as follow-up work."
    )
```
Collaborator

This is a constraint that needs to be addressed in the future, given that TP is a common use case in Megatron, but it's a great start on supporting Megatron!

BTW, does this patching also not support other parallel strategies (Sequence Parallelism, etc.)?
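For intuition on why TP > 1 is the hard case: with vocab-parallel logits, each rank only sees a slice of the vocabulary, so the softmax denominator has to be combined across ranks before the loss can be computed. A minimal pure-Python sketch of that reduction (the `shard_stats`/`combine_logsumexp` names are illustrative, not from the PR):

```python
import math

def shard_stats(shard):
    """Per-shard statistics a vocab-parallel CE computes locally:
    the shard max and the sum of exp(logit - max)."""
    m = max(shard)
    s = sum(math.exp(x - m) for x in shard)
    return m, s

def combine_logsumexp(stats):
    """The cross-rank reduction (an all-reduce in Megatron) that merges
    per-shard stats into the global softmax denominator."""
    g = max(m for m, _ in stats)
    return g + math.log(sum(s * math.exp(m - g) for m, s in stats))

logits = [2.0, -1.0, 0.5, 3.0, 1.5, -0.5]   # one row, V=6
shards = [logits[:3], logits[3:]]            # simulated tp_size=2 vocab split
combined = combine_logsumexp([shard_stats(s) for s in shards])
full = math.log(sum(math.exp(x) for x in logits))
assert abs(combined - full) < 1e-12
```

Liger's kernel computes the log-sum-exp for the full `[N, V]` row in one pass, so without this extra communication step it cannot operate on a `[N, V/tp]` shard.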
Comment on lines +50 to +51:

```python
global _ACTIVATION_LOGGED
if not _ACTIVATION_LOGGED:
```

```python
    return liger_fused_vocab_parallel_cross_entropy


def apply_liger_kernel_to_megatron(
```
Collaborator

Can we move it to another file like monkey_patch.py under the same directory? If we want to add more kernels besides CE, it would be cleaner to separate the framework-level and kernel-specific logic. You can mirror `src/liger_kernel/transformers/`:

```
src/liger_kernel/megatron/
    monkey_patch.py          # apply_liger_kernel_to_megatron + TP check
    cross_entropy.py         # _build_wrapper + _patch_fused_vocab_parallel_ce
    other_future_kernel.py
```
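A hypothetical sketch of that split: the function names mirror the review comment (`_build_wrapper`, `apply_liger_kernel_to_megatron`), but the bodies are illustrative, not the actual PR code.

```python
# cross_entropy.py -- kernel-specific logic
def _build_wrapper(liger_ce):
    """Adapt the Liger kernel to Megatron's CE call signature, with the
    call-time TP guard the PR description mentions (hypothetical)."""
    def fused_vocab_parallel_cross_entropy(logits, target, tp_group=None):
        if tp_group is not None and tp_group.size() > 1:
            raise RuntimeError("Liger CE requires tensor_model_parallel_size=1")
        return liger_ce(logits, target)
    return fused_vocab_parallel_cross_entropy

# monkey_patch.py -- framework-level entry point
def apply_liger_kernel_to_megatron(megatron_module, liger_ce):
    """Swap the module-level CE symbol for the Liger-backed wrapper."""
    megatron_module.fused_vocab_parallel_cross_entropy = _build_wrapper(liger_ce)
```

In the real patch the Megatron module and Liger kernel would be imported rather than passed as parameters; they are arguments here only to keep the sketch self-contained.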
Summary

Adds `apply_liger_kernel_to_megatron()`, a monkey-patch that swaps Megatron-LM's native `fused_vocab_parallel_cross_entropy` for Liger's Triton cross-entropy kernel. This enables online softmax, in-place gradients, and no full-softmax materialization inside Megatron training pipelines.

Scope:

`tensor_model_parallel_size=1` only. With TP > 1, each rank holds a sharded `[N, V/tp]` logits slice, and CE requires cross-rank all-reduces that Liger's kernel does not perform. The patch raises `RuntimeError` at patch time (via `megatron.core.parallel_state`) and again at call time (via the `tp_group` argument Megatron passes), so misconfiguration fails loudly. Vocab-parallel support is follow-up work.

Tested on Qwen3-30B-A3B scaled MoE, 1× H100_8, BF16:
Model config:
Parallelism:
Training config:
Throughput results:

| | Throughput | Iter time |
|---|---|---|
| Megatron native fused CE (baseline) | ~99 TFLOP/s/GPU | ~39,400 ms |
| Liger CE (this PR) | ~108 TFLOP/s/GPU (+9%) | ~35,900 ms |
Numerical correctness: lm_loss ~4.1e-3 in both, no NaN/skipped iterations.
Variance: Liger CE 107.7-109.1 TFLOP/s/GPU (consistent).
Test setup: Single H100 80GB, sequence length S=2048, batch size B=4, vocab sizes 4K → 131K. Each provider computes the same cross-entropy operation, just with a different implementation.
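As a reference for what "same operation, different implementation" means here, a pure-Python sketch of cross-entropy computed in a single streaming pass, which is the online-softmax idea behind Liger's kernel (illustrative only; the actual kernel does this in Triton over blocked vocab tiles):

```python
import math

def online_softmax_cross_entropy(logits, target):
    """CE in one pass over the vocab: keep a running max and a running,
    rescaled denominator instead of materializing the full softmax."""
    m, d = float("-inf"), 0.0
    for x in logits:
        new_m = max(m, x)
        d = d * math.exp(m - new_m) + math.exp(x - new_m)
        m = new_m
    # loss = logsumexp(logits) - logits[target]
    return m + math.log(d) - logits[target]

logits = [2.0, 0.5, -1.0, 3.0]
loss = online_softmax_cross_entropy(logits, target=3)
naive = -math.log(math.exp(logits[3]) / sum(math.exp(x) for x in logits))
assert abs(loss - naive) < 1e-12
```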
Testing Done
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence