High priority
- Grouped MM @tianyu-l (see the grouped-MM sketch after this list)
  - [Bug] Potential bugs in "_grouped_mm" in Llama4 MoE codes #1237
  - with Activation Checkpointing
    - gets stuck after a couple of iterations
  - with AdamW
    - gets stuck after a couple of iterations
  - with torch.compile @bdhirsh
    - may need to register torch._grouped_mm (and the triton kernel for aligning indices) - basic compile support for grouped_mm pytorch#153384
    - compile: turn off fullgraph=True to support llama4 #1182
- auxiliary-loss-free load balancing ([llama4] add auxiliary-loss-free load balancing to MoE token routing #1114) (see the load-balancing sketch after this list)
  - remove (persistent) buffers in checkpoint
  - currently it uses the default stream for blocking communication when DP degree > 1; need to assess whether that is acceptable
- selective activation checkpointing (see the selective-AC sketch after this list)
  - currently we are checkpointing every other matmul, which is not adapted to MoE router gate / torch._grouped_mm ops (potential solution) - solved in Add option for selective op AC to filter mm shapes based on fqn #1380
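
For context on the Grouped MM item above, here is a minimal sketch (not the torchtitan code) of how a grouped GEMM replaces the per-expert loop in the MoE feed-forward. It assumes a recent PyTorch build on a supported GPU and a signature roughly like `torch._grouped_mm(x, w, offs=...)`, with tokens pre-sorted by expert and `offs` holding cumulative group sizes; shapes, dtype, and alignment details here are illustrative assumptions.

```python
import torch

# Illustrative sizes only; group boundaries must satisfy the kernel's alignment
# requirement, which is what the index-aligning triton kernel mentioned above handles.
num_experts, dim, hidden, tokens_per_expert = 4, 64, 128, 16

# Tokens already permuted so each expert's tokens are contiguous (assumed bf16 on CUDA).
x = torch.randn(num_experts * tokens_per_expert, dim, device="cuda", dtype=torch.bfloat16)
w = torch.randn(num_experts, dim, hidden, device="cuda", dtype=torch.bfloat16)

# Cumulative token count per expert, i.e. the end offset of each group.
offs = torch.arange(1, num_experts + 1, device="cuda", dtype=torch.int32) * tokens_per_expert

# Reference: one matmul per expert in a Python loop.
ref = torch.cat(
    [x[i * tokens_per_expert : (i + 1) * tokens_per_expert] @ w[i] for i in range(num_experts)]
)

# Single fused call replacing the loop (assumed signature of the private op).
out = torch._grouped_mm(x, w, offs=offs)
torch.testing.assert_close(out, ref)
```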
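The auxiliary-loss-free load balancing item follows the bias-based routing idea from #1114: a per-expert bias steers top-k selection (but not the combine weights) and is nudged each step toward a balanced load. The sketch below is a rough illustration of that idea, not the PR's code; the bias buffer is the persistent state the checkpoint sub-item refers to, and the per-step load counts are what would need an all-reduce (currently blocking on the default stream) when DP degree > 1.

```python
import torch

num_experts, top_k, bias_update_rate = 8, 2, 1e-3

# Persistent per-expert bias: the buffer the checkpoint sub-item refers to.
expert_bias = torch.zeros(num_experts)

def route(scores: torch.Tensor):
    """scores: (num_tokens, num_experts) router affinities."""
    # The bias affects *which* experts are picked, not the combine weights.
    _, topk_idx = torch.topk(scores + expert_bias, top_k, dim=-1)
    topk_weight = scores.gather(-1, topk_idx).softmax(dim=-1)
    return topk_idx, topk_weight

def update_bias(topk_idx: torch.Tensor):
    # Tokens routed to each expert this step; with DP > 1 this count would be
    # all-reduced across ranks (the blocking-communication sub-item above).
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    # Nudge under-loaded experts up and over-loaded experts down.
    expert_bias.add_(bias_update_rate * torch.sign(load.mean() - load))

scores = torch.rand(32, num_experts)
topk_idx, topk_weight = route(scores)
update_bias(topk_idx)
```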
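For the selective activation checkpointing item, the sketch below shows an "every other matmul" policy built on the selective-checkpoint API in `torch.utils.checkpoint`; it is a hedged approximation of the current behavior, not the exact torchtitan policy. Because it keys on `aten.mm`, the MoE router gate and `torch._grouped_mm` ops fall outside the heuristic, which is what the fqn-based filtering in #1380 addresses.

```python
from collections import defaultdict

import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

def _every_other_mm_policy(meta):
    def policy_fn(ctx, func, *args, **kwargs):
        if func == torch.ops.aten.mm.default:
            # Keep separate counters for the forward and recompute passes
            # so the same matmuls are treated as saved in both.
            key = "recompute_mm" if ctx.is_recompute else "forward_mm"
            meta[key] += 1
            if meta[key] % 2 == 0:
                return CheckpointPolicy.MUST_SAVE  # save every second matmul
        # Everything else (including router gate / grouped-mm ops) is recomputed.
        return CheckpointPolicy.PREFER_RECOMPUTE
    return policy_fn

def context_fn():
    return create_selective_checkpoint_contexts(_every_other_mm_policy(defaultdict(int)))

w1, w2 = torch.randn(16, 16, requires_grad=True), torch.randn(16, 16, requires_grad=True)
x = torch.randn(4, 16, requires_grad=True)

def ff(x):
    # Two matmuls: the policy above saves the second and recomputes the first.
    return torch.mm(torch.relu(torch.mm(x, w1)), w2)

out = checkpoint(ff, x, use_reentrant=False, context_fn=context_fn)
out.sum().backward()
```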
Not high priority for now
- for-loop implementation of MoE (see the sketch after this list)
  - with DTensor TP: sharding propagation overhead due to dynamic shapes
    - need to lift cache hit criteria in DTensor sharding prop
    - may be needed by Loss Parallel for per-sequence loss as well
  - with torch.compile: branching on “unbacked” symbolic ints
    - static padding of DTensor may solve this
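
As a reference for the items above, here is a minimal for-loop MoE forward (illustrative names and sizes, not the torchtitan module). Each expert processes a data-dependent number of tokens per step, which is the source of the dynamic shapes that stress DTensor sharding-propagation caching and show up as unbacked symbolic ints under torch.compile.

```python
import torch
import torch.nn.functional as F

num_experts, top_k, dim, hidden = 4, 2, 32, 64

gate = torch.nn.Linear(dim, num_experts, bias=False)
w_in = torch.randn(num_experts, dim, hidden) * 0.02
w_out = torch.randn(num_experts, hidden, dim) * 0.02

def moe_forward(x):                        # x: (num_tokens, dim)
    scores = F.softmax(gate(x), dim=-1)
    topk_weight, topk_idx = torch.topk(scores, top_k, dim=-1)
    out = torch.zeros_like(x)
    for e in range(num_experts):
        token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue                       # data-dependent branch -> unbacked symint under compile
        # Per-expert token count is dynamic, so shapes differ every iteration/step.
        h = F.silu(x[token_idx] @ w_in[e]) @ w_out[e]
        out.index_add_(0, token_idx, h * topk_weight[token_idx, slot].unsqueeze(-1))
    return out

y = moe_forward(torch.randn(16, dim))
```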
Not llama4 specific