Add validation for LinearCrossEntropyLoss with custom_sharded_layers #2900
Conversation
When using LinearCrossEntropyLoss with custom_sharded_layers in FSDP, 'output' must be included in the layer list to ensure tensor type compatibility.

- Added shared validation module in recipes/validation.py
- Integrated validation into full_finetune_distributed and lora_finetune_distributed
- Added comprehensive unit tests
- Provides a clear error message to guide users to the correct configuration

Fixes pytorch#2856
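As a rough illustration, a minimal sketch of the validation helper described above (the function name and exact structure are assumptions, not necessarily what recipes/validation.py contains):

```python
# Hypothetical sketch of the setup-time check added in this PR.
from torchtune.modules.loss import LinearCrossEntropyLoss


def validate_custom_sharded_layers(loss_fn, custom_sharded_layers):
    """Fail fast at setup time instead of with a DTensor error mid-training."""
    if not custom_sharded_layers:
        return
    if isinstance(loss_fn, LinearCrossEntropyLoss) and "output" not in custom_sharded_layers:
        raise ValueError(
            "When using LinearCrossEntropyLoss with custom_sharded_layers, "
            "'output' must be included to ensure tensor compatibility. "
            "Example: custom_sharded_layers = ['tok_embeddings', 'output']."
        )
```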
@nathan-az Hey Nathan, apologies for the confusion. I had meant to revise the current PR but created a new branch without thinking. Wanted to give you a little context for this. After looking into it, here's what I found:
Key changes from the previous attempt:
I guess my "fix" here is just making sure users get a descriptive error message rather than a runtime error during training - particularly for LoRA users who might not realize they need to include 'output' when using custom_sharded_layers. Thanks again for your review. Your notes were super helpful, and I learned some cool stuff working through this :)
Hey @jscaldwell55 - thanks again for your work here! Two requests if you have the time. In addition, could you provide updated traceback/error logs showing what happens when you run your config on current main? It's not immediately obvious to me why leaving 'output' out of custom_sharded_layers causes this error.
@nathan-az Got to work on this some this morning and completed the testing you requested. Here are my findings:

Test Environment
Test Results

Single-GPU Testing

✅ Training runs successfully without DTensor errors in single-GPU mode with the problematic configuration:

loss:
  _component_: torchtune.modules.loss.LinearCrossEntropyLoss
custom_sharded_layers: ['tok_embeddings']  # Without 'output'

However, single-GPU runs don't trigger the FSDP DTensor wrapping that causes the issue.

Multi-GPU Evidence

PyTorch 2.8.0.dev20250625

Key Finding: The error explicitly occurs at cross_entropy_loss.py:71.

Analysis
Note on Multi-GPU Testing

I wasn't able to generate updated error logs from multi-GPU testing due to resource constraints (I only have access to a single GPU via Colab). The single-GPU test doesn't trigger the FSDP DTensor wrapping, so it runs without errors. The most recent multi-GPU error logs I could find are from issue #2856, which show the error occurring at the same location in the code (cross_entropy_loss.py:71). While these logs are from PyTorch 2.8.0.dev20250625, the error mechanism appears unchanged based on:

- The code structure in current main still has the same F.linear call

For next steps, would you like me to:

- Try to find a multi-GPU environment for definitive testing?

Please let me know what you're thinking re: the best path forward. Really appreciate your patience and feedback as I work through this :)
Hey mate, thanks for checking in on this. I have access to hardware again, so I ran some tests. I ran two configs - the Llama 3.1 8B LoRA config and the Llama 3.2 1B LoRA config, both with LinearCrossEntropyLoss and custom_sharded_layers: ['tok_embeddings']. I was not able to replicate the issue with 8B, but I was with 1B. I believe the key difference in architecture between these two is that the 1B model uses tied weights between the token embeddings and final output projection. I think this is likely why we see issues when one is sharded and the other is not. If this is correct, I think adding a layer of validation would be useful, but it should factor this in.
@nathan-az Awesome, thanks for running those tests! This definitely makes sense to me - models with tied embeddings (where tok_embeddings and output share the same weights) are the ones affected. I'll update the validation to be more precise: only require 'output' in custom_sharded_layers when the model ties the embedding and output weights.

This way we can avoid unnecessary restrictions on models that don't have this architectural constraint. Quick question: Should I also check for the reverse case (where 'output' is included but 'tok_embeddings' isn't)?
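For reference, a rough sketch of how a tie-aware check might look (the helper name and attribute access are assumptions - torchtune models may expose weight tying differently):

```python
# Hypothetical helper: detect whether the output projection shares its weight
# with the token embeddings (i.e. tied weights). Attribute names assume the
# usual tok_embeddings / output module layout.
import torch.nn as nn


def embeddings_are_tied(model: nn.Module) -> bool:
    tok_emb = getattr(model, "tok_embeddings", None)
    output = getattr(model, "output", None)
    if tok_emb is None or output is None:
        return False
    emb_w = getattr(tok_emb, "weight", None)
    out_w = getattr(output, "weight", None)
    # Tied weights share the same underlying Parameter object.
    return out_w is not None and out_w is emb_w
```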
No worries! So - one thing you reported that I haven't been able to replicate: custom sharding of both layers also throws the type mismatch error for me when the layers are tied. I'm not an expert in how FSDP (and FSDP2) work, but I see that the

@ebsmothers @joecummings sorry for the direct mention - if anything is immediately obvious to either of you (e.g. does my theory have merit, is there an obvious/clean fix), that would be great. No pressure to look deeper, I just don't have that much experience with how FSDP does sharding.

IMO - if sharding tied weights doesn't currently work, it's not a huge deal and we can just add validation to confirm. By default, the transformer layers are still sharded when FSDP is used (this can be confirmed in the sharding code), and because these weights are tied, they consume half as much memory as they would by default.
@nathan-az I'll hold off on implementing anything until we hear from the other reviewers. Happy to go with whatever makes the most sense to y'all. Really appreciate you digging into this with me; I learn so much getting into these distributed training edge cases. |
Summary
Fixes #2856 - DTensor/torch.Tensor mixed type error in Llama4 LoRA fine-tuning
Problem
When using LoRA fine-tuning with LinearCrossEntropyLoss and custom_sharded_layers, users encounter a tensor type mismatch error:

RuntimeError: aten.mm.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
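For illustration, a minimal, self-contained sketch (outside torchtune, and only an assumption about the mechanism) that triggers the same class of mixed-tensor failure by calling F.linear with a DTensor weight and a plain-tensor input:

```python
# Hypothetical standalone repro of the mixed DTensor/torch.Tensor error; this is
# not the torchtune code path, just the same failure mode in isolation.
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

# Single-process group on CPU, just so a DTensor can be constructed.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1)
mesh = init_device_mesh("cpu", (1,))

weight = distribute_tensor(torch.randn(8, 4), mesh, [Shard(0)])  # DTensor, like an FSDP-sharded param
hidden = torch.randn(2, 4)                                       # plain torch.Tensor activations

try:
    F.linear(hidden, weight)  # mixes a plain tensor with a DTensor weight
except RuntimeError as err:
    print(err)  # "... got mixed torch.Tensor and DTensor ..."
finally:
    dist.destroy_process_group()
```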
This happens because:

- LinearCrossEntropyLoss uses model.output for the final projection
- custom_sharded_layers = ['tok_embeddings'] is set without including 'output'
- FSDP wraps the layers listed in custom_sharded_layers as DTensors

Solution
Added validation that checks if LinearCrossEntropyLoss is used with custom_sharded_layers and ensures 'output' is included in the list. This provides a clear, actionable error message at setup time rather than a cryptic error during training.

Implementation
- Added a shared validation module in recipes/validation.py to avoid code duplication
- Integrated into the full_finetune_distributed.py and lora_finetune_distributed.py recipes
- Called in _setup_model before FSDP wrapping occurs

Testing

- Added unit tests in tests/recipes/test_validation.py
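As a rough sketch (not the actual contents of the test file), tests along these lines could exercise the check, assuming the helper sketched earlier lives in recipes/validation.py:

```python
# Hypothetical tests; the helper name, import path, and default
# LinearCrossEntropyLoss constructor arguments are assumptions.
import pytest
from torchtune.modules.loss import LinearCrossEntropyLoss

from recipes.validation import validate_custom_sharded_layers  # assumed module from this PR


def test_missing_output_raises():
    loss = LinearCrossEntropyLoss()
    with pytest.raises(ValueError, match="'output' must be included"):
        validate_custom_sharded_layers(loss, ["tok_embeddings"])


def test_output_included_passes():
    loss = LinearCrossEntropyLoss()
    validate_custom_sharded_layers(loss, ["tok_embeddings", "output"])
```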
Example Error Message
When misconfigured, users will now see:
ValueError: When using LinearCrossEntropyLoss with custom_sharded_layers, 'output' must be included to ensure tensor compatibility. Example: custom_sharded_layers = ['tok_embeddings', 'output'].
This guides users to the correct configuration immediately.