[Bug][EAGLE] Crash when using mix hidden states #1088
Before submitting an issue, please make sure it hasn't been already addressed by searching through the existing and past issues.
Describe the bug
- When enabling mix hidden states mode in EAGLE3 training, the run crashes with an autograd error:
[rank1]: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDABFloat16Type [2, 4096, 2880]], which is output 0 of IndexPutBackward0, is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
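This class of error occurs when a tensor saved for backward is modified in place afterwards, so its version counter no longer matches the saved version. A minimal standalone sketch (illustrative only, not the ModelOpt code; the tensor names and shapes are made up) that reproduces the same `RuntimeError` via an in-place index assignment:

```python
import torch

# Stand-in for a hidden-states tensor that requires grad (illustrative shapes).
hidden = torch.zeros(2, 4, 8, requires_grad=True)
mixed = hidden.clone()

# In-place index_put_ via tensor indexing bumps the version counter.
idx = torch.tensor([0])
mixed[idx] = 1.0            # version 1

# mul saves `mixed` (at version 1) for its backward pass.
out = mixed * mixed

# A second in-place write bumps the version again (version 2), so the
# saved tensor no longer matches what backward expects.
mixed[idx] = 3.0

try:
    out.sum().backward()
except RuntimeError as e:
    # e.g. "one of the variables needed for gradient computation has been
    # modified by an inplace operation ..."
    print(type(e).__name__)
```

A common fix is to avoid the second in-place write on a tensor already saved for backward, e.g. by cloning before mutating (`mixed = mixed.clone()` between the two writes) or by using an out-of-place update such as `torch.where`. `torch.autograd.set_detect_anomaly(True)` can help locate the offending operation.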
Steps/Code to reproduce bug
Encountered while developing on a WIP branch for GPT-OSS EAGLE3 training, which includes a few local changes that may be related. Filing this issue for tracking in case I cannot solve it myself.
Expected behavior
Who can help?
- ?
System information
- Container used (if applicable): ?
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ?
- CPU architecture (x86_64, aarch64): ?
- GPU name (e.g. H100, A100, L40S): ?
- GPU memory size: ?
- Number of GPUs: ?
- Library versions (if applicable):
- Python: ?
- ModelOpt version or commit hash: ?
- CUDA: ?
- PyTorch: ?
- Transformers: ?
- TensorRT-LLM: ?
- ONNXRuntime: ?
- TensorRT: ?
- Any other details that may help: ?