SSM debugging and TP prerequisites #335
Conversation
```diff
-        self.in_proj = nn.Linear(
-            self.d_model, 2 * self.d_xb + 2 * self.d_inner + self.dt_rank, bias=bias, **factory_kwargs
-        )
+        self.in_proj = nn.Linear(self.d_model, 2 * self.d_xb + 2 * self.d_inner, bias=bias, **factory_kwargs)
```
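A minimal sketch of what the separated projections might look like after this change: `dt` gets its own linear layer instead of being fused into `in_proj`, so each projection can be sharded independently under tensor parallelism. The class and attribute names (`Mamba2MixerSketch`, `dt_in_proj`) and the dimension values are assumptions for illustration, not the PR's exact code.

```python
import torch
import torch.nn as nn


class Mamba2MixerSketch(nn.Module):
    """Illustrative only: dt projection split out of the fused in_proj."""

    def __init__(self, d_model=1024, d_inner=2048, d_xb=512, dt_rank=64, bias=False):
        super().__init__()
        # Before: one fused projection produced z, x, B, C and dt together.
        # After: in_proj only produces z, x, B, C; dt has its own linear layer.
        self.in_proj = nn.Linear(d_model, 2 * d_xb + 2 * d_inner, bias=bias)
        self.dt_in_proj = nn.Linear(d_model, dt_rank, bias=bias)  # hypothetical name

    def forward(self, hidden_states):
        zxbc = self.in_proj(hidden_states)   # (batch, seq, 2*d_xb + 2*d_inner)
        dt = self.dt_in_proj(hidden_states)  # (batch, seq, dt_rank)
        return zxbc, dt
```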
Seems like this separation is necessary to enable TP. However, it is not backward compatible with existing M2 checkpoints and requires manually editing their state dictionaries to work with this.
That's correct, but my understanding is that there isn't any such checkpoint yet that we want to keep?
I had some checkpoints I trained with the previous code. I manually altered the state dicts, so I think it should be ok.
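For reference, a sketch of how such a manual migration might look: split the old fused `in_proj` weight into the new `in_proj` and a separate dt projection. The key names and the assumption that the dt rows come last in the fused weight are guesses about the checkpoint layout, not taken from the PR.

```python
import torch


def migrate_in_proj(state_dict, prefix, d_inner, d_xb, dt_rank):
    """Hypothetical helper: split a fused in_proj weight for the new layout."""
    # Old fused weight: (2*d_xb + 2*d_inner + dt_rank, d_model)
    w = state_dict.pop(f"{prefix}.in_proj.weight")
    main, dt = torch.split(w, [2 * d_xb + 2 * d_inner, dt_rank], dim=0)
    state_dict[f"{prefix}.in_proj.weight"] = main
    state_dict[f"{prefix}.dt_in_proj.weight"] = dt  # hypothetical key name
    return state_dict
```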
✨ Description
Make the transformer debugging tools available to SSMs, mainly through a base Mixer class. Bring in the breaking changes for Mamba2 to enable direct model comparison (separate the first dt layer, fix initialization).
Also bring in some extra changes from #333; it was simpler to do here and will help reduce the size of that PR.
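To illustrate the idea of sharing debugging tools through a base Mixer class, here is a rough sketch: a common base class exposes a debug hook that both attention and SSM mixers can call on intermediate activations. The class names, hook name, and logging behavior are assumptions for illustration, not this repository's actual API.

```python
import torch
import torch.nn as nn


class Mixer(nn.Module):
    """Hypothetical shared base class giving every mixer the same debug hooks."""

    def __init__(self, debug_level: int = 0):
        super().__init__()
        self.debug_level = debug_level

    def _debug_log(self, name: str, tensor: torch.Tensor) -> None:
        # Shared debugging utility available to attention and SSM subclasses alike.
        if self.debug_level > 0:
            print(f"[{type(self).__name__}] {name}: shape={tuple(tensor.shape)}, "
                  f"mean={tensor.float().mean().item():.4e}")


class SSMMixerSketch(Mixer):
    def __init__(self, d_model: int, d_inner: int, debug_level: int = 0):
        super().__init__(debug_level)
        self.in_proj = nn.Linear(d_model, d_inner)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.in_proj(x)
        self._debug_log("in_proj_out", y)  # same hook a transformer mixer would use
        return y
```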