Fix multiple critical bugs in Transformer implementation #1
Merged
This commit fixes six critical bugs in the Transformer model implementation that prevented it from training correctly. The fixes include:

1. **Backpropagation Design:** The backpropagation pass was fundamentally flawed, as it lacked the necessary input activations. The `backward` methods for all layers have been refactored to accept the forward-pass inputs, and the training loop now caches these activations.
2. **Gradient Calculations:** The gradient calculation logic in `LayerNorm::backward` and `MultiHeadAttention::backward` was mathematically incorrect. These functions have been rewritten with correct backpropagation implementations.
3. **Unstable Tokenizer:** The tokenizer was rebuilding its vocabulary on every call, making the token-to-word mapping unstable. Vocabulary building is now a separate, one-time step (`build_vocab`), and the `tokenize` function uses only the frozen vocabulary.
4. **Deterministic Sampling:** The random number generator was re-initialized with a fixed seed in the sampling functions, making predictions deterministic. The RNG is now created once and passed as a mutable reference to ensure stateful, non-deterministic sampling.
5. **Identical Embeddings:** The embedding vectors were identical due to a bug in the initialization loop. This was already fixed in the provided code, so the fix was verified.
6. **Unused Custom `exp`:** The `softmax` function was using the standard library's `.exp()` method instead of the custom `exp` function. The call has been corrected.

Additionally, compilation errors related to file paths and borrow checking that arose from the fixes have been resolved.
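The activation-caching pattern behind the backpropagation fix can be sketched as follows. This is a minimal illustration using a hypothetical scalar `Linear` layer, not the PR's actual code: the caller keeps the forward input around and passes it back to `backward`, which needs it to compute the weight gradient.

```rust
// Hypothetical scalar layer: y = w * x.
struct Linear {
    w: f64,
}

impl Linear {
    fn forward(&self, x: f64) -> f64 {
        self.w * x
    }

    // `backward` takes the cached forward input: without `cached_x` the
    // weight gradient d(loss)/dw = grad_out * x cannot be computed.
    // Returns (grad_w, grad_x).
    fn backward(&self, grad_out: f64, cached_x: f64) -> (f64, f64) {
        (grad_out * cached_x, grad_out * self.w)
    }
}

fn main() {
    let layer = Linear { w: 3.0 };
    let x = 2.0;
    let y = layer.forward(x); // training loop caches x alongside y
    let (grad_w, grad_x) = layer.backward(1.0, x);
    println!("{y} {grad_w} {grad_x}"); // 6 2 3
}
```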
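A correct `LayerNorm` backward follows the standard derivation: with normalized values x̂ and standard deviation σ, the input gradient is dx = (dy − mean(dy) − x̂ · mean(dy ⊙ x̂)) / σ. The sketch below (no affine parameters, not the PR's exact code) implements that formula and verifies it against a central finite difference:

```rust
// Forward: return the normalized vector and the standard deviation.
fn layer_norm_forward(x: &[f64], eps: f64) -> (Vec<f64>, f64) {
    let n = x.len() as f64;
    let mean = x.iter().sum::<f64>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let sigma = (var + eps).sqrt();
    let xhat: Vec<f64> = x.iter().map(|v| (v - mean) / sigma).collect();
    (xhat, sigma)
}

// Backward: dx_i = (dy_i - mean(dy) - xhat_i * mean(dy .* xhat)) / sigma.
fn layer_norm_backward(dy: &[f64], xhat: &[f64], sigma: f64) -> Vec<f64> {
    let n = dy.len() as f64;
    let mean_dy = dy.iter().sum::<f64>() / n;
    let mean_dy_xhat = dy.iter().zip(xhat).map(|(a, b)| a * b).sum::<f64>() / n;
    dy.iter()
        .zip(xhat)
        .map(|(d, h)| (d - mean_dy - h * mean_dy_xhat) / sigma)
        .collect()
}

fn main() {
    let x = [1.0, 2.0, 4.0, 7.0];
    let dy = [0.3, -0.5, 0.2, 0.1];
    let eps = 1e-5;
    let (xhat, sigma) = layer_norm_forward(&x, eps);
    let dx = layer_norm_backward(&dy, &xhat, sigma);

    // Check against a central finite difference of L = sum(dy_i * y_i).
    let loss = |x: &[f64]| -> f64 {
        let (y, _) = layer_norm_forward(x, eps);
        y.iter().zip(dy.iter()).map(|(a, b)| a * b).sum()
    };
    let h = 1e-6;
    for i in 0..x.len() {
        let mut xp = x;
        xp[i] += h;
        let mut xm = x;
        xm[i] -= h;
        let numeric = (loss(&xp) - loss(&xm)) / (2.0 * h);
        assert!((numeric - dx[i]).abs() < 1e-4);
    }
    println!("layer-norm backward matches finite differences");
}
```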
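The tokenizer split might look roughly like this (hypothetical struct and signatures, not the repository's actual code): building the vocabulary is a one-time constructor, and `tokenize` only reads the frozen map, so repeated calls always produce the same ids.

```rust
use std::collections::HashMap;

struct Tokenizer {
    vocab: HashMap<String, usize>,
}

impl Tokenizer {
    // One-time step: assign each new word the next free id.
    fn build_vocab(corpus: &str) -> Self {
        let mut vocab = HashMap::new();
        for word in corpus.split_whitespace() {
            let next = vocab.len();
            vocab.entry(word.to_string()).or_insert(next);
        }
        Tokenizer { vocab }
    }

    // Uses only the frozen vocabulary; unknown words map to None here.
    fn tokenize(&self, text: &str) -> Vec<Option<usize>> {
        text.split_whitespace()
            .map(|w| self.vocab.get(w).copied())
            .collect()
    }
}

fn main() {
    let tok = Tokenizer::build_vocab("the cat sat on the mat");
    // Repeated calls yield identical ids for the same words.
    assert_eq!(tok.tokenize("the cat"), tok.tokenize("the cat"));
    println!("{:?}", tok.tokenize("the cat dog")); // [Some(0), Some(1), None]
}
```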
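The sampling fix is the pattern "create the RNG once, thread it through as `&mut`". A dependency-free sketch with a tiny xorshift generator standing in for whatever RNG the project actually uses: re-seeding inside the sampler (the old bug) pins it to one value, while a shared mutable reference keeps it stateful.

```rust
// Minimal xorshift64 generator; stand-in for the real RNG.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

// Buggy pattern: re-seeds on every call, so it always returns the same token.
fn sample_reseeded(n: usize) -> usize {
    let mut rng = XorShift(42);
    (rng.next() % n as u64) as usize
}

// Fixed pattern: the caller owns the RNG and passes it as a mutable reference.
fn sample(rng: &mut XorShift, n: usize) -> usize {
    (rng.next() % n as u64) as usize
}

fn main() {
    // The re-seeded sampler is stuck on a single value...
    assert_eq!(sample_reseeded(1000), sample_reseeded(1000));
    // ...while a single RNG threaded through calls keeps advancing.
    let mut rng = XorShift(42);
    assert_ne!(rng.next(), rng.next());
    let (a, b) = (sample(&mut rng, 1000), sample(&mut rng, 1000));
    println!("samples: {a} {b}");
}
```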
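For the `softmax`/`exp` fix, a softmax that routes through a custom `exp` might look like the sketch below. `exp_approx` here is a hypothetical truncated Taylor series standing in for the project's custom `exp`; shifting by the max keeps its arguments small and non-positive, where the series is accurate.

```rust
// Hypothetical custom exp: truncated Taylor series e^x = sum x^k / k!.
fn exp_approx(x: f64) -> f64 {
    let mut term = 1.0;
    let mut sum = 1.0;
    for k in 1..20 {
        term *= x / k as f64;
        sum += term;
    }
    sum
}

// Numerically stable softmax that calls the custom exp, not f64::exp.
fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&v| exp_approx(v - max)).collect();
    let z: f64 = exps.iter().sum();
    exps.iter().map(|e| e / z).collect()
}

fn main() {
    let p = softmax(&[1.0, 2.0, 3.0]);
    // Probabilities sum to one and preserve the logit ordering.
    assert!((p.iter().sum::<f64>() - 1.0).abs() < 1e-9);
    assert!(p[2] > p[1] && p[1] > p[0]);
    println!("{p:?}");
}
```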