Commit 9b602e8

fix diffAttn requirements in transformer showdown post
1 parent 4e3e62f commit 9b602e8

File tree

1 file changed (+1 −1)


_posts/2025-01-22-transformer-showdown.md

Lines changed: 1 addition & 1 deletion
@@ -202,7 +202,7 @@ _MHA vs GQA vs MQA vs MLA_
![Differential Transformer](assets/img/blogs/transformer_showdown/diff_transformer.png)
_Differential Transformer_

- - Here owing to having two attention units, the number of parameters, activations and KVCache requirements each go up by a factor of 2 compared to GQA.
+ - Even though there are two attention units, [each attention head has half the dimension of the original](https://github.com/microsoft/unilm/blob/7067d6b4ec0b44fd38e29ab3658765abcd9c7441/Diff-Transformer/multihead_diffattn.py#L50), so the number of parameters, activations and KVCache requirements stays the same as that of GQA.
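As a quick sanity check of the corrected line, here is a back-of-the-envelope sketch. The model sizes below are illustrative assumptions, not the repo's actual configuration: with two attention maps per head but each sub-head at half the dimension, the per-token KV cache comes out identical to standard GQA.

```python
# Illustrative sizes only (assumed, not taken from the Diff-Transformer repo).
d_model = 4096
n_heads = 32
n_kv_heads = 8                                   # GQA: 4 query heads share one KV head

# Standard GQA: head_dim = d_model / n_heads
head_dim = d_model // n_heads                    # 128
gqa_kv_per_token = 2 * n_kv_heads * head_dim     # K and V -> 2048 values per token

# Differential attention: two attention maps per head, but each sub-head
# is half the dimension (head_dim = embed_dim // num_heads // 2 in the repo)
diff_head_dim = d_model // n_heads // 2          # 64
diff_kv_per_token = 2 * n_kv_heads * (2 * diff_head_dim)  # still 2048 values

assert gqa_kv_per_token == diff_kv_per_token     # same KV cache as GQA
```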

- ### nGPT
Introduced in a [2024 paper from NVIDIA](https://arxiv.org/abs/2410.01131). The main idea: if normalisation layers are so important to the performance of deep networks and LLMs, why not make normalisation mathematically implicit in the network? With that in mind, at every step we make sure we are interacting with normalised vectors, and only normalised vectors are passed on after every step. This too is said to improve convergence. We discussed this in great detail in one of our other blogs on Substack, [check it out.](https://datta0.substack.com/i/151875954/ngpt-normalized-transformer)
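As a rough illustration of that idea, here is a minimal sketch of an nGPT-style update. The `block` callable and the scalar `alpha` stand in for the paper's attention/MLP sub-layers and learned step sizes, so the names and details here are assumptions, not the paper's exact parametrisation:

```python
import torch
import torch.nn.functional as F

def normalise(x: torch.Tensor) -> torch.Tensor:
    # project the hidden state back onto the unit hypersphere
    return F.normalize(x, p=2, dim=-1)

def ngpt_step(h: torch.Tensor, block, alpha: float) -> torch.Tensor:
    # one residual update: move towards the block's normalised output,
    # then renormalise, so only unit-norm vectors are ever passed on
    h_block = normalise(block(h))        # block = attention or MLP sub-layer
    return normalise(h + alpha * (h_block - h))
```

Because the residual stream is renormalised after every update, there is no separate LayerNorm/RMSNorm anywhere; the normalisation is built into the update itself.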
