So without further ado, let's get comparing.

### Multi Head Attention (MHA)

The standard attention introduced in Attention Is All You Need, and very commonly used in models before 2023. Each layer has an equal number of query, key and value heads: if a layer has `h` heads, we'd have `h` queries, `h` keys and `h` values.

$$
Q_i = W_{q_i} X, \quad K_i = W_{k_i} X, \quad V_i = W_{v_i} X \quad \text{where X is the input}
$$

$$
Q = [Q_1, Q_2, ..., Q_h], \quad K = [K_1, K_2, ..., K_h], \quad V = [V_1, V_2, ..., V_h] \quad \text{where [ ] is concatenation}
$$

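To make the shapes concrete, here is a minimal PyTorch sketch of these per-head projections for a single layer (sizes are hypothetical and chosen for illustration; masking and batching are omitted):

```python
import torch
import torch.nn as nn

h, d, t = 8, 64, 16                  # heads, head dim, tokens
hidden = h * d                       # model (hidden) size

# one (hidden -> d) projection per head for Q, K and V, as in the equations above
W_q = nn.ModuleList([nn.Linear(hidden, d, bias=False) for _ in range(h)])
W_k = nn.ModuleList([nn.Linear(hidden, d, bias=False) for _ in range(h)])
W_v = nn.ModuleList([nn.Linear(hidden, d, bias=False) for _ in range(h)])
W_o = nn.Linear(hidden, hidden, bias=False)

X = torch.randn(t, hidden)
heads = []
for i in range(h):
    Q_i, K_i, V_i = W_q[i](X), W_k[i](X), W_v[i](X)   # (t, d) each
    scores = (Q_i @ K_i.T) / d**0.5                   # (t, t) attention scores
    heads.append(scores.softmax(dim=-1) @ V_i)        # (t, d)

out = W_o(torch.cat(heads, dim=-1))                   # concatenate heads -> (t, h*d)
```
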
**Config**: `n` layers. Each layer has `h` heads. Each head has `d` dimensions. Total token count `t`.

**Parameters**: $$W_{q_i}$$ is of shape $$(h*d, d)$$, so it has $$h*d^2$$ parameters per head. Same for $$W_{k_i}, W_{v_i}$$. So $$W_q + W_k + W_v$$ contribute a total of $$3*h*(h*d^2)$$ parameters. $$W_o$$ is of size $$(h*d, h*d)$$, so $$h^2*d^2$$ parameters. Total of $$4*n*h^2*d^2$$.

For example, [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/config.json) has [32 attention heads](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/config.json#L14) and [32 key value heads](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/config.json#L16), so Llama 2 7B uses MHA. It has a [hidden_size of 4096](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/config.json#L9), which means each head has a head_dim (`d`) of **128**. So the algebra tells us that $$W_{q_i}$$ would be of shape $$(128*32, 128) = (4096, 128)$$. Each $$W_q$$ (similarly $$W_k, W_v$$) would be of shape $$(4096, 128*32) = (4096, 4096)$$, contributing $$128^2 * 32^2 = 16,777,216$$ parameters. Executing the code below gives the same result. Voila.

```python
llama2 = AutoModelForCausalLM.from_pretrained(
...
Wo shape is torch.Size([4096, 4096]) contributes to 16777216 parameters
```

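For reference, a minimal sketch of how those shapes can be inspected (the `q_proj`/`k_proj`/`v_proj`/`o_proj` attribute names follow the Hugging Face `transformers` Llama implementation; the loop itself is illustrative rather than the original snippet):

```python
from transformers import AutoModelForCausalLM

llama2 = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

attn = llama2.model.layers[0].self_attn   # attention block of the first decoder layer
for name in ("q_proj", "k_proj", "v_proj", "o_proj"):
    w = getattr(attn, name).weight
    print(f"{name} shape is {tuple(w.shape)} contributes to {w.numel()} parameters")
```
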
**Activations**: Each token's query is of size `d` (per head). The same holds for keys and values, hence a total of $$3*n*h*d$$ per token. The final output is of the same shape as well. The attention scores, one per pair of input tokens, form a matrix of size $$t*t$$, hence $$t^2$$. So a total of $$4*n*h*d*t + n*h*t^2$$.

**KVCache**: Each key and value is of size `d` per token per head per layer. Hence a total of $$n*h*d*t$$ for keys (and the same for values), so the size of the KV Cache is $$2*n*h*d*t$$.

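Plugging in the Llama-2-7B numbers from above, a rough back-of-the-envelope estimate, assuming fp16 and a 4096-token context:

```python
n, h, d, t = 32, 32, 128, 4096   # layers, heads, head dim, context length
fp16_bytes = 2

kv_cache_bytes = 2 * n * h * d * t * fp16_bytes
print(kv_cache_bytes / 2**30, "GiB")   # -> 2.0 GiB
```
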
### Multi Query Attention (MQA)

A small modification of MHA. Instead of having one key and value per query head, we only have a single key (and value) per token, and all the query heads compute similarity against it. This results in each layer having `h` queries, `1` key and `1` value. The advantage is that if you're saving a KVCache to speed up inference, the cache is reduced by `h` times. But this is not as performant as MHA, as we're reducing the scope of information stored in the keys to a single vector.

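Shape-wise, the single key/value head is simply broadcast across all the query heads. A small illustrative sketch (hypothetical sizes, no masking or batching):

```python
import torch

t, h, d = 16, 8, 64                      # tokens, query heads, head dim
q = torch.randn(t, h, d)                 # h query heads
k = torch.randn(t, 1, d)                 # a single key head ...
v = torch.randn(t, 1, d)                 # ... and a single value head

k, v = k.expand(t, h, d), v.expand(t, h, d)              # broadcast to every query head
scores = torch.einsum("qhd,khd->hqk", q, k) / d**0.5     # (h, t, t)
out = torch.einsum("hqk,khd->qhd", scores.softmax(-1), v).reshape(t, h * d)
```
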
**Parameters**: $$W_{q_i}$$ is of shape $$(h*d, d)$$, so across the `h` heads $$W_q$$ has $$h^2*d^2$$ parameters. As for $$W_{k_i}, W_{v_i}$$, they output a single vector shared by all heads, so each is of shape $$(h*d, d)$$ and hence has $$h*d^2$$ parameters. $$W_o$$ is of size $$(h*d, h*d)$$, so $$h^2*d^2$$ parameters. Total of $$2*n*h^2*d^2 + 2*n*h*d^2$$.

**Activations**: Each token's query is of size `d` (per head), so $$h*d$$. There is only one key shared across all the heads, hence only $$2*d$$ (key and value). Hence a total of $$n*h*d*t + 2*n*d*t$$. The final output is of the same shape as the query. The attention scores form a matrix of size $$t*t$$, hence $$t^2$$. So a total of $$2*n*h*d*t + 2*n*d*t + n*h*t^2$$.

**KVCache**: Each key and value is of size `d` per token per layer. Hence a total of $$2*n*d*t$$. A compression of `h` times compared to MHA.

### Grouped Query Attention (GQA)

This acts as a middle ground between MHA and MQA. Instead of one key and value catering to all the queries, we have one key and value catering to a group of queries. So we'd have `h` queries, `g` keys and `g` values, where `g` divides `h`; for each layer, you'd be storing `2*g` embedding vectors per token. You'll find a lot of models that use this architecture. Generally speaking, a single key/value head caters to `4` to `8` query heads. You can identify whether a model uses this when you see [`num_attention_heads`](https://huggingface.co/meta-llama/Llama-3.1-8B/blob/main/config.json#L16) `≠` [`num_kv_heads`](https://huggingface.co/meta-llama/Llama-3.1-8B/blob/main/config.json#L18) in the model's config.json.

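A quick, illustrative way to check this from the config (assuming the usual Hugging Face field names and that you have access to the repo):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B")
print(cfg.num_attention_heads, cfg.num_key_value_heads)   # 32 vs 8 -> GQA, 4 query heads per KV head
```
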
**Parameters**: $$W_{q_i}$$ is of shape $$(h*d, d)$$, so $$W_q$$ has $$h^2*d^2$$ parameters. As for $$W_{k_i}, W_{v_i}$$, they output one vector per group of heads, so each is of shape $$(h*d, g*d)$$ and hence has $$g*h*d^2$$ parameters. $$W_o$$ is of size $$(h*d, h*d)$$, so $$h^2*d^2$$ parameters. Total of $$2*n*h^2*d^2 + 2*n*g*h*d^2$$.

```python
llama3 = AutoModelForCausalLM.from_pretrained(
...
```

**Activations**: Each token's query is of size `d` (per head), resulting in an $$h*d$$ sized tensor. There is one key (and value) per group of heads, hence only $$d*g$$ each, which together add up to $$2*d*g$$ per token. Hence a total of $$n*h*d*t + 2*n*g*d*t$$. The final output is of the same shape as the query. The attention scores form a matrix of size $$t*t$$, hence $$t^2$$. So a total of $$2*n*h*d*t + 2*n*g*d*t + n*h*t^2$$.

**KVCache**: Each key and value is of size `d` per token per layer per group. Hence a total of $$2*n*g*d*t$$. A compression of `h/g` times compared to MHA.

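Putting the three cache formulas side by side for a Llama-2-7B-sized config (an illustrative estimate: fp16, a 4096-token context, and `g = 8` are assumptions for the sake of the comparison, not Llama 2's actual setting):

```python
n, h, d, t, g = 32, 32, 128, 4096, 8   # layers, query heads, head dim, tokens, kv heads
fp16_bytes = 2

mha = 2 * n * h * d * t * fp16_bytes   # every head caches its own K and V
mqa = 2 * n * d * t * fp16_bytes       # one K/V head shared by all query heads
gqa = 2 * n * g * d * t * fp16_bytes   # one K/V head per group of h/g query heads

print(f"MHA {mha / 2**20:.0f} MiB, MQA {mqa / 2**20:.0f} MiB, GQA {gqa / 2**20:.0f} MiB")
# -> MHA 2048 MiB, MQA 64 MiB, GQA 512 MiB
```
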
### Multi Head Latent Attention (MLA)

A new architecture found in the DeepSeek V2 family of models. Here, we compress the keys and values into a latent space and decompress them back to the original space when inference takes place. The idea is to get the advantages of MHA while saving on KVCache, which scales linearly with context length. Each key and value is compressed from `d` dimensions down to a `c` dimensional space.

_Share of eigen values contributing to 90% in weight_

$$
c_t^{KV} = W^{DKV} X, \quad \text{where } c_t^{KV} \in \mathbb{R}^{c} \quad \text{is the down projection of keys}
$$

$$
k_t^C = W^{UK} c_t^{KV} \quad \text{up projection of keys}
$$

$$
v_t^C = W^{UV} c_t^{KV} \quad \text{up projection of values}
$$

$$
c_t^{Q} = W^{DQ} X \quad \text{where } c_t^{Q} \in \mathbb{R}^{c}
$$

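A minimal PyTorch sketch of the key/value path described by these equations (dimensions are hypothetical, and the decoupled RoPE branch mentioned below is left out). The point is that only the latent `c_kv` needs to be cached per token:

```python
import torch
import torch.nn as nn

hidden, c, h, d, t = 1024, 128, 8, 64, 16   # model dim, latent dim, heads, head dim, tokens

W_dkv = nn.Linear(hidden, c, bias=False)    # down projection shared by keys and values
W_uk  = nn.Linear(c, h * d, bias=False)     # up projection for keys
W_uv  = nn.Linear(c, h * d, bias=False)     # up projection for values

X = torch.randn(t, hidden)
c_kv = W_dkv(X)                             # (t, c) -- this is all the KV cache has to store
k = W_uk(c_kv).view(t, h, d)                # keys reconstructed on the fly at inference
v = W_uv(c_kv).view(t, h, d)                # values reconstructed on the fly at inference
```
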
**KVCache**: Each compressed vector is of size `c` per token per layer per group. Hence a total of $$n*g*c*t$$. Keys and values are inferred by decompressing this ($$k_t^C, v_t^C$$). A compression of `2*d/c` times compared to MHA. Note that in the final implementation there's a nuance of additional heads (and hence keys and values) for RoPE, which adds a little more overhead. So the compression ratio essentially becomes $$2*d/(c+r)$$, where `r` is the RoPE key dimension.

This image from the DeepSeek V2 paper gives a crisp view of the above mentioned architectures.

_MHA vs GQA vs MQA vs MLA_

- Apart from more normalisations, there isn't much that would meaningfully contribute to parameters, activations or KVCache compared to GQA.

## Results and Findings

So now that the introductions are out of the way, the burning question is: do the changes contribute to any meaningful differences in the final performance of the models?

Well, the answer is nuanced. Let's see how they stack up.

_Train losses on minipile dataset_

On the [minipile dataset](https://huggingface.co/datasets/jeankaddour/minipile), which is approximately 10x larger than the wiki data, I saw that there isn't much to choose between MLA, MHA, GQA and DiffAttention. Which is great, since GQA uses 4x fewer keys and values, resulting in a 4x smaller KVCache. Surprisingly, nGPT's losses seem to go down as low as 0.2 while the others hover around 3. I tried to repeat the experiment multiple times with multiple configs, only to find a similar loss curve. I also checked validation loss for all the models; the curves look very similar to the train loss curves, so there isn't much value in plotting those. We will have to look into why this is the case, but it definitely is fascinating.

## Conclusion

All in all, GQA offers a very good alternative to MHA, sometimes even outperforming it while also using 4-8x less space for the KVCache. MLA builds upon that by compressing the keys and values even further; it turns out this also acts as regularisation. Normalisation is the king of them all: given that normalisation is a key component in deep learning, it is no surprise that making it explicit for every operation pays off. This opens up new paths for LLM training. We will explore the downstream capabilities of the models in a future write-up. Until then, ciao.