A novel approach to interpreting transformer decoder models with equivalent linear reconstruction and decomposition.
Transactions on Machine Learning Research (TMLR), October 2025
NeurIPS Mechanistic Interpretability Workshop 2025
James R. Golden
We demonstrate that large language models can be mapped to equivalent linear systems for any given input sequence, without modifying model weights or altering predictions. We achieve this through strategic gradient computation modifications that create "detached Jacobians", which are linear representations that capture the complete forward computation.
- Reconstruction: The detached Jacobian linearly reconstructs the output embedding, where the subsequent token probabilities pass torch.allclose at $10^{-14}$
- Interpretability: Reveals semantic concepts emerging in model layers through the singular value decomposition
- Efficiency: Enables analysis of up to 14B parameter models (Qwen 3 14B, Gemma 3 12B, Llama 3.1 8B) passing torch.allclose at $10^{-14}$
- Different models: Works across model families (Qwen 3, Gemma 3, Llama 3, Phi 4, Mistral Ministral, OLMo 2)
Our approach exploits a fundamental structural property of transformer architectures wherein every operation (gated activations, attention, and normalization) can be expressed as
For example,
where the "detached Jacobian" J
- Normalization: Detach variance computation from gradient path
- Activations: Freeze nonlinear terms in SwiGLU/GELU/Swish functions
- Attention: Detach softmax operation while preserving linear $V$ multiplication
- Analysis: Apply SVD to understand learned representations and semantic emergence
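The normalization and attention steps above can be illustrated with a minimal PyTorch sketch. This is not the repository's implementation: the names `rmsnorm_detached`, `attention_detached`, and `block` are hypothetical, and the toy attention is single-head with no causal mask or positional encoding. The point is only that once the variance and softmax terms are detached, the autograd Jacobian with respect to the input reconstructs the output exactly.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64  # double precision keeps the reconstruction error tiny

def rmsnorm_detached(x, weight, eps=1e-6):
    # The RMS statistic is detached, so it acts as a constant scale on x.
    scale = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps).detach()
    return weight * x * scale

def attention_detached(x, w_q, w_k, w_v):
    # x has shape (seq, d). The softmax weights are detached; only the
    # linear V path carries gradient back to x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    probs = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1).detach()
    return probs @ v

seq, d = 4, 8
x = torch.randn(seq, d, dtype=dtype)
weight = torch.randn(d, dtype=dtype)
w_q, w_k, w_v = (torch.randn(d, d, dtype=dtype) for _ in range(3))

def block(inp):
    # Toy "layer": detached RMSNorm followed by detached attention.
    return attention_detached(rmsnorm_detached(inp, weight), w_q, w_k, w_v)

y = block(x)

# Jacobian of the output w.r.t. the input, shape (seq, d, seq, d).
J = torch.autograd.functional.jacobian(block, x)

# Contract the Jacobian with the input: a purely linear reconstruction.
y_lin = torch.einsum("ijkl,kl->ij", J, x)
reconstructs = torch.allclose(y, y_lin, rtol=0, atol=1e-10)
print("linear reconstruction matches forward pass:", reconstructs)
```

The forward values are unchanged by `detach()`; only the gradient path differs, which is why the same weights and predictions are preserved.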
Fig. 1: The equivalent linear path through the
- Qwen 3 (8B - 32B parameters)
- DeepSeek R1 0528 Qwen 3 (8B parameters)
- Gemma 3 (4B - 12B - 27B parameters)
- Llama 3 (3B - 8B - 70B parameters)
- Phi 4 (3B - 14B parameters)
- Mistral Ministral (8B parameters)
- OLMo 2 (8B parameters)
- Low-rank structure: Models operate in extremely low-dimensional subspaces
- Concept emergence: Semantic concepts appear in later transformer layers
- Token relationships: Singular vectors decode to semantically relevant input/output tokens
- Steering applications: Detached Jacobians enable efficient concept steering
Our analysis reveals:
- Top singular vectors decode to concepts like "Golden", "bridge", "highway"
- Layer-by-layer emergence of geographic and infrastructure concepts
- Extremely sparse activation patterns with few dominant features
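How singular vectors might be decoded to tokens can be sketched with NumPy. Everything here is a toy assumption: the six-word `vocab`, the random `E_in` (input embedding table), `W_out` (unembedding matrix), and `J` (a stand-in for one token's detached Jacobian block) are hypothetical, so the printed tokens carry no meaning. Ranking by absolute score is a choice made here because singular vectors are only defined up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Golden", "bridge", "highway", "the", "of", "and"]  # hypothetical toy vocabulary
d_model = 16

E_in = rng.standard_normal((len(vocab), d_model))   # toy input embedding table
W_out = rng.standard_normal((len(vocab), d_model))  # toy unembedding matrix
J = rng.standard_normal((d_model, d_model))         # stand-in for a detached Jacobian block

# Columns of U live in output-embedding space, rows of Vt in input-embedding space.
U, S, Vt = np.linalg.svd(J)

def top_tokens(vector, table, k=3):
    """Rank vocabulary entries by the magnitude of their dot product with a vector."""
    scores = table @ vector
    order = np.argsort(-np.abs(scores))[:k]
    return [vocab[i] for i in order]

for i in range(3):
    print(
        f"sigma={S[i]:.3f}",
        "input tokens:", top_tokens(Vt[i], E_in),
        "output tokens:", top_tokens(U[:, i], W_out),
    )
```

With a real model, `E_in` and `W_out` would be replaced by the model's own embedding and unembedding weights, and `J` by the detached Jacobian for the prompt.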
A Hugging Face token with model access is required. The code below runs on a free Colab T4 instance.
import os
from google.colab import userdata
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = '1'
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
os.system('git clone https://github.com/jamesgolden1/llms-are-llms.git')
os.chdir('llms-are-llms')
os.system('pip install -r requirements.txt --no-deps')
os.system(f'python -u run_detached_jacobian.py --hf_token {os.environ["HF_TOKEN"]} --model_name "llama-3.2-3b" --text "The Golden"')
Interpretability
- Concept Analysis: Understand what drives model predictions
- Layer Dynamics: Track semantic emergence through transformer layers
- Feature Importance: Identify key input tokens and concepts for next-token prediction
Fig. 2: Results for DeepSeek R1 0528 Qwen 3 8B.
Model Steering
- Efficient Control: Steer model outputs using detached Jacobians
- Concept Injection: Inject specific concepts (e.g., "Golden Gate Bridge") into continuations
- Safety Applications: Detect and potentially mitigate bias or toxic content
Table 1: Steering results across models.
Research Tools
- Dimensionality Analysis: Measure effective dimensionality of learned representations
- Cross-model Comparisons: Compare semantic structures across model families
- Ablation Studies: Understand token contributions to output token prediction
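For the dimensionality analysis item above, two common summaries of a singular value spectrum are the stable rank and the number of components needed to reach a given share of the squared spectrum. The sketch below is one possible way to compute them; the function names `stable_rank` and `rank_for_energy` are hypothetical and are not taken from the repository, and the source does not state which measure it uses.

```python
import numpy as np

def stable_rank(singular_values):
    """Sum of squared singular values divided by the largest squared value."""
    s2 = np.asarray(singular_values, dtype=np.float64) ** 2
    return float(s2.sum() / s2.max())

def rank_for_energy(singular_values, fraction=0.99):
    """Smallest number of components whose squared values reach `fraction` of the total."""
    s2 = np.sort(np.asarray(singular_values, dtype=np.float64) ** 2)[::-1]
    cumulative = np.cumsum(s2) / s2.sum()
    count = int(np.searchsorted(cumulative, fraction) + 1)
    return min(count, len(s2))  # guard against rounding at fraction close to 1

# Toy spectra: a flat spectrum uses every direction, a rank-one spectrum uses one.
flat = np.ones(4)
rank_one = np.array([5.0, 0.0, 0.0])
print(stable_rank(flat), rank_for_energy(flat))
print(stable_rank(rank_one), rank_for_energy(rank_one))
```

Applied to the singular values of a detached Jacobian, small values of either measure relative to the hidden size would indicate a low-dimensional operating subspace.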
For long sequences, a matrix-free Lanczos iteration computes the top-k singular vectors of the detached Jacobian without generating the full matrix. The JAX implementation for Gemma 3 4B uses the matfree package and handles a 400-token input sequence with 40GB of VRAM.
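The matrix-free idea can be sketched without JAX or matfree. The example below swaps in SciPy's `svds`, which uses ARPACK's implicitly restarted Lanczos method, and is therefore an illustration of the technique rather than the repository's code. The solver only sees two callbacks: `matvec` (which would be a Jacobian-vector product in the real setting) and `rmatvec` (a vector-Jacobian product). The toy matrix `A` is an assumption, kept in memory here only so the result can be checked against a dense SVD.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

rng = np.random.default_rng(0)
m, n, k = 200, 300, 5

# Toy matrix with a decaying spectrum. In the real setting the Jacobian is
# never formed; only products with vectors are available.
A = (
    rng.standard_normal((m, 20))
    @ np.diag(0.5 ** np.arange(20))
    @ rng.standard_normal((20, n))
)

J_op = LinearOperator(
    shape=(m, n),
    matvec=lambda v: A @ v,      # stands in for a Jacobian-vector product
    rmatvec=lambda u: A.T @ u,   # stands in for a vector-Jacobian product
    dtype=np.float64,
)

# Top-k singular triplets from operator products only.
U, S, Vt = svds(J_op, k=k)

# svds does not promise descending order, so sort explicitly.
order = np.argsort(-S)
U, S, Vt = U[:, order], S[order], Vt[order]

# Dense reference, feasible only because this example is small.
S_dense = np.linalg.svd(A, compute_uv=False)[:k]
matches_dense = np.allclose(S, S_dense, rtol=1e-6)
print("top singular values:", S)
print("agree with dense SVD:", matches_dense)
```

Memory use scales with the number of stored vectors rather than with the full Jacobian, which is what makes long input sequences tractable.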
This code snippet shows how components of the Qwen 3 MLP are frozen at inference to reveal its equivalent linear mapping for a given input sequence. The output is the same as that of the original function; only the gradient at inference is changed.
The detach() call in the else clause holds the gate activation constant, which makes the function linear in x as far as the gradient is concerned.
from torch import nn
from transformers.activations import ACT2FN


class Qwen3MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        if self.training:
            # Standard gated MLP: gradient flows through both the gate and up paths.
            down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        else:
            # Same forward values, but the gate activation is detached, so the
            # gradient w.r.t. x only flows through the linear up_proj/down_proj path.
            down_proj = self.down_proj(self.act_fn(self.gate_proj(x)).clone().detach() * self.up_proj(x))
        return down_proj
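The effect of the detached gate can be checked numerically on a small stand-in module. The sketch below assumes a hypothetical `ToyGatedMLP` with a SiLU gate and arbitrary small dimensions; it mirrors the structure of the MLP above but is not the model code. It compares the Jacobian with and without the detach and tests whether each one reconstructs the output when multiplied by the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGatedMLP(nn.Module):
    """Hypothetical small stand-in for a gated MLP (SiLU gate, no biases)."""

    def __init__(self, hidden_size=16, intermediate_size=32):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x, detach_gate=True):
        gate = F.silu(self.gate_proj(x))
        if detach_gate:
            gate = gate.detach()  # freeze the nonlinear term for the gradient
        return self.down_proj(gate * self.up_proj(x))

torch.manual_seed(0)
mlp = ToyGatedMLP().double().eval()
x = torch.randn(16, dtype=torch.float64)

y = mlp(x)  # forward values do not depend on the detach flag

J_detached = torch.autograd.functional.jacobian(lambda v: mlp(v, True), x)
J_standard = torch.autograd.functional.jacobian(lambda v: mlp(v, False), x)

# Only the detached Jacobian satisfies J @ x == y.
detached_ok = torch.allclose(J_detached @ x, y, rtol=0, atol=1e-12)
standard_ok = torch.allclose(J_standard @ x, y, rtol=0, atol=1e-12)
print("detached Jacobian reconstructs output:", detached_ok)
print("standard Jacobian reconstructs output:", standard_ok)
```

The standard Jacobian picks up an extra term from differentiating the gate, which is why it fails to reconstruct the output while the detached one succeeds.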
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
This work builds on foundational research in:
- Transformer interpretability (Elhage et al., 2021)
- Locally linear ReLU neural networks (Mohan et al., 2019)
- Diffusion model linearity (Kadkhodaie et al., 2023)



