
🧠 Reproducing "Super Weights" in Large Language Models

Paper: The Super Weight in Large Language Models
Authors: Mengxia Yu, De Wang, Qi Shan, Colorado J. Reed, Alvin Wan
Affiliation: Apple & University of Notre Dame
arXiv: 2411.07191v2 (July 2025)


🧩 1. Background

Large Language Models (LLMs) often exhibit outlier weights and activations: values with extremely large magnitudes that strongly influence model quality.
This paper identifies a single scalar parameter, termed a Super Weight (SW), whose removal alone can destroy a model's ability to generate text.

Key findings

  • Pruning one scalar in Llama-7B drops zero-shot accuracy to random-guessing levels.
  • The same weight induces a Super Activation (SA): a huge activation spike that persists across layers.
  • Both SW and SA can be found data-free, with a single forward pass.
  • Preserving them dramatically improves quantization quality.

🧠 2. Conceptual Overview

| Term | Description |
| --- | --- |
| Super Weight (SW) | A single, extremely important scalar weight in mlp.down_proj of an early transformer block. |
| Super Activation (SA) | The massive activation generated by the SW; it propagates through later layers via skip connections. |
| Effect of pruning the SW | The model generates gibberish; perplexity rises by roughly ×1000 and zero-shot accuracy drops by ≈35 points. |
| Effect of restoring the SA | Recovers ≈40% of the performance loss, showing the SW acts partly through the SA. |

โš™๏ธ 3. How to Find Super Weights (Data-Free Method)

Step 1 โ€” Locate MLP Layers

In each Transformer block, focus on the MLP down-projection (mlp.down_proj) module.

Step 2: Forward Pass

Run one forward pass with any prompt (no dataset is required):

prompt = "My favorite food is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model(**inputs)

Step 3: Record Activations

Hook the input and output of each down_proj layer to capture activations:

activations = {}

def make_hook(layer_id):
    def hook_fn(module, inp, out):
        # inp is a tuple of positional inputs; keep only the activation tensors
        activations[layer_id] = (inp[0].detach(), out.detach())
    return hook_fn

handles = [layer.mlp.down_proj.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]
with torch.no_grad():
    model(**inputs)
for h in handles:
    h.remove()  # remove the hooks so later passes are unaffected

Step 4: Find Activation Spikes

For each layer, compute maximum absolute values per channel:

max_in = inp.abs().amax(dim=(0, 1))   # peak per input channel, over batch and sequence
max_out = out.abs().amax(dim=(0, 1))  # peak per output channel

Plot or inspect these per-channel peaks across layers (Figure 3 of the paper).
A layer with a sharp activation spike indicates the presence of a Super Weight.

Step 5: Determine Coordinates

  • Row index = the output channel with the largest peak (max_out.argmax())
  • Column index = the input channel with the largest peak (max_in.argmax())
  • The Super Weight is then:
    model.model.layers[layer_id].mlp.down_proj.weight[row, col]

Example (Llama-7B): model.model.layers[2].mlp.down_proj.weight[3968, 7003].


🧮 4. Mathematical Explanation

For a down-projection layer \( Y = X W^\top \), a dominant super activation \( Y_{ij} \) is produced almost entirely by a single large input-weight product \( X_{ik} W_{jk} \).
Detecting the indices of the extreme \( X_{ik} \) and \( Y_{ij} \) therefore pinpoints the Super Weight \( W_{jk} \).
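The index logic can be checked on a toy example (synthetic tensors with a planted outlier, not real model weights):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)          # activations: (tokens, in_features)
W = torch.randn(6, 8) * 0.01   # down_proj-style weight: (out_features, in_features)
W[3, 5] = 100.0                # plant a "super weight" at row 3, col 5
X[2, 5] = 50.0                 # and a large input on the matching channel

Y = X @ W.T                    # (tokens, out_features)

row = Y.abs().amax(dim=0).argmax().item()  # output channel with the spike
col = X.abs().amax(dim=0).argmax().item()  # input channel with the spike
print(row, col)  # → 3 5
```

The spikes in Y and X recover exactly the planted coordinates, mirroring Step 5 above.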


📋 5. Known Super Weight Coordinates (Table 2)

| Model | Layer(s) | Module | Coordinates [row, col] |
| --- | --- | --- | --- |
| Llama-7B | 2 | mlp.down_proj | [3968, 7003] |
| Llama-13B | 2 | mlp.down_proj | [2231, 2278], [2231, 6939] |
| Llama-30B | 3 / 10 | mlp.down_proj | [5633, 12817], [5633, 17439], [5633, 14386] |
| Llama-2 7B | 1 | mlp.down_proj | [2533, 7890] |
| Mistral-7B | 1 | mlp.down_proj | [2070, 7310] |
| OLMo-7B | 1 / 2 / 7 / 24 | mlp.down_proj | [269, 7467], [269, 8275], [269, 453], [269, 2300] |
| Phi-3 mini-4k-instruct | 2 / 4 | mlp.down_proj | [525, 808], [1113, 2723], … |
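For scripting against these models, the unambiguous rows of the table can be kept in a small lookup (the dictionary keys are illustrative names; rows whose layer-to-coordinate pairing is not explicit in the table are omitted, and the helper assumes a Llama-style module tree):

```python
# (layer, row, col) triples in mlp.down_proj, copied from Table 2 of the paper
SUPER_WEIGHTS = {
    "Llama-7B":   [(2, 3968, 7003)],
    "Llama-13B":  [(2, 2231, 2278), (2, 2231, 6939)],
    "Llama-2-7B": [(1, 2533, 7890)],
    "Mistral-7B": [(1, 2070, 7310)],
}

def read_super_weight(model, layer, row, col):
    # Assumes model.model.layers[i].mlp.down_proj, as in Llama-family models
    return model.model.layers[layer].mlp.down_proj.weight[row, col].item()
```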

🧪 6. Verification Procedure

✅ Step A: Pruning Test

row, col = 3968, 7003
with torch.no_grad():
    model.model.layers[2].mlp.down_proj.weight[row, col] = 0.0

Then generate text:

inputs = tokenizer("My favorite condiment is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))

If the output degenerates into gibberish, the Super Weight has been found.

✅ Step B: Super Activation Restoration

Record the Super Activation value before pruning, then write it back into the forward pass after pruning; a partial recovery of quality confirms that the SW acts largely through the SA.
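A minimal sketch of this restoration test, with the Llama-7B coordinates as defaults. The tap point (the down_proj output channel, patched at every token position) is a simplification: the paper restores the super activation at its specific position, so treat this as illustrative rather than the authors' exact procedure.

```python
import torch

def prune_and_restore_sa(model, inputs, layer_id=2, row=3968, col=7003):
    """Record the SA, prune the SW, then patch the SA back at inference time."""
    proj = model.model.layers[layer_id].mlp.down_proj
    recorded = {}

    def record(module, inp, out):
        recorded["sa"] = out[0, -1, row].item()  # sample the SA channel

    h = proj.register_forward_hook(record)
    with torch.no_grad():
        model(**inputs)                          # 1. record with the weight intact
    h.remove()

    with torch.no_grad():
        proj.weight[row, col] = 0.0              # 2. prune the super weight

    def patch(module, inp, out):
        out[..., row] = recorded["sa"]           # 3. write the SA back (simplified)
        return out

    return proj.register_forward_hook(patch)    # keep the handle to remove later
```

Generating text with the patch hook active should recover part of the lost quality, per the paper's ≈40 % figure.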


⚡ 7. Practical PyTorch Snippet

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Hello world"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

def find_super_weight(model, inputs, threshold=100.0):
    # threshold is an illustrative magnitude cutoff, not a value from the paper;
    # super activations are far larger than typical activations
    def make_hook(i):
        def hook(module, inp, out):
            x, y = inp[0].detach(), out.detach()
            max_in = x.abs().amax(dim=(0, 1))   # peak per input channel
            max_out = y.abs().amax(dim=(0, 1))  # peak per output channel
            if max_in.max() > threshold and max_out.max() > threshold:
                print(f"[Layer {i}] super weight candidate at "
                      f"({max_out.argmax().item()}, {max_in.argmax().item()})")
        return hook

    handles = [layer.mlp.down_proj.register_forward_hook(make_hook(i))
               for i, layer in enumerate(model.model.layers)]
    with torch.no_grad():
        model(**inputs)
    for h in handles:
        h.remove()

find_super_weight(model, inputs)

📈 8. Interpretation & Use Cases

| Use Case | Effect of Preserving SW/SA |
| --- | --- |
| Quantization | Lifts simple round-to-nearest (INT4/INT8) to ≈70–80 % of SmoothQuant quality. |
| Model compression | Allows larger quantization block sizes (e.g., 512×512) with less degradation. |
| Explainability | Shows that a few weights govern semantic token probabilities (e.g., stopword suppression). |
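The quantization row can be sketched with a minimal round-to-nearest absmax quantizer that holds the super weight out of the scale computation and restores it afterwards. This is a simplification of the paper's method, which additionally clips other outliers and uses block-wise scales:

```python
import torch

def rtn_quantize(w, sw_index=None, bits=4):
    """Round-to-nearest absmax (de)quantization of a 1-D weight vector.
    If sw_index is given, that scalar is excluded from the scale and
    restored in full precision afterwards."""
    w = w.clone()
    held = None
    if sw_index is not None:
        held = w[sw_index].item()
        w[sw_index] = 0.0                      # keep the outlier out of the scale
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    deq = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    if held is not None:
        deq[sw_index] = held                   # restore the super weight exactly
    return deq

w = torch.tensor([0.10, -0.30, 0.20, 8.00])    # one huge outlier at index 3
naive = rtn_quantize(w)                        # outlier stretches the scale
kept = rtn_quantize(w, sw_index=3)             # ordinary weights keep precision
```

In the naive version the outlier dominates the scale and the three small weights all round to zero; holding it out preserves them, which is the intuition behind protecting the SW during quantization.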

🧭 9. Summary

  • 🧩 Super Weights exist: a few scalars dominate LLM behavior.
  • ⚙️ They can be found data-free with a single forward pass.
  • ⚡ Preserving them is vital for model compression and quantization.
  • 📊 The authors provide an index of Super Weight coordinates for open LLMs.

📚 10. References

  • Yu et al., "The Super Weight in Large Language Models," arXiv:2411.07191v2, 2025.
  • Sun et al., "Massive Activations in Large Language Models," ICLR Workshop 2024.
  • Frantar et al. (GPTQ), Lin et al. (AWQ), and Xiao et al. (SmoothQuant), 2022–2024.