Reproducing "Super Weights" in Large Language Models
Paper: The Super Weight in Large Language Models
Authors: Mengxia Yu, De Wang, Qi Shan, Colorado J. Reed, Alvin Wan
Affiliation: Apple & University of Notre Dame
arXiv: 2411.07191v2 (July 2025)
1. Background
Large Language Models (LLMs) often exhibit outlier weights and activations: values with extremely large magnitudes that strongly influence model quality.
This paper identifies a single scalar parameter, termed a Super Weight (SW), whose removal alone can destroy a model's ability to generate text.
Key findings
- Pruning one scalar in Llama-7B causes zero-shot accuracy to drop to random guessing.
- The same weight induces a Super Activation (SA): a huge activation spike that persists across layers.
- Both SW and SA can be found data-free, with a single forward pass.
- Preserving them dramatically improves quantization quality.
2. Conceptual Overview
| Term | Description |
|---|---|
| Super Weight (SW) | A single extremely important weight (scalar) in mlp.down_proj of an early transformer block. |
| Super Activation (SA) | The corresponding massive activation value generated by SW; propagates via skip connections. |
| Effect of Pruning SW | Model generates gibberish; perplexity increases by ~1000×; zero-shot accuracy drops by ~35 points. |
| Effect of Restoring SA | Recovers ~40% of the performance loss, showing the SW acts partly through the SA. |
3. How to Find Super Weights (Data-Free Method)
Step 1: Locate MLP Layers
In each Transformer block, focus on the MLP down-projection (mlp.down_proj) module.
Step 2: Forward Pass
Run one forward pass with any prompt (no dataset required):

```python
prompt = "My favorite food is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model(**inputs)
```
Step 3: Record Activations
Hook the input and output of each down_proj layer to capture activations:

```python
activations = {}

def hook_fn(module, inp, out):
    # inp is a tuple of positional inputs; store this layer's input and output
    activations[module] = (inp[0].detach(), out.detach())

hooks = [layer.mlp.down_proj.register_forward_hook(hook_fn)
         for layer in model.model.layers]
with torch.no_grad():
    model(**inputs)
for h in hooks:  # remove hooks so later passes stay clean
    h.remove()
```
Step 4: Find Activation Spikes
For each layer's recorded `(inp, out)` pair, compute the maximum absolute value per channel (reducing over the batch and sequence dimensions):

```python
max_in = inp.abs().flatten(0, -2).amax(dim=0)   # one value per input channel
max_out = out.abs().flatten(0, -2).amax(dim=0)  # one value per output channel
```

Plot or inspect their peaks across layers (Figure 3 of the paper).
A layer with a sharp activation spike indicates the presence of a Super Weight.
Step 5: Determine Coordinates
- Row index = channel of the maximum output (`max_out.argmax()` → row).
- Column index = channel of the maximum input (`max_in.argmax()` → col).
- The Super Weight is `model.model.layers[layer_id].mlp.down_proj.weight[row, col]`.

Example (Llama-7B): `model.model.layers[2].mlp.down_proj.weight[3968, 7003]`.
4. Mathematical Explanation
For the down-projection layer

$$
Y = X W^\top, \qquad Y_{ij} = \sum_k X_{ik} W_{jk},
$$

if a super activation $Y_{ij}$ is dominant, it is produced mainly by one large input-weight product $X_{ik} W_{jk}$.
Detecting the indices of the extreme $X_{ik}$ and $Y_{ij}$ therefore reveals the Super Weight $W_{jk}$.
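This index-recovery argument can be checked on a toy tensor. The planted coordinates (j=3, k=5) and magnitudes below are arbitrary, chosen only to make the spike unambiguous:

```python
import torch

torch.manual_seed(0)
n_tokens, d_in, d_out = 4, 8, 6
X = torch.randn(n_tokens, d_in)
W = torch.randn(d_out, d_in) * 0.1

# Plant a large weight at (j=3, k=5) and a large input in channel k=5.
W[3, 5] = 100.0
X[:, 5] += 10.0

Y = X @ W.T  # down-projection: Y = X W^T

# Data-free recovery: the spiking input channel gives k,
# the spiking output channel gives j.
col = X.abs().amax(dim=0).argmax().item()  # k = 5
row = Y.abs().amax(dim=0).argmax().item()  # j = 3
print(row, col)
```

The same two argmax calls are all that Step 5 above performs on the real model's recorded activations.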
5. Known Super Weight Coordinates (Table 2)
| Model | Layer | Type | Coordinates [row, col] |
|---|---|---|---|
| Llama-7B | 2 | mlp.down_proj | [3968, 7003] |
| Llama-13B | 2 | mlp.down_proj | [2231, 2278], [2231, 6939] |
| Llama-30B | 3 / 10 | mlp.down_proj | [5633, 12817], [5633, 17439], [5633, 14386] |
| Llama-2 7B | 1 | mlp.down_proj | [2533, 7890] |
| Mistral-7B | 1 | mlp.down_proj | [2070, 7310] |
| OLMo-7B | 1 / 2 / 7 / 24 | mlp.down_proj | [269, 7467], [269, 8275], [269, 453], [269, 2300] |
| Phi-3 mini-4k-instruct | 2 / 4 | mlp.down_proj | [525, 808], [1113, 2723], … |
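For scripting, the table can be captured as a lookup keyed by checkpoint name. The Hugging Face repo IDs below are my guesses at the matching checkpoints, not names given in the paper, and the helper is a minimal sketch:

```python
import torch

# Coordinates from Table 2: {checkpoint: [(layer, row, col), ...]}
# Repo IDs are assumed, not taken from the paper.
SUPER_WEIGHTS = {
    "huggyllama/llama-7b": [(2, 3968, 7003)],
    "meta-llama/Llama-2-7b-hf": [(1, 2533, 7890)],
    "mistralai/Mistral-7B-v0.1": [(1, 2070, 7310)],
}

def prune_super_weights(model, coords):
    """Zero out the listed super weights in place (e.g. for the pruning test)."""
    with torch.no_grad():
        for layer, row, col in coords:
            model.model.layers[layer].mlp.down_proj.weight[row, col] = 0.0
```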
6. Verification Procedure
Step A: Pruning Test

```python
row, col = 3968, 7003  # Llama-7B super weight (Table 2)
with torch.no_grad():
    model.model.layers[2].mlp.down_proj.weight[row, col] = 0.0
```

Then generate text:

```python
inputs = tokenizer("My favorite condiment is", return_tensors="pt").to(model.device)
out_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```

If the output becomes gibberish, the SW has been found successfully.
Step B: Super Activation Restoration
Record the super activation's value before pruning, then restore it manually after pruning to verify partial recovery.
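The record-and-restore step can be sketched with a forward hook on a stand-in linear layer; the sizes and the coordinate (3, 5) are illustrative only, and the same hook pattern applies to the real `mlp.down_proj`:

```python
import torch
import torch.nn as nn

# Toy stand-in for mlp.down_proj.
down_proj = nn.Linear(8, 6, bias=False)
with torch.no_grad():
    down_proj.weight[3, 5] = 100.0  # pretend this is the super weight
x = torch.zeros(1, 8)
x[0, 5] = 10.0                      # input spike in the matching channel

# 1. Record the super activation before pruning.
sa_value = down_proj(x)[0, 3].item()

# 2. Prune the super weight.
with torch.no_grad():
    down_proj.weight[3, 5] = 0.0

# 3. Re-inject the recorded value with a forward hook on the output channel.
def restore_sa(module, inp, out):
    out[..., 3] = sa_value
    return out

handle = down_proj.register_forward_hook(restore_sa)
restored = down_proj(x)[0, 3].item()  # equals sa_value again
handle.remove()
```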
7. Practical PyTorch Snippet

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)

def find_super_weight(model, inputs, threshold=50.0):
    """Flag down_proj layers whose input and output both spike past `threshold`."""
    hooks = []

    def make_hook(i):
        def hook(module, inp, out):
            x = inp[0].detach().float().flatten(0, -2)  # [tokens, d_intermediate]
            y = out.detach().float().flatten(0, -2)     # [tokens, d_model]
            max_in, col = x.abs().amax(dim=0).max(dim=0)
            max_out, row = y.abs().amax(dim=0).max(dim=0)
            if max_in > threshold and max_out > threshold:
                print(f"[Layer {i}] Super weight candidate at ({row.item()}, {col.item()})")
        return hook

    for i, layer in enumerate(model.model.layers):
        hooks.append(layer.mlp.down_proj.register_forward_hook(make_hook(i)))
    with torch.no_grad():
        model(**inputs)
    for h in hooks:
        h.remove()

find_super_weight(model, inputs)
```
8. Interpretation & Use Cases
| Use Case | Effect of Preserving SW/SA |
|---|---|
| Quantization | Lifts simple round-to-nearest (INT4/INT8) to ~70-80% of SmoothQuant quality. |
| Model Compression | Allows larger block sizes (e.g., 512×512) with less degradation. |
| Explainability | Reveals that a few weights govern semantic token probabilities (stopword suppression). |
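The quantization use case can be illustrated with a minimal round-to-nearest sketch that holds the super weight out of the scale computation and restores it in full precision afterward. This is a simplification of the paper's recipe, not its exact implementation:

```python
import torch

def rtn_quantize_preserve_sw(w, sw_coords, n_bits=4):
    """Round-to-nearest quantization that keeps listed super weights exact.
    A sketch of the idea, not the paper's exact method."""
    w_rest = w.clone()
    for r, c in sw_coords:
        w_rest[r, c] = 0.0                  # exclude the SW from the scale
    qmax = 2 ** (n_bits - 1) - 1
    scale = w_rest.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_q = q * scale
    for r, c in sw_coords:
        w_q[r, c] = w[r, c]                 # restore the SW in full precision
    return w_q

torch.manual_seed(0)
w = torch.randn(8, 8)
w[1, 2] = 50.0                              # planted outlier weight
w_q = rtn_quantize_preserve_sw(w, [(1, 2)])
```

Excluding the outlier from the scale keeps the quantization grid fine for the remaining weights, which is the point of handling the SW separately.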
9. Summary
- Super Weights exist: a handful of scalars dominate LLM behavior.
- They can be found with a single forward pass, no data required.
- Preserving them is vital for model compression and quantization.
- The authors released a directory of SW coordinates for open LLMs.
10. References
- Yu et al., "The Super Weight in Large Language Models," arXiv:2411.07191v2, 2025.
- Sun et al., "Massive Activations in Large Language Models," ICLR Workshop, 2024.
- Frantar et al. (GPTQ, 2022); Lin et al. (AWQ, 2023); Xiao et al. (SmoothQuant, 2022).