# 🧠 Reproducing "Super Weights" in Large Language Models

**Paper:** *The Super Weight in Large Language Models*
**Authors:** Mengxia Yu, De Wang, Qi Shan, Colorado J. Reed, Alvin Wan
**Affiliation:** Apple & University of Notre Dame
**arXiv:** [2411.07191v2 (July 2025)](https://arxiv.org/abs/2411.07191)

---

## 🧩 1. Background

Large Language Models (LLMs) often exhibit *outlier weights and activations*: values with extremely large magnitudes that strongly influence model quality. This paper identifies a **single scalar parameter**, termed a **Super Weight (SW)**, whose removal alone can **destroy a model's ability to generate text**.

### Key findings

- Pruning **one scalar** in Llama-7B drops zero-shot accuracy to near random guessing.
- The same weight induces a **Super Activation (SA)**: a huge activation spike that persists across layers.
- Both SW and SA can be found **data-free**, with a single forward pass.
- Preserving them dramatically improves **quantization quality**.

---

## 🧠 2. Conceptual Overview

| Term | Description |
|------|-------------|
| **Super Weight (SW)** | A single extremely important weight (scalar) in `mlp.down_proj` of an early transformer block. |
| **Super Activation (SA)** | The corresponding massive activation value generated by the SW; propagates via skip connections. |
| **Effect of Pruning SW** | Model generates gibberish; perplexity rises by roughly 1000×; zero-shot accuracy falls by ≈ 35 points. |
| **Effect of Restoring SA** | Recovers ≈ 40 % of the performance loss, showing that the SW acts partly through the SA. |

---

## ⚙️ 3. How to Find Super Weights (Data-Free Method)

### Step 1: Locate MLP Layers

In each transformer block, focus on the **MLP down-projection** (`mlp.down_proj`) module.

### Step 2: Forward Pass

Run one forward pass with any prompt (no dataset required):

```python
prompt = "My favorite food is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model(**inputs)
```

### Step 3: Record Activations

Hook the input and output of each `down_proj` layer to capture activations:

```python
activations = {}

def make_hook(idx):
    def hook_fn(module, inp, out):
        # inp is a tuple of positional inputs; keep the activation tensors
        activations[idx] = (inp[0].detach(), out.detach())
    return hook_fn

handles = [layer.mlp.down_proj.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]
model(**inputs)          # one forward pass fills `activations`
for h in handles:
    h.remove()
```

### Step 4: Find Activation Spikes

For each hooked layer, compute the maximum absolute value per channel:

```python
for idx, (inp, out) in activations.items():
    # flatten the (batch, seq) positions, keep the channel dimension
    max_in = inp.abs().flatten(0, -2).max(dim=0).values
    max_out = out.abs().flatten(0, -2).max(dim=0).values
```

Plot or inspect their peaks across layers (Figure 3 of the paper). A layer with a **sharp activation spike** indicates the presence of a Super Weight.

### Step 5: Determine Coordinates

- Row index = the output channel with the spike (`max_out.argmax()`)
- Column index = the input channel with the spike (`max_in.argmax()`)
- The Super Weight is then:

```
model.model.layers[layer_id].mlp.down_proj.weight[row, col]
```

Example (Llama-7B): `model.model.layers[2].mlp.down_proj.weight[3968, 7003]`.

---

## 🧮 4. Mathematical Explanation

For the down-projection layer:

\[
Y = X W^T, \qquad Y_{ij} = \sum_k X_{ik} W_{jk}
\]

If a super activation \( Y_{ij} \) is dominant, it is mainly produced by one large input–weight pair \( (X_{ik}, W_{jk}) \). Detecting the indices of the extreme \( X_{ik} \) and \( Y_{ij} \) therefore reveals the Super Weight \( W_{jk} \); a code sketch of this lookup follows below.

---
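The sketch below ties the identity above to the coordinate lookup of Step 5. It assumes the `model` and the `activations` dictionary captured in Step 3; the function name `locate_super_weight` and the assumption that the spiking layer index is already known are illustrative, not from the paper.

```python
def locate_super_weight(model, activations, layer_id):
    """Turn the spiking channels of one down_proj layer into a weight coordinate,
    following Y[i, j] = sum_k X[i, k] * W[j, k]."""
    inp, out = activations[layer_id]                        # captured by the Step 3 hooks
    max_in = inp.abs().flatten(0, -2).max(dim=0).values     # per input channel k
    max_out = out.abs().flatten(0, -2).max(dim=0).values    # per output channel j
    col = max_in.argmax().item()    # k holding the extreme X[i, k]
    row = max_out.argmax().item()   # j holding the extreme Y[i, j]
    weight = model.model.layers[layer_id].mlp.down_proj.weight
    print(f"layer {layer_id}: down_proj.weight[{row}, {col}] = {weight[row, col].item():.3f}")
    return row, col

# e.g. locate_super_weight(model, activations, layer_id=2) for Llama-7B
```

---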
## 📋 5. Known Super Weight Coordinates (Table 2)

| Model | Layer | Type | Coordinates [row, col] |
|:------|:------|:------|:------|
| **Llama-7B** | 2 | mlp.down_proj | [3968, 7003] |
| **Llama-13B** | 2 | mlp.down_proj | [2231, 2278], [2231, 6939] |
| **Llama-30B** | 3 / 10 | mlp.down_proj | [5633, 12817], [5633, 17439], [5633, 14386] |
| **Llama-2 7B** | 1 | mlp.down_proj | [2533, 7890] |
| **Mistral-7B** | 1 | mlp.down_proj | [2070, 7310] |
| **OLMo-7B** | 1 / 2 / 7 / 24 | mlp.down_proj | [269, 7467], [269, 8275], [269, 453], [269, 2300] |
| **Phi-3 mini-4k-instruct** | 2 / 4 | mlp.down_proj | [525, 808], [1113, 2723], … |

---

## 🧪 6. Verification Procedure

### ✅ Step A: Pruning Test

```python
row, col = 3968, 7003
with torch.no_grad():   # parameters require grad by default, so edit them under no_grad
    model.model.layers[2].mlp.down_proj.weight[row, col] = 0
```

Then generate text:

```python
inputs = tokenizer("My favorite condiment is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True))
```

If the output degenerates into gibberish, the Super Weight has been found.

### ✅ Step B: Super Activation Restoration

Record the super activation value before pruning, then restore it manually after pruning to verify partial recovery (a minimal sketch is given in the appendix at the end of this document).

---

## ⚡ 7. Practical PyTorch Snippet

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Hello world"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

def find_super_weight(model, inputs, threshold=50.0):
    """Hook every mlp.down_proj, run one forward pass, and report layers whose
    input and output both show an unusually large activation. `threshold` is a
    heuristic: super activations are far larger than typical values; tune per model."""
    stats, handles = {}, []

    def make_hook(idx):
        def hook(module, inp, out):
            # per-channel max magnitude over all (batch, token) positions
            stats[idx] = (inp[0].detach().abs().flatten(0, -2).max(dim=0).values,
                          out.detach().abs().flatten(0, -2).max(dim=0).values)
        return hook

    for i, layer in enumerate(model.model.layers):
        handles.append(layer.mlp.down_proj.register_forward_hook(make_hook(i)))
    with torch.no_grad():
        model(**inputs)
    for h in handles:
        h.remove()

    for i, (max_in, max_out) in stats.items():
        if max_in.max() > threshold and max_out.max() > threshold:
            row, col = max_out.argmax().item(), max_in.argmax().item()
            print(f"[Layer {i}] Super weight candidate at down_proj.weight[{row}, {col}]")

find_super_weight(model, inputs)
```

---

## 📈 8. Interpretation & Use Cases

| Use Case | Effect of Preserving SW/SA |
|:--|:--|
| **Quantization** | Lifts simple round-to-nearest (INT4/INT8) to ≈ 70–80 % of SmoothQuant quality. |
| **Model Compression** | Allows larger quantization block sizes (e.g., 512×512) with less degradation. |
| **Explainability** | Shows that a few weights govern output token probabilities; removing the SW shifts probability mass toward stopwords. |

---

## 🧭 9. Summary

- 🧩 Super Weights exist: a handful of scalars dominate LLM behavior.
- ⚙️ They can be found data-free, with a single forward pass.
- ⚡ Preserving them is vital for model compression and quantization.
- 📊 The authors released an index of SW coordinates for open LLMs.

---

## 📚 10. References

Yu et al., **"The Super Weight in Large Language Models,"** arXiv:2411.07191v2, 2025.
Sun et al., *Massive Activations in Large Language Models*, ICLR Workshop 2024.
Frantar et al. (*GPTQ*), Lin et al. (*AWQ*), Xiao et al. (*SmoothQuant*), 2022–2024.

---
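## 🔧 11. Appendix: Super Activation Restoration (Step B Sketch)

A minimal sketch of Step B from Section 6, assuming the `model`, `tokenizer`, and `inputs` already loaded by the Section 7 snippet. The coordinates are the Llama-2-7B entry from Table 2; the hook-based approach (recording the `down_proj` output at the super-activation channel for one fixed prompt and writing it back after pruning) is an illustrative implementation choice, not the paper's exact procedure.

```python
import torch

# Llama-2-7B coordinates from Table 2; substitute your model's entry.
LAYER, ROW, COL = 1, 2533, 7890
down_proj = model.model.layers[LAYER].mlp.down_proj
recorded = {}

def record_hook(module, inp, out):
    # remember the output channel that carries the super activation
    recorded["sa"] = out[..., ROW].detach().clone()

def restore_hook(module, inp, out):
    # write the recorded super activation back into the pruned model's output
    out = out.clone()
    out[..., ROW] = recorded["sa"]   # same prompt, so the shapes match
    return out

# 1) Baseline pass: record the super activation for the reference prompt.
h = down_proj.register_forward_hook(record_hook)
with torch.no_grad():
    baseline = model(**inputs)
h.remove()

# 2) Prune the super weight and run the pruned model.
with torch.no_grad():
    down_proj.weight[ROW, COL] = 0.0
    pruned = model(**inputs)

# 3) Pruned pass with the super activation manually restored.
h = down_proj.register_forward_hook(restore_hook)
with torch.no_grad():
    restored = model(**inputs)
h.remove()
```

Comparing the output distributions (or perplexity on a short text) of the `baseline`, `pruned`, and `restored` passes should show that re-inserting the super activation recovers part, but not all, of the quality lost by pruning, consistent with Section 2.

---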