# OPT Overview

## What Is OPT?

OPT (Open Pre-trained Transformer) is a family of language models released by Meta in 2022. OPT was designed to replicate GPT-3's architecture and performance while being openly available to researchers. It uses a decoder-only transformer architecture similar to GPT-2, but is available in much larger sizes.

## Architecture Details

OPT's architecture is close to GPT-2's, with a few differences:

| Property | OPT-125M | OPT-350M | OPT-1.3B |
|----------|----------|----------|----------|
| Parameters | 125M | 350M | 1.3B |
| Layers | 12 | 24 | 24 |
| Attention Heads | 12 | 16 | 32 |
| Hidden Dimension | 768 | 1024 | 2048 |
| Vocabulary Size | 50,272 | 50,272 | 50,272 |

### Key Differences from GPT-2

- **ReLU activation**: OPT uses **ReLU** instead of GPT-2's GELU. This is the only model in the dashboard with ReLU, making it useful for comparing how activation functions affect MLP behavior.
- **Learned positional embeddings**: Like GPT-2, OPT uses learned absolute position embeddings (unlike Pythia's or Qwen's RoPE).
- **LayerNorm placement**: Most OPT variants use pre-norm LayerNorm (applied before each sublayer), the same arrangement as GPT-2. OPT-350M is the exception: its Hugging Face config sets `do_layer_norm_before` to `false`, so LayerNorm is applied after each sublayer.
- **Larger variants available**: OPT scales up to 175 billion parameters, though only the smaller variants are practical for interactive use.

### Similarities to GPT-2

- Same general decoder-only architecture
- Same tokenizer style (BPE with ~50K vocabulary)
- Same attention mechanism (standard multi-head self-attention)
- Similar training objective (next-token prediction)

## What to Expect in the Dashboard

When using OPT models:

- **OPT-125M is very similar to GPT-2**: It has the same number of layers (12), heads (12), and hidden dimension (768) as the smallest GPT-2. You'll see similar attention patterns and predictions.
- **Different module paths**: The dashboard auto-detects OPT's internal structure (e.g., `model.decoder.layers.N.self_attn`), so hooking works automatically.
- **Tokenization**: OPT's tokenizer is very similar to GPT-2's, so the same text usually produces similar (but not identical) token sequences. One visible difference is that OPT's tokenizer prepends a `</s>` token to every prompt.
- **Good for comparison**: Running the same prompt on GPT-2 and OPT-125M can show how similar architectures with different training data and activation functions produce different predictions.

## HuggingFace Model IDs

- `facebook/opt-125m` (in dropdown)
- `facebook/opt-350m`, `facebook/opt-1.3b` (larger, enter manually)
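
The size suffix in each model ID can be roughly cross-checked against the layer counts and hidden dimensions in the architecture table, without downloading a checkpoint. The sketch below is a back-of-the-envelope estimate, not the dashboard's code. The helper `estimate_params` is hypothetical, and it rests on assumptions that are not stated in the table: a feed-forward width of 4 × the hidden dimension, 2,048 positions plus an offset of 2, biases on every linear layer, and input/output embeddings that share weights.

```python
# Rough parameter-count estimate for a decoder-only transformer with
# learned position embeddings and tied input/output embeddings.
# Assumptions (not taken from the table): FFN width = 4 * hidden,
# 2048 positions plus an offset of 2, biases on every linear layer.

def estimate_params(layers, hidden, vocab=50_272, positions=2_048 + 2):
    """Hypothetical helper: approximate total parameters from layer count and width."""
    attn = 4 * (hidden * hidden + hidden)                   # q, k, v, out projections with biases
    ffn = 2 * (hidden * 4 * hidden) + 4 * hidden + hidden   # two FFN matrices plus their biases
    norms = 2 * (2 * hidden)                                # two LayerNorms per layer (weight + bias)
    per_layer = attn + ffn + norms
    embeddings = vocab * hidden + positions * hidden        # token + position embeddings
    final_norm = 2 * hidden                                 # LayerNorm after the last layer
    return layers * per_layer + embeddings + final_norm

# (layers, hidden dimension) copied from the architecture table.
TABLE = {"OPT-125M": (12, 768), "OPT-350M": (24, 1024), "OPT-1.3B": (24, 2048)}
ESTIMATES = {name: estimate_params(*dims) for name, dims in TABLE.items()}

for name, n in ESTIMATES.items():
    print(f"{name}: ~{n / 1e6:.0f}M parameters")
```

Under these assumptions the estimates land within a few percent of the nominal sizes. Treat the OPT-350M figure as the loosest of the three, since that checkpoint's configuration departs from this simple layout in more than one way.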