# OPT Overview

## What Is OPT?

OPT (Open Pre-trained Transformer) is a family of language models released by Meta in 2022. OPT was designed to replicate GPT-3's architecture and performance while being openly available to researchers. It uses a decoder-only transformer architecture similar to GPT-2's, released in a range of sizes from 125M up to 175B parameters.
## Architecture Details

OPT's architecture is close to GPT-2's but has some differences:
| Property | OPT-125M | OPT-350M | OPT-1.3B |
|----------|----------|----------|----------|
| Parameters | 125M | 350M | 1.3B |
| Layers | 12 | 24 | 24 |
| Attention Heads | 12 | 16 | 32 |
| Hidden Dimension | 768 | 1024 | 2048 |
| Vocabulary Size | 50,272 | 50,272 | 50,272 |
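
These numbers can be read straight from each checkpoint's configuration. A minimal sketch, assuming the `transformers` library is installed and the `facebook/opt-*` checkpoints are reachable (only the small `config.json` files are downloaded):

```python
from transformers import AutoConfig

# Print the architecture numbers from the table above for each OPT variant.
for model_id in ["facebook/opt-125m", "facebook/opt-350m", "facebook/opt-1.3b"]:
    cfg = AutoConfig.from_pretrained(model_id)
    print(
        f"{model_id}: layers={cfg.num_hidden_layers}, "
        f"heads={cfg.num_attention_heads}, "
        f"hidden={cfg.hidden_size}, vocab={cfg.vocab_size}"
    )
```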
### Key Differences from GPT-2

- **ReLU activation**: OPT uses **ReLU** instead of GPT-2's GELU. This is the only model in the dashboard with ReLU, making it useful for comparing how activation functions affect MLP behavior (see the config check after this list).
- **Learned positional embeddings**: Like GPT-2, OPT uses learned absolute position embeddings (unlike the RoPE used by Pythia and Qwen).
- **LayerNorm placement**: OPT applies LayerNorm before each sublayer (pre-norm), the same general placement GPT-2 uses; the 350M checkpoint is an exception that applies LayerNorm after its sublayers.
- **Larger variants available**: OPT scales up to 175 billion parameters, though only the smaller variants are practical for interactive use.
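
The activation difference is visible directly in the checkpoint configs. A small sketch, again assuming `transformers` is available:

```python
from transformers import AutoConfig

# Compare the MLP activation recorded in each config: GPT-2 reports a GELU
# variant, while OPT reports plain ReLU.
gpt2_cfg = AutoConfig.from_pretrained("gpt2")
opt_cfg = AutoConfig.from_pretrained("facebook/opt-125m")

print("gpt2 activation:    ", gpt2_cfg.activation_function)   # gelu_new
print("opt-125m activation:", opt_cfg.activation_function)    # relu
print("opt-125m max positions:", opt_cfg.max_position_embeddings)  # learned absolute positions
```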
### Similarities to GPT-2

- Same general decoder-only architecture
- Same tokenizer style (BPE with a ~50K vocabulary; see the tokenizer comparison after this list)
- Same attention mechanism (standard multi-head self-attention)
- Similar training objective (next-token prediction)
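
As a rough check of the tokenizer similarity, both tokenizers can be run on the same text. A sketch under the assumption that both checkpoints are downloadable; the sample sentence is arbitrary, and OPT's tokenizer typically prepends a beginning-of-sequence token, so its ID list comes out one token longer:

```python
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
opt_tok = AutoTokenizer.from_pretrained("facebook/opt-125m")

text = "Interpretability dashboards are fun."

# Both tokenizers use GPT-2-style byte-level BPE, so the token strings are
# usually identical; the ID sequences differ slightly (e.g., OPT's BOS token).
print("gpt2 tokens:", gpt2_tok.tokenize(text))
print("opt  tokens:", opt_tok.tokenize(text))
print("gpt2 ids:", gpt2_tok(text)["input_ids"])
print("opt  ids:", opt_tok(text)["input_ids"])
```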
## What to Expect in the Dashboard

When using OPT models:

- **OPT-125M is very similar to GPT-2**: Same number of layers (12), heads (12), and hidden dimension (768). You'll see similar attention patterns and predictions.
- **Different module paths**: The dashboard auto-detects OPT's internal structure (e.g., `model.decoder.layers.N.self_attn`), so hooking works automatically (a hook sketch follows this list).
- **Tokenization**: OPT's tokenizer is very similar to GPT-2's, so the same text usually produces similar (but not identical) token sequences.
- **Good for comparison**: Running the same prompt on GPT-2 and OPT-125M shows how similar architectures trained on different data, and with different activation functions, produce different predictions.
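
For readers who want to replicate the hooking outside the dashboard, here is a minimal sketch of attaching a forward hook to one of OPT's attention modules. The module path follows the `transformers` OPT implementation; the hook function and prompt are illustrative, not the dashboard's actual code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

captured = {}

def save_attn_output(module, inputs, output):
    # OPT's self_attn returns a tuple; the first element is the attention output.
    captured["attn_out"] = output[0].detach()

# Layer 0 self-attention lives at model.model.decoder.layers[0].self_attn
# (the leading "model." is the causal-LM wrapper around the decoder).
handle = model.model.decoder.layers[0].self_attn.register_forward_hook(save_attn_output)

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

print("layer 0 attention output shape:", captured["attn_out"].shape)
```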
## HuggingFace Model IDs

- `facebook/opt-125m` (in the dropdown)
- `facebook/opt-350m`, `facebook/opt-1.3b` (larger; enter manually)
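
Any of these IDs can also be loaded directly with `transformers`. A short sketch; the prompt and generation settings are arbitrary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # swap in opt-350m or opt-1.3b if you have the memory
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Greedy-decode a short continuation to confirm the model loaded correctly.
inputs = tok("Open Pre-trained Transformers are", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```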