# OPT Overview

## What Is OPT?

OPT (Open Pre-trained Transformer) is a family of language models released by Meta AI in 2022. It was designed to roughly match GPT-3's architecture and performance while being openly available to researchers. Like GPT-2, it uses a decoder-only transformer architecture, but it is available in much larger sizes.

## Architecture Details

OPT's architecture is close to GPT-2's, with a few differences listed further down. The smaller variants are configured as follows:

| Property | OPT-125M | OPT-350M | OPT-1.3B |
|----------|----------|----------|----------|
| Parameters | 125M | 350M | 1.3B |
| Layers | 12 | 24 | 24 |
| Attention Heads | 12 | 16 | 32 |
| Hidden Dimension | 768 | 1024 | 2048 |
| Vocabulary Size | 50,272 | 50,272 | 50,272 |
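
The parameter counts in the table can be roughly reproduced from the layer count, hidden dimension, and vocabulary size. The sketch below is a back-of-the-envelope estimate, not code from the dashboard. It assumes a feed-forward width of 4x the hidden dimension, 2048 learned positions plus an offset of 2, biases on all linear layers, and an output head tied to the token embedding. None of these are stated in the table, and the function name `estimate_opt_params` is hypothetical.

```python
def estimate_opt_params(
    layers: int,
    hidden: int,
    vocab: int = 50272,
    max_positions: int = 2048,   # assumption: context length, not given in the table
    position_offset: int = 2,    # assumption: extra rows in the position table
    ffn_multiplier: int = 4,     # assumption: MLP width is 4x the hidden dimension
) -> int:
    """Estimate the parameter count of an OPT-style decoder-only transformer."""
    ffn = ffn_multiplier * hidden

    token_embedding = vocab * hidden  # also reused as the output head (assumed tied)
    position_embedding = (max_positions + position_offset) * hidden

    # Query, key, value, and output projections, each with a bias vector.
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers (hidden -> ffn -> hidden), each with a bias vector.
    mlp = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden)

    final_layer_norm = 2 * hidden

    return (
        token_embedding
        + position_embedding
        + layers * (attention + mlp + layer_norms)
        + final_layer_norm
    )


for name, layers, hidden in [("OPT-125M", 12, 768), ("OPT-1.3B", 24, 2048)]:
    total = estimate_opt_params(layers, hidden)
    print(f"{name}: {total / 1e6:.1f}M parameters (estimated)")
```

Under these assumptions the estimates come out at roughly 125.2M and 1.316B, in line with the nominal sizes. OPT-350M is left out because its published configuration appears to project embeddings through a narrower dimension than the hidden size, which this simple formula does not model.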

### Key Differences from GPT-2

- **ReLU activation**: OPT uses **ReLU** instead of GPT-2's GELU. This is the only model in the dashboard with ReLU, making it useful for comparing how activation functions affect MLP behavior.
- **Learned positional embeddings**: Like GPT-2, OPT uses learned absolute position embeddings (unlike Pythia's or Qwen's RoPE).
- **LayerNorm placement**: OPT uses pre-norm LayerNorm (applied before each sublayer). GPT-2 uses the same arrangement, so this sets OPT apart from the original GPT and the original Transformer rather than from GPT-2. One exception is `facebook/opt-350m`, whose Hugging Face configuration sets `do_layer_norm_before` to `false` (post-norm).
- **Larger variants available**: OPT scales up to 175 billion parameters, though only the smaller variants are practical for interactive use.
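
To make the activation difference concrete, here is a minimal standard-library sketch of the two functions. It is illustrative only: it uses the exact erf-based GELU, whereas GPT-2 implementations commonly use a tanh approximation of it, and the function names are chosen for this example.

```python
import math


def relu(x: float) -> float:
    """ReLU: passes positive inputs through and zeroes out everything else."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x scaled by the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Compare the two activations on a few sample pre-activation values.
for x in (-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):+.4f}  gelu={gelu(x):+.4f}")
```

The practical consequence for MLP analysis is that ReLU outputs are exactly zero for every negative input, so inactive neurons are easy to identify. GELU lets small negative values through, so a neuron is rarely exactly off.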

### Similarities to GPT-2

- Same general decoder-only architecture
- Same tokenizer style (BPE with ~50K vocabulary)
- Same attention mechanism (standard multi-head self-attention)
- Similar training objective (next-token prediction)

## What to Expect in the Dashboard

When using OPT models:

- **OPT-125M is very similar to GPT-2**: Same number of layers (12), heads (12), and hidden dimension (768). You'll see similar attention patterns and predictions.
- **Different module paths**: The dashboard auto-detects OPT's internal structure (e.g., `model.decoder.layers.N.self_attn`), so hooking works automatically.
- **Tokenization**: OPT's tokenizer is very similar to GPT-2's, so the same text usually produces similar (but not identical) token sequences. One visible difference is that OPT prepends a `</s>` token to every prompt.
- **Good for comparison**: Running the same prompt on GPT-2 and OPT-125M can show how similar architectures with different training data and activation functions produce different predictions.
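
The module-path difference can be sketched as follows. This is not the dashboard's detection code, which is not shown in this document. The helper names and the lookup table are hypothetical, and the GPT-2 path assumes the Hugging Face `transformers` layout (`transformer.h.N.attn`). The hook helper relies only on the standard PyTorch `nn.Module` methods `get_submodule` and `register_forward_hook`.

```python
# Hypothetical lookup table keyed by the Hugging Face `config.model_type` string.
ATTENTION_PATH_TEMPLATES = {
    "opt": "model.decoder.layers.{layer}.self_attn",
    "gpt2": "transformer.h.{layer}.attn",  # assumption: transformers' GPT-2 layout
}


def attention_module_path(model_type: str, layer: int) -> str:
    """Return the dotted path of the attention module for a given layer."""
    return ATTENTION_PATH_TEMPLATES[model_type].format(layer=layer)


def capture_module_output(model, path: str, store: dict):
    """Attach a forward hook that saves the module's output under `path`.

    `model` is expected to be a torch.nn.Module. The returned handle has a
    `remove()` method that detaches the hook again.
    """
    module = model.get_submodule(path)

    def hook(_module, _inputs, output):
        store[path] = output

    return module.register_forward_hook(hook)


print(attention_module_path("opt", 0))
print(attention_module_path("gpt2", 0))
```

With a loaded model, `capture_module_output(model, attention_module_path(model.config.model_type, 0), store)` would record the first layer's attention output during the next forward pass.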

## HuggingFace Model IDs

- `facebook/opt-125m` (in dropdown)
- `facebook/opt-350m`, `facebook/opt-1.3b` (larger, enter manually)
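
Outside the dashboard, the same IDs can be loaded with the Hugging Face `transformers` library. The sketch below assumes `transformers` and PyTorch are installed. The `load_opt` helper and the `OPT_MODEL_IDS` tuple are hypothetical names for this example, and the helper deliberately accepts only the IDs listed in this document.

```python
OPT_MODEL_IDS = (
    "facebook/opt-125m",
    "facebook/opt-350m",
    "facebook/opt-1.3b",
)


def load_opt(model_id: str = "facebook/opt-125m"):
    """Load the tokenizer and causal language model for one of the listed IDs."""
    if model_id not in OPT_MODEL_IDS:
        raise ValueError(f"Unknown model ID: {model_id!r}")

    # Imported lazily so the ID check above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model
```

Calling `load_opt("facebook/opt-125m")` downloads the checkpoint on first use and caches it locally. The larger variants need correspondingly more disk space and memory.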