cdpearlman

New models chosen and rag docs updated

f478beb about 1 month ago


OPT Overview

What Is OPT?

OPT (Open Pre-trained Transformer) is a family of language models released by Meta in 2022. OPT was designed to replicate GPT-3's architecture and performance while being openly available to researchers. It uses a decoder-only transformer architecture similar to GPT-2 but with options for much larger sizes.

Architecture Details

OPT's architecture is close to GPT-2's, with a few differences covered below. The variants relevant to the dashboard have these sizes:

Property	OPT-125M	OPT-350M	OPT-1.3B
Parameters	125M	350M	1.3B
Layers	12	24	24
Attention Heads	12	16	32
Hidden Dimension	768	1024	2048
Vocabulary Size	50,272	50,272	50,272
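
One pattern in the table is worth spelling out: dividing the hidden dimension by the number of attention heads gives the per-head size. The sketch below does that arithmetic using only the numbers from the table; the names OPT_SPECS and head_dim are illustrative and not part of the dashboard.

```python
# Sizes copied from the table above; OPT_SPECS and head_dim are illustrative names.
OPT_SPECS = {
    "OPT-125M": {"layers": 12, "heads": 12, "hidden": 768},
    "OPT-350M": {"layers": 24, "heads": 16, "hidden": 1024},
    "OPT-1.3B": {"layers": 24, "heads": 32, "hidden": 2048},
}


def head_dim(spec):
    """Per-head dimension: the hidden dimension split evenly across heads."""
    assert spec["hidden"] % spec["heads"] == 0
    return spec["hidden"] // spec["heads"]


for name, spec in OPT_SPECS.items():
    print(name, head_dim(spec))
```

All three variants come out at 64 dimensions per head, so the larger models add heads and layers rather than widening each head.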

Key Differences from GPT-2

ReLU activation: OPT uses ReLU instead of GPT-2's GELU. This is the only model in the dashboard with ReLU, making it useful for comparing how activation functions affect MLP behavior.
Learned positional embeddings: Like GPT-2, OPT uses learned absolute position embeddings (unlike Pythia's or Qwen's RoPE).
LayerNorm placement: Most OPT variants use pre-norm LayerNorm (applied before each sublayer), the same arrangement as GPT-2. OPT-350M is the exception: its HuggingFace config sets do_layer_norm_before to false, so LayerNorm is applied after each sublayer.
Larger variants available: OPT scales up to 175 billion parameters, though only the smaller variants are practical for interactive use.
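
To make the activation difference concrete, here is a minimal sketch comparing ReLU with the tanh approximation of GELU that GPT-2 implementations commonly use. It works on plain Python floats rather than tensors, and the function names are chosen for illustration only.

```python
import math


def relu(x: float) -> float:
    """ReLU, the activation used in OPT's MLP blocks."""
    return max(0.0, x)


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, as commonly used in GPT-2 implementations."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu_tanh(x):.4f}")
```

ReLU maps every negative input to exactly zero, while GELU lets small negative values through. When comparing MLP activations in the dashboard, this is one reason OPT's may contain more exact zeros than GPT-2's.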

Similarities to GPT-2

Same general decoder-only architecture
Same tokenizer style (BPE with ~50K vocabulary)
Same attention mechanism (standard multi-head self-attention)
Similar training objective (next-token prediction)
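
The shared training objective can be sketched in a few lines: each token is paired with the token that follows it, and the loss is the negative log-probability the model assigns to that next token. The token ids and logits below are made up for illustration, and the helper names are hypothetical.

```python
import math


def next_token_pairs(token_ids):
    """Pair each position's input token with the token it should predict next."""
    return list(zip(token_ids[:-1], token_ids[1:]))


def cross_entropy(logits, target_id):
    """Negative log-probability of target_id under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_id]


tokens = [5, 2, 7, 1]  # made-up token ids
print(next_token_pairs(tokens))
print(cross_entropy([0.0, 0.0, 0.0, 0.0], target_id=2))
```

With uniform logits over four tokens, the loss equals log(4), the cost of guessing at random among four options.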

What to Expect in the Dashboard

When using OPT models:

OPT-125M is very similar to the smallest GPT-2: Both have 12 layers, 12 heads, and a hidden dimension of 768, so you'll see similar attention patterns and predictions.
Different module paths: The dashboard auto-detects OPT's internal structure (e.g., model.decoder.layers.N.self_attn), so hooking works automatically.
Tokenization: OPT's tokenizer is a byte-level BPE tokenizer very similar to GPT-2's, but it prepends a </s> token to every prompt, so the same text usually produces similar (but not identical) token sequences.
Good for comparison: Running the same prompt on GPT-2 and OPT-125M can show how similar architectures with different training data and activation functions produce different predictions.
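
As a rough illustration of the module-path difference, the sketch below builds the attention module path for each layer in the HuggingFace implementations of OPT and GPT-2. The names ATTN_PATH_TEMPLATES and attention_paths are hypothetical. How the dashboard detects these paths internally is not shown here.

```python
# Path templates for the attention module of layer n in the HuggingFace models.
# ATTN_PATH_TEMPLATES and attention_paths are hypothetical names for illustration.
ATTN_PATH_TEMPLATES = {
    "opt": "model.decoder.layers.{n}.self_attn",
    "gpt2": "transformer.h.{n}.attn",
}


def attention_paths(model_type, num_layers):
    """Return the dotted attention module path for every layer."""
    template = ATTN_PATH_TEMPLATES[model_type]
    return [template.format(n=n) for n in range(num_layers)]


print(attention_paths("opt", 12)[0])
print(attention_paths("gpt2", 12)[0])
```

With a loaded PyTorch model, a dotted path like these can be resolved with model.get_submodule(path), and a hook attached to the result with register_forward_hook.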

HuggingFace Model IDs

facebook/opt-125m (in dropdown)
facebook/opt-350m, facebook/opt-1.3b (larger, enter manually)
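
Outside the dashboard, the same IDs can be loaded with the transformers library. The following is a minimal sketch, assuming transformers and PyTorch are installed and the model can be downloaded. The names load_opt and OPT_MODEL_IDS are illustrative helpers, not part of the dashboard.

```python
OPT_MODEL_IDS = ["facebook/opt-125m", "facebook/opt-350m", "facebook/opt-1.3b"]


def load_opt(model_id="facebook/opt-125m"):
    """Download (on first use) and return the tokenizer and causal LM for model_id."""
    if model_id not in OPT_MODEL_IDS:
        raise ValueError(f"unexpected model id: {model_id}")
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model


# Example usage (triggers a download the first time it runs):
# tokenizer, model = load_opt("facebook/opt-125m")
# print(model.config.num_hidden_layers, model.config.hidden_size)
```

The larger checkpoints need more memory and take longer to load, which is why only the smallest appears in the dropdown.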