
# Model Selector Guide

## How to Choose a Model

The dashboard supports seven transformer models from five architecture families. Select a model from the dropdown menu in the generator section at the top of the page.

## Available Models

| Model | Family | Params (non-embedding) | Layers × Heads | Key Feature |
| --- | --- | --- | --- | --- |
| GPT-2 (124M) | GPT-2 | 85M | 12 × 12 | The MI classic — start here |
| GPT-2 Medium (355M) | GPT-2 | 302M | 24 × 16 | Scale comparison within GPT-2 |
| GPT-Neo 125M | GPT-Neo | 85M | 12 × 12 | Local attention in alternating layers |
| Pythia-160M | Pythia | 85M | 12 × 12 | Rotary PE, parallel attn+MLP |
| Pythia-410M | Pythia | 302M | 24 × 16 | Larger Pythia for scale comparison |
| OPT-125M | OPT | 85M | 12 × 12 | ReLU activation (unique contrast) |
| Qwen2.5-0.5B (494M) | Qwen2 | 391M | 24 × 14 | Modern: RMSNorm, SiLU, rotary PE |
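For scripting against the same set of models, the table can be mirrored as a small lookup. The HuggingFace model IDs below are assumptions (the dashboard may use different identifiers), and `same_shape` is an illustrative helper, not part of the dashboard's API:

```python
# Illustrative lookup mirroring the table above.
# The HuggingFace model IDs are assumed, not confirmed by the dashboard.
MODELS = {
    "gpt2":                    {"family": "GPT-2",   "layers": 12, "heads": 12},
    "gpt2-medium":             {"family": "GPT-2",   "layers": 24, "heads": 16},
    "EleutherAI/gpt-neo-125m": {"family": "GPT-Neo", "layers": 12, "heads": 12},
    "EleutherAI/pythia-160m":  {"family": "Pythia",  "layers": 12, "heads": 12},
    "EleutherAI/pythia-410m":  {"family": "Pythia",  "layers": 24, "heads": 16},
    "facebook/opt-125m":       {"family": "OPT",     "layers": 12, "heads": 12},
    "Qwen/Qwen2.5-0.5B":       {"family": "Qwen2",   "layers": 24, "heads": 14},
}

def same_shape(a: str, b: str) -> bool:
    """True when two models share layer and head counts, so differences
    between them come from architecture rather than scale."""
    ma, mb = MODELS[a], MODELS[b]
    return (ma["layers"], ma["heads"]) == (mb["layers"], mb["heads"])
```

For example, `same_shape("gpt2", "facebook/opt-125m")` is true, which is what makes their attention patterns directly comparable.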

## Architecture Comparisons

These models were chosen to highlight specific architectural differences:

- Positional encoding: GPT-2, GPT-Neo, and OPT use absolute (learned) positions, while Pythia and Qwen use rotary embeddings (RoPE). Comparing the same prompt across both types shows how positional encoding affects attention patterns.
- Activation function: most of the models use GELU, but OPT uses ReLU and Qwen uses SiLU. This changes how MLP activations look.
- Normalization: GPT-2, GPT-Neo, Pythia, and OPT use LayerNorm; Qwen uses RMSNorm.
- Attention scope: GPT-Neo alternates between local (256-token window) and global attention; all the other models attend globally in every layer.
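GPT-Neo's alternating scope follows a simple rule: in GPT-Neo 125M, even-indexed layers attend globally and odd-indexed layers use the local window. A minimal sketch (the function name is illustrative, not from the dashboard):

```python
def neo_attention_scope(layer_idx: int) -> str:
    """Attention scope per layer in GPT-Neo 125M: even-indexed layers
    attend globally, odd-indexed layers use a 256-token local window."""
    return "global" if layer_idx % 2 == 0 else "local"

scopes = [neo_attention_scope(i) for i in range(12)]
print(scopes[:4])  # ['global', 'local', 'global', 'local']
```

This is why attention heatmaps for GPT-Neo's odd layers show a band hugging the diagonal, while even layers can attend anywhere in the prompt.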

## What Happens When You Load a Model

1. The model weights are downloaded from HuggingFace (this may take a moment the first time; later loads use the local cache)
2. The dashboard auto-detects the model's architecture family
3. Internal hooks are configured automatically to capture attention patterns, MLP activations, and other data
4. The layer and head dropdowns are populated from the model's structure
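Step 2 can be approximated by reading the `model_type` field that every HuggingFace config carries. The mapping below is a sketch under that assumption, not the dashboard's actual detection code:

```python
# Sketch of architecture-family detection from a HuggingFace config's
# `model_type` string. Covers only the dropdown models in this guide.
FAMILY_BY_MODEL_TYPE = {
    "gpt2": "GPT-2",
    "gpt_neo": "GPT-Neo",
    "gpt_neox": "Pythia",  # Pythia models use the GPT-NeoX architecture
    "opt": "OPT",
    "qwen2": "Qwen2",
}

def detect_family(model_type: str) -> str:
    """Map a config's `model_type` to a dashboard family, or fail loudly."""
    family = FAMILY_BY_MODEL_TYPE.get(model_type)
    if family is None:
        raise ValueError(f"Unsupported architecture: {model_type!r}")
    return family
```

Note the indirection for Pythia: its checkpoints report `gpt_neox` as their `model_type`, so family detection cannot simply reuse the model name.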

## Tips for Choosing

- Start with GPT-2: it's small, fast, and the most widely studied model in mechanistic interpretability; most educational resources reference it.
- Compare same-size models: GPT-2, GPT-Neo 125M, Pythia-160M, and OPT-125M all have 12 layers and 12 heads, so differences in their attention patterns come from architecture, not scale.
- Compare scale: GPT-2 vs. GPT-2 Medium (or Pythia-160M vs. Pythia-410M) shows how more layers and heads change behavior.
- Try Qwen2.5 for a modern perspective: its design (RMSNorm, SiLU, rotary embeddings) differs from GPT-2 at nearly every point.
- Memory matters: all dropdown models are small enough for interactive exploration; larger models can be entered manually but will be slower.
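As a rough rule of thumb for the memory point (an estimate of our own, not a dashboard feature), weight memory is parameter count times bytes per parameter:

```python
def approx_weight_mb(n_params: float, bytes_per_param: int = 4) -> float:
    """Rough memory footprint of the weights alone (fp32 by default).
    Activations and captured hook data add more on top of this."""
    return n_params * bytes_per_param / 1e6

print(approx_weight_mb(124e6))  # 496.0 MB for the full GPT-2 (124M) in fp32
print(approx_weight_mb(494e6))  # 1976.0 MB for Qwen2.5-0.5B in fp32
```

So even the largest dropdown model stays around 2 GB of weights in fp32, which is what keeps exploration interactive.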

## Generation Settings

After selecting a model and entering a prompt, you can configure:

- Number of Generation Choices (Beams): 1–5 beams. More beams explore more candidate continuations but take longer.
- Number of New Tokens: 1–20 tokens to generate. Shorter runs are faster.
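A minimal sketch of how those ranges could be enforced; the function simply restates the bullets above and is not the dashboard's actual validator:

```python
def validate_generation_settings(num_beams: int, max_new_tokens: int) -> None:
    """Raise ValueError if a setting falls outside the documented ranges."""
    if not 1 <= num_beams <= 5:
        raise ValueError("num_beams must be in the range 1-5")
    if not 1 <= max_new_tokens <= 20:
        raise ValueError("max_new_tokens must be in the range 1-20")

validate_generation_settings(num_beams=3, max_new_tokens=10)  # OK, returns None
```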

Click Analyze to run the model and see results in the pipeline and generation sections.