LLMVis / rag_docs /model_selector_guide.md

cdpearlman

New models chosen and rag docs updated

f478beb 17 days ago

Model Selector Guide

How to Choose a Model

The dashboard supports seven transformer models from five architecture families. Select a model from the dropdown menu in the generator section at the top of the page.

Available Models

Model	Family	Params	Layers × Heads	Key Feature
GPT-2 (124M)	GPT-2	85M	12 × 12	The mechanistic interpretability (MI) classic; start here
GPT-2 Medium (355M)	GPT-2	302M	24 × 16	Scale comparison within GPT-2
GPT-Neo 125M	GPT-Neo	85M	12 × 12	Local attention in alternating layers
Pythia-160M	Pythia	85M	12 × 12	Rotary PE, parallel attn+MLP
Pythia-410M	Pythia	302M	24 × 16	Larger Pythia for scale comparison
OPT-125M	OPT	85M	12 × 12	ReLU activation (unique contrast)
Qwen2.5-0.5B (494M)	Qwen2	391M	24 × 14	Modern: RMSNorm, SiLU, rotary PE
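
The Params column is smaller than the size given in each model name. One plausible reading, offered here as an assumption and not as the dashboard's stated method, is that it counts only the weights inside the transformer blocks and leaves out the embedding tables. The sketch below checks that reading for the two GPT-2 rows with the common 12 · layers · d_model² rule of thumb. The hidden sizes (768 and 1024) are not in the table; they come from the published GPT-2 configurations, and `approx_block_params` is a name made up for this example.

```python
def approx_block_params(n_layers: int, d_model: int) -> int:
    """Rough weight count for the transformer blocks only.

    Ignores biases, LayerNorm parameters, and the embedding tables.
    """
    attention = 4 * d_model * d_model    # Q, K, V and output projections
    mlp = 2 * d_model * (4 * d_model)    # up- and down-projection, 4x hidden width
    return n_layers * (attention + mlp)


# (name from the table, layers from the table, d_model assumed from the GPT-2 configs)
ROWS = [
    ("GPT-2 (124M)", 12, 768),
    ("GPT-2 Medium (355M)", 24, 1024),
]

for name, n_layers, d_model in ROWS:
    millions = approx_block_params(n_layers, d_model) / 1e6
    print(f"{name}: about {millions:.0f}M block parameters")
```

The estimate lands on 85M and 302M, matching the table. It does not carry over directly to Qwen2.5, whose gated MLP and grouped-query attention change the per-layer count.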

Architecture Comparisons

These models were chosen to highlight specific architectural differences:

Positional encoding: GPT-2, GPT-Neo, and OPT use learned absolute position embeddings. Pythia and Qwen use rotary position embeddings (RoPE). Running the same prompt through one model of each type shows how positional encoding (PE) affects attention patterns.
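Activation function: GPT-2, GPT-Neo, and Pythia use GELU, OPT uses ReLU, and Qwen uses SiLU. ReLU outputs exact zeros for negative inputs, so OPT's MLP activations are sparser than those of the GELU and SiLU models.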
Normalization: GPT-2, GPT-Neo, Pythia, and OPT use LayerNorm. Qwen uses RMSNorm, which rescales activations without subtracting the mean.
Attention scope: GPT-Neo alternates between local (256-token window) and global attention layers. Every other model in the dropdown uses global attention in all layers.

What Happens When You Load a Model

The model is downloaded from Hugging Face (this may take a moment the first time).
The dashboard auto-detects the model's architecture family.
Internal hooks are automatically configured to capture attention patterns, MLP activations, and other data.
The layer and head dropdowns are populated based on the model's structure.
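
As a rough illustration of the detection and dropdown steps, the sketch below maps a config's `model_type` string to a family and builds the layer and head choices. The `model_type` values are the ones Hugging Face uses for these architectures; everything else (`ModelConfig`, `detect_family`, `dropdown_options`, and the zero-based indexing) is hypothetical and not taken from the dashboard's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical stand-in for the fields a loaded model config exposes."""
    model_type: str
    num_layers: int
    num_heads: int


# Hugging Face model_type string -> family name as used in this guide.
FAMILY_BY_MODEL_TYPE = {
    "gpt2": "GPT-2",
    "gpt_neo": "GPT-Neo",
    "gpt_neox": "Pythia",
    "opt": "OPT",
    "qwen2": "Qwen2",
}


def detect_family(config: ModelConfig) -> str:
    try:
        return FAMILY_BY_MODEL_TYPE[config.model_type]
    except KeyError:
        raise ValueError(f"Unsupported model_type: {config.model_type!r}") from None


def dropdown_options(config: ModelConfig) -> tuple[list[int], list[int]]:
    # Assumes zero-based layer and head indices in the dropdowns.
    return list(range(config.num_layers)), list(range(config.num_heads))


pythia_160m = ModelConfig(model_type="gpt_neox", num_layers=12, num_heads=12)
layers, heads = dropdown_options(pythia_160m)
print(detect_family(pythia_160m), len(layers), len(heads))
```

Keying on `model_type` instead of the model name is one way to let a manually entered model from a known family work without a per-model entry.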

Tips for Choosing

Start with GPT-2: It's small, fast, and the most widely studied. Most educational resources reference GPT-2.
Compare same-size models: GPT-2, GPT-Neo 125M, Pythia-160M, and OPT-125M all have 12 layers and 12 heads, so differences in their attention patterns come from architecture and training data rather than scale.
Compare scale: GPT-2 vs GPT-2 Medium (or Pythia-160M vs Pythia-410M) shows how more layers and heads change behavior.
Try Qwen2.5 for a modern perspective: It combines RMSNorm, SiLU, and rotary position embeddings, none of which GPT-2 uses.
Memory matters: All dropdown models are small enough for interactive exploration. Larger models can be entered manually but may be slow.
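
To see which comparisons hold scale fixed, the dropdown models can be grouped by their layer and head counts. This is a small sketch using only the rows of the Available Models table; `MODELS` and `group_by_shape` are names invented for the example.

```python
from collections import defaultdict

# (name, family, layers, heads), copied from the Available Models table.
MODELS = [
    ("GPT-2 (124M)", "GPT-2", 12, 12),
    ("GPT-2 Medium (355M)", "GPT-2", 24, 16),
    ("GPT-Neo 125M", "GPT-Neo", 12, 12),
    ("Pythia-160M", "Pythia", 12, 12),
    ("Pythia-410M", "Pythia", 24, 16),
    ("OPT-125M", "OPT", 12, 12),
    ("Qwen2.5-0.5B (494M)", "Qwen2", 24, 14),
]


def group_by_shape(models):
    """Map (layers, heads) to the model names that share that shape."""
    groups = defaultdict(list)
    for name, _family, layers, heads in models:
        groups[(layers, heads)].append(name)
    return dict(groups)


for (layers, heads), names in group_by_shape(MODELS).items():
    print(f"{layers} x {heads}: {', '.join(names)}")
```

Models that share a group are candidates for architecture comparisons, and two models from one family in different groups make a scale comparison.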

Generation Settings

After selecting a model and entering a prompt, you can configure:

Number of Generation Choices (Beams): 1 to 5 beams. More beams explore more candidate continuations in parallel but take longer.
Number of New Tokens: 1-20 tokens to generate. Shorter is faster.
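
A toy beam search shows why extra beams can change the result. The sketch below is a simplified illustration, not the dashboard's generation code: `TOY_NEXT` is a made-up next-token table that stands in for a real model, and `beam_search` is a hypothetical helper that only borrows the 1 to 5 and 1 to 20 limits from the settings above.

```python
import math

# Made-up next-token probabilities, keyed by the previous token only.
TOY_NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"sat": 1.0},
}


def beam_search(num_beams: int, max_new_tokens: int, start: str = "<s>"):
    """Return up to num_beams (tokens, log_probability) pairs, best first."""
    if not 1 <= num_beams <= 5:
        raise ValueError("num_beams must be between 1 and 5")
    if not 1 <= max_new_tokens <= 20:
        raise ValueError("max_new_tokens must be between 1 and 20")

    beams = [([start], 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_NEXT[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the highest-scoring continuations for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


for n in (1, 2):
    tokens, score = beam_search(num_beams=n, max_new_tokens=2)[0]
    print(f"{n} beam(s): {' '.join(tokens[1:])} (p = {math.exp(score):.2f})")
```

With one beam the search commits to "the" and ends at probability 0.30. With two beams it keeps "a" alive for one more step and finds "a cat" at 0.36, at the cost of scoring twice as many candidates per step.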

Click Analyze to run the model and see results in the pipeline and generation sections.