Model Selector Guide
How to Choose a Model
The dashboard supports seven transformer models from five architecture families. Select a model from the dropdown menu in the generator section at the top of the page.
Available Models
| Model | Family | Params (non-embedding) | Layers × Heads | Key Feature |
|---|---|---|---|---|
| GPT-2 (124M) | GPT-2 | 85M | 12 × 12 | The mechanistic interpretability (MI) classic; start here |
| GPT-2 Medium (355M) | GPT-2 | 302M | 24 × 16 | Scale comparison within GPT-2 |
| GPT-Neo 125M | GPT-Neo | 85M | 12 × 12 | Local attention in alternating layers |
| Pythia-160M | Pythia | 85M | 12 × 12 | Rotary PE, parallel attn+MLP |
| Pythia-410M | Pythia | 302M | 24 × 16 | Larger Pythia for scale comparison |
| OPT-125M | OPT | 85M | 12 × 12 | ReLU activation (unique contrast) |
| Qwen2.5-0.5B (494M) | Qwen2 | 391M | 24 × 14 | Modern: RMSNorm, SiLU, rotary PE |
Architecture Comparisons
These models were chosen to highlight specific architectural differences:
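- Positional encoding: GPT-2, GPT-Neo, and OPT use learned absolute position embeddings. Pythia and Qwen use rotary position embeddings (RoPE). Running the same prompt through one model of each type shows how the positional encoding affects attention patterns.
- Activation function: Most models use GELU, but OPT uses ReLU and Qwen uses SiLU. This changes the MLP activations you will see: ReLU outputs exactly zero for negative inputs, while GELU and SiLU let small negative values through.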
- Normalization: GPT-2, Neo, Pythia, and OPT use LayerNorm. Qwen uses RMSNorm.
- Attention scope: GPT-Neo alternates between global attention and local attention (256-token window) from layer to layer. Every other model here uses global attention in all layers.
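Two of the differences above are easy to see in a few lines of plain Python. The sketch below is illustrative only and is not the dashboard's code: the function names are made up for this example, the normalization functions omit the learned scale and bias parameters that real models apply, and the `eps` defaults are arbitrary.

```python
import math

def causal_mask(n):
    """Global causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def local_causal_mask(n, window):
    """Local causal attention: query i sees only the `window` most recent
    keys, counting itself (i - window < j <= i)."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned parameters: subtract the mean, then
    divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm without learned parameters: divide by the root mean square.
    The mean is not subtracted."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]

if __name__ == "__main__":
    # With a 256-token window, the query at position 299 sees 256 keys
    # under local attention and all 300 keys under global attention.
    print(sum(local_causal_mask(300, 256)[299]), sum(causal_mask(300)[299]))
    print(layer_norm([1.0, 2.0, 3.0]))
    print(rms_norm([1.0, 2.0, 3.0]))
```

For prompts shorter than 256 tokens the two masks are identical, so local and global layers in GPT-Neo can only differ on longer inputs.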
What Happens When You Load a Model
- The model is downloaded from the Hugging Face Hub (this may take a moment the first time)
- The dashboard auto-detects the model's architecture family
- Internal hooks are automatically configured to capture attention patterns, MLP activations, and other data
- The layer and head dropdowns are populated based on the model's structure
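The detection and dropdown steps above could look roughly like the following sketch. It is an assumption about how such logic might be written, not the dashboard's actual implementation. It works on a plain dict shaped like a model's `config.json`; the `model_type` values and the layer and head key names are the ones those files use for these families, while `FAMILY_BY_MODEL_TYPE`, `detect_family`, and `dropdown_options` are hypothetical names.

```python
# Hypothetical lookup from the `model_type` field of config.json
# to the family names used in the table above.
FAMILY_BY_MODEL_TYPE = {
    "gpt2": "GPT-2",
    "gpt_neo": "GPT-Neo",
    "gpt_neox": "Pythia",  # Pythia models use the GPT-NeoX architecture
    "opt": "OPT",
    "qwen2": "Qwen2",
}

# Different families store the same quantity under different keys.
LAYER_KEYS = ("n_layer", "num_layers", "num_hidden_layers")
HEAD_KEYS = ("n_head", "num_heads", "num_attention_heads")

def first_present(config, keys):
    """Return the value of the first key in `keys` that the config defines."""
    for key in keys:
        if key in config:
            return config[key]
    raise KeyError(f"none of {keys} found in config")

def detect_family(config):
    """Map a config dict to a family name, or 'unknown' if unrecognized."""
    return FAMILY_BY_MODEL_TYPE.get(config.get("model_type"), "unknown")

def dropdown_options(config):
    """Build zero-indexed layer and head choices from the model's structure."""
    n_layers = first_present(config, LAYER_KEYS)
    n_heads = first_present(config, HEAD_KEYS)
    return {"layers": list(range(n_layers)), "heads": list(range(n_heads))}

if __name__ == "__main__":
    gpt2_config = {"model_type": "gpt2", "n_layer": 12, "n_head": 12}
    print(detect_family(gpt2_config))
    print(dropdown_options(gpt2_config))
```

Zero-based indexing for layers and heads is also an assumption here; the dashboard may label them differently.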
Tips for Choosing
- Start with GPT-2: It's small, fast, and the most widely studied. Most educational resources reference GPT-2.
- Compare same-size models: GPT-2, GPT-Neo 125M, Pythia-160M, and OPT-125M all have 12 layers and 12 heads, so differences in their attention patterns come from architecture rather than scale.
- Compare scale: GPT-2 vs GPT-2 Medium (or Pythia-160M vs Pythia-410M) shows how more layers and heads change behavior.
- Try Qwen2.5 for a modern perspective: It combines RMSNorm, SiLU, and rotary positions, a different set of design choices from GPT-2.
- Memory matters: All dropdown models are small enough for interactive exploration. Larger models can be entered manually but may be slow.
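The same-size and scale comparisons suggested above can be read directly off the model table. The sketch below copies the layer and head counts from that table and groups them; the helper names `group_by_shape` and `scale_pairs` are made up for illustration.

```python
from collections import defaultdict

# (name, family, layers, heads), copied from the Available Models table.
MODELS = [
    ("GPT-2 (124M)", "GPT-2", 12, 12),
    ("GPT-2 Medium (355M)", "GPT-2", 24, 16),
    ("GPT-Neo 125M", "GPT-Neo", 12, 12),
    ("Pythia-160M", "Pythia", 12, 12),
    ("Pythia-410M", "Pythia", 24, 16),
    ("OPT-125M", "OPT", 12, 12),
    ("Qwen2.5-0.5B (494M)", "Qwen2", 24, 14),
]

def group_by_shape(models):
    """Group model names by (layers, heads): candidates for
    architecture-only comparisons."""
    groups = defaultdict(list)
    for name, _family, layers, heads in models:
        groups[(layers, heads)].append(name)
    return dict(groups)

def scale_pairs(models):
    """Pair each model with the next-larger model in the same family:
    candidates for scale comparisons."""
    by_family = defaultdict(list)
    for name, family, layers, _heads in models:
        by_family[family].append((layers, name))
    pairs = []
    for members in by_family.values():
        members.sort()  # smallest layer count first
        for (_, small), (_, large) in zip(members, members[1:]):
            pairs.append((small, large))
    return pairs

if __name__ == "__main__":
    for shape, names in group_by_shape(MODELS).items():
        print(shape, names)
    print(scale_pairs(MODELS))
```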
Generation Settings
After selecting a model and entering a prompt, you can configure:
- Number of Generation Choices (Beams): 1-5 beams. More beams explore more paths but take longer.
- Number of New Tokens: 1-20 tokens to generate. Shorter is faster.
Click Analyze to run the model and see results in the pipeline and generation sections.
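To see why more beams explore more paths, here is a toy beam search. It is a sketch under stated assumptions and not the dashboard's generation code: `TOY_MODEL` is an invented lookup table that stands in for a real language model, and `clamp_settings` simply enforces the 1-5 beam and 1-20 token ranges listed above.

```python
import math

# Invented next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"bird": 0.9, "cat": 0.1},
}

def clamp_settings(num_beams, max_new_tokens):
    """Keep the settings inside the ranges the dashboard accepts."""
    return min(max(num_beams, 1), 5), min(max(max_new_tokens, 1), 20)

def beam_search(start, num_beams, max_new_tokens):
    """Return up to `num_beams` (tokens, log_prob) pairs, best first."""
    num_beams, max_new_tokens = clamp_settings(num_beams, max_new_tokens)
    beams = [([start], 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            options = TOY_MODEL.get(tokens[-1], {})
            if not options:
                # No known continuation: carry the sequence forward unchanged.
                candidates.append((tokens, score))
                continue
            for token, prob in options.items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the highest-scoring sequences for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

if __name__ == "__main__":
    print(beam_search("<s>", num_beams=1, max_new_tokens=2)[0][0])
    print(beam_search("<s>", num_beams=2, max_new_tokens=2)[0][0])
```

With one beam the search commits to "the" (probability 0.6) and ends at "the cat" with total probability 0.33. With two beams it also keeps "a" alive and finds "a bird" at 0.36, a better sequence that the single-beam search never considers.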