LLMVis / rag_docs /model_selector_guide.md

cdpearlman

New models chosen and rag docs updated

f478beb 18 days ago

	# Model Selector Guide

	## How to Choose a Model

	The dashboard supports seven transformer models from five architecture families. Select a model from the dropdown menu in the generator section at the top of the page.

	### Available Models

	| Model | Family | Params (non-embedding) | Layers × Heads | Key Feature |
	|-------|--------|------------------------|----------------|-------------|
	| GPT-2 (124M) | GPT-2 | 85M | 12 × 12 | The mechanistic interpretability (MI) classic; start here |
	| GPT-2 Medium (355M) | GPT-2 | 302M | 24 × 16 | Scale comparison within GPT-2 |
	| GPT-Neo 125M | GPT-Neo | 85M | 12 × 12 | Local attention in alternating layers |
	| Pythia-160M | Pythia | 85M | 12 × 12 | Rotary PE, parallel attn+MLP |
	| Pythia-410M | Pythia | 302M | 24 × 16 | Larger Pythia for scale comparison |
	| OPT-125M | OPT | 85M | 12 × 12 | ReLU activation (unique contrast) |
	| Qwen2.5-0.5B (494M) | Qwen2 | 391M | 24 × 14 | Modern: RMSNorm, SiLU, rotary PE |
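
The Params values are smaller than the sizes in the model names. One reading that reproduces the 85M and 302M figures is that they count only attention and MLP weights, leaving out embeddings and biases. The sketch below checks that arithmetic. The function name and the hidden sizes (768 and 1024) are assumptions taken from the public model configs, not from this guide, and the formula does not cover Qwen2.5's gated MLP.

```python
def non_embedding_params(n_layers: int, d_model: int) -> int:
    """Estimate attention + MLP weights, ignoring embeddings and biases.

    Per layer: 4 * d_model^2 for attention (Q, K, V, and output projections)
    plus 8 * d_model^2 for an MLP whose hidden width is 4 * d_model.
    """
    return n_layers * 12 * d_model * d_model


# Hidden sizes below are assumptions (12-layer models: 768, 24-layer models: 1024).
print(non_embedding_params(12, 768))   # 84934656, about 85M
print(non_embedding_params(24, 1024))  # 301989888, about 302M
```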

	### Architecture Comparisons

	These models were chosen to highlight specific architectural differences:

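	- Positional encoding: GPT-2, GPT-Neo, and OPT use learned absolute position embeddings. Pythia and Qwen use rotary position embeddings (RoPE). Running the same prompt through one model of each type shows how the positional encoding (PE) affects attention patterns.
	- Activation function: Most models use GELU, but OPT uses ReLU and Qwen uses SiLU. This changes MLP behavior: ReLU outputs exact zeros for negative inputs, while GELU and SiLU are smooth and let small negative values through.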
	- Normalization: GPT-2, Neo, Pythia, and OPT use LayerNorm. Qwen uses RMSNorm.
	- Attention scope: GPT-Neo alternates between local (256-token window) and global attention, unlike all other models.
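
To make the local versus global distinction concrete, here is a minimal sketch of the two mask shapes. It is an illustration only: `attention_mask` is a hypothetical helper, the window is shrunk from 256 to 3 so the output is readable, and counting the current token as part of the window is an assumed convention.

```python
def attention_mask(seq_len, window=None):
    """Build mask[q][k]: True where query position q may attend to key position k.

    window=None gives ordinary causal (global) attention; an integer limits each
    query to the most recent `window` positions, itself included.
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


global_mask = attention_mask(5)
local_mask = attention_mask(5, window=3)  # toy window; the text gives 256 for GPT-Neo

print(global_mask[4])  # [True, True, True, True, True]
print(local_mask[4])   # [False, False, True, True, True]
```

In a local layer, the last token can no longer see the first two positions, which is the kind of difference that should show up as a band along the diagonal in an attention heatmap.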

	### What Happens When You Load a Model

	1. The model is downloaded from the Hugging Face Hub (this may take a moment the first time)
	2. The dashboard auto-detects the model's architecture family
	3. Internal hooks are automatically configured to capture attention patterns, MLP activations, and other data
	4. The layer and head dropdowns are populated based on the model's structure
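
Steps 2 and 4 above could be sketched roughly as follows. This is not the dashboard's actual code: `detect_family` and `dropdown_options` are hypothetical names, and the lookup assumes detection is keyed on the `model_type` string that Hugging Face configs expose.

```python
# Hypothetical lookup from a Hugging Face config `model_type` string
# to the family names used in the Available Models table.
FAMILY_BY_MODEL_TYPE = {
    "gpt2": "GPT-2",
    "gpt_neo": "GPT-Neo",
    "gpt_neox": "Pythia",  # Pythia checkpoints use the GPT-NeoX architecture
    "opt": "OPT",
    "qwen2": "Qwen2",
}


def detect_family(model_type: str) -> str:
    """Return the architecture family for a config's model_type string."""
    try:
        return FAMILY_BY_MODEL_TYPE[model_type]
    except KeyError:
        raise ValueError(f"Unsupported model_type: {model_type!r}") from None


def dropdown_options(n_layers: int, n_heads: int):
    """Return zero-based layer and head indices for the two dropdowns."""
    return list(range(n_layers)), list(range(n_heads))


layers, heads = dropdown_options(n_layers=24, n_heads=14)  # Qwen2.5-0.5B row
print(detect_family("qwen2"), len(layers), len(heads))
```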

	### Tips for Choosing

	- Start with GPT-2: It's small, fast, and the most widely studied. Most educational resources reference GPT-2.
	- Compare same-size models: GPT-2, GPT-Neo 125M, Pythia-160M, and OPT-125M all have 12 layers and 12 heads, so differences in their attention patterns reflect architecture and training data rather than scale.
	- Compare scale: GPT-2 vs GPT-2 Medium (or Pythia-160M vs Pythia-410M) shows how more layers and heads change behavior.
	- Try Qwen2.5 for a modern perspective: It combines RMSNorm, SiLU, and rotary position embeddings, none of which GPT-2 uses.
	- Memory matters: All dropdown models are small enough for interactive exploration. Larger models can be entered manually but may be slow.

	### Generation Settings

	After selecting a model and entering a prompt, you can configure:
	- Number of Generation Choices (Beams): 1-5 beams. More beams explore more paths but take longer.
	- Number of New Tokens: 1-20 tokens to generate. Shorter is faster.

	Click Analyze to run the model and see results in the pipeline and generation sections.
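
As a rough sketch of how the two settings could be validated and passed on, the helper below enforces the ranges stated above. `generation_kwargs` is a hypothetical name. The dictionary keys match the `num_beams` and `max_new_tokens` arguments of Hugging Face's `generate` method, but whether the dashboard calls `generate` this way is an assumption.

```python
def generation_kwargs(num_beams: int, new_tokens: int) -> dict:
    """Validate the two dashboard settings and return generation arguments."""
    if not 1 <= num_beams <= 5:
        raise ValueError("Number of beams must be between 1 and 5")
    if not 1 <= new_tokens <= 20:
        raise ValueError("Number of new tokens must be between 1 and 20")
    return {"num_beams": num_beams, "max_new_tokens": new_tokens}


print(generation_kwargs(num_beams=3, new_tokens=10))
# {'num_beams': 3, 'max_new_tokens': 10}
```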
