Completed: Test Suite Setup (Done)
- Create `tests/` folder with `__init__.py` and `conftest.py` (shared fixtures)
- Create `test_model_config.py` - 15 tests for model family lookups
- Create `test_ablation_metrics.py` - 8 tests for KL divergence and probability deltas
- Create `test_head_detection.py` - 20 tests for attention head categorization
- Create `test_model_patterns.py` - 16 tests for `merge_token_probabilities`, `safe_to_serializable`
- Create `test_token_attribution.py` - 11 tests for visualization data formatting
- Verify all 73 tests pass with `pytest tests/ -v`
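For context on what the ablation-metric tests exercise, here is a minimal, illustrative KL-divergence and probability-delta computation. This is a sketch of the concept only, not the project's actual `test_ablation_metrics.py` code:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) in nats between two probability distributions.

    Measures how far the ablated distribution q has drifted from
    the baseline distribution p; 0.0 means the two are identical.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical next-token probabilities before and after an ablation
baseline = [0.7, 0.2, 0.1]
ablated = [0.5, 0.3, 0.2]

divergence = kl_divergence(baseline, ablated)
deltas = [a - b for a, b in zip(ablated, baseline)]  # per-token probability deltas
```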
Completed: Pipeline Explanation Refactor
Phase 1: New Components (Done)
- Create `components/pipeline.py` with 5 expandable stages
- Create `utils/token_attribution.py` with Integrated Gradients
- Create `components/investigation_panel.py` (ablation + attribution)
Phase 2: Simplifications (Done)
- Remove comparison UI from `model_selector.py`
- Refactor `app.py`: wire pipeline, remove heatmap/comparison callbacks
Phase 3: Cleanup (Done)
- Delete `main_panel.py`
- Delete `prompt_comparison.py`
- Update `utils/__init__.py` exports
- Add pipeline CSS styles to `assets/style.css`
Completed: Pipeline Clarity Improvements (Agent A)
- Rename "Max New Tokens:" to "Number of New Tokens:" in app.py
- Rename "Beam Width:" to "Number of Generation Choices:" in app.py
- Remove score display from generated sequences in app.py
- Update glossary to clarify "Number of Generation Choices" relates to Beam Search
Completed: Pipeline Clarity Improvements (Agent E)
- Modified output display to show full prompt with predicted token highlighted
- Fixed top-5 tokens hover to show "Token (X%)" format instead of long decimals
- Added Plotly hovertemplate for cleaner hover formatting
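The hover fix amounts to formatting probabilities as percentages rather than long decimals. A small sketch of the idea (the template string follows Plotly's hovertemplate syntax; `format_hover_label` is an illustrative helper, not the dashboard's code):

```python
# Plotly hovertemplate: %{y} is the token label, customdata[0] the percentage.
# <extra></extra> suppresses the default trace-name box next to the tooltip.
HOVER_TEMPLATE = "%{y} (%{customdata[0]:.1f}%)<extra></extra>"

def format_hover_label(token: str, prob: float) -> str:
    """Text fallback: render 'Token (X%)' instead of e.g. 0.4183920..."""
    return f"{token} ({prob * 100:.1f}%)"

print(format_hover_label("Paris", 0.41839))  # -> Paris (41.8%)
```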
Completed: Pipeline Clarity Improvements (Agent C)
- Add educational explanation for embedding stage (pre-learned lookup table concept)
- Add educational explanation for MLP stage (knowledge storage during training)
- Add educational explanation for attention stage (how to interpret BertViz visualization)
Completed: Pipeline Clarity Improvements (Agent B)
- Convert tokenization from horizontal three-column layout to vertical rows
- Each token row shows: [token] → [ID] → [embedding placeholder]
- Maintain existing color scheme and educational tooltips
- Update CSS styles for .tokenization-rows and .token-row layout
- Add responsive styles for mobile (stack on small screens)
Completed: Pipeline Clarity Improvements (Agent D)
- Switch generate_bertviz_html from model_view to head_view in model_patterns.py
- Deprecate _get_top_attended_tokens function (remove usage in extract_layer_data)
- Add generate_head_view_with_categories function for categorized attention heads
- Add get_head_category_counts helper function for UI display
- Run tests to verify no regressions (73 tests pass)
Completed: Pipeline Clarity Improvements (Agent F - Analysis Scope)
- Add session-original-prompt-store and session-selected-beam-store to app.py
- Modify run_generation to analyze ORIGINAL PROMPT only (not generated beam)
- Store beam generation results separately for post-experiment comparison
- Update analyze_selected_sequence to store beam for comparison instead of re-analyzing
- Update ablation experiment to show selected beam context in results
Completed: Pipeline Clarity Improvements (Agent G - Integration & Attention UI)
- Remove deprecated "Most attended tokens" section from attention stage
- Wire head categorization into attention stage UI (shows category counts)
- Add enhanced navigation instructions for BertViz head view
- Verify all 73 tests pass
Completed: UI/UX Fixes (5 Issues)
Issue 1: "Select for Comparison" Button
- Update store_selected_beam callback in app.py to update UI
- Clear all other generated sequences when one is selected
- Display selected sequence with "Selected for Comparison" badge
Issue 2: Tokenization Vertical Layout
- Modify create_tokenization_content in pipeline.py to use vertical layout
- Each row displays: [token] → [ID] with header row
Issue 3: Expandable Attention Categories
- Convert category chips to expandable `<details>` elements in pipeline.py
- Update app.py to pass full categorize_all_heads() data instead of counts
- Show list of heads (L0-H3, L2-H5, etc.) when category is expanded
Issue 4: BertViz Navigation Instructions
- Add single-click explanation: selects/deselects that head
- Add double-click explanation: selects only that head (deselects others)
Issue 5: Multi-Layer Ablation Head Selection
- Change ablation-selected-heads store to hold [{layer, head}, ...] objects
- Add create_selected_heads_display function in investigation_panel.py
- Show selected heads as chips with "x" buttons to remove
- Update head buttons to show visual selection state per layer
- Preserve selections across layer dropdown changes
- Update run_ablation_experiment to handle multi-layer ablation
- Verify all 73 tests pass
Completed: Fix Multi-Layer Ablation Bug
- Create `execute_forward_pass_with_multi_layer_head_ablation` in model_patterns.py
- Export new function in `utils/__init__.py`
- Replace per-layer ablation loop in app.py with single call to new function
- Add 5 tests for multi-layer ablation in test_model_patterns.py
- Verify all 78 tests pass
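The fix above replaces a per-layer ablation loop with one function that accepts `[{layer, head}, ...]` selections spanning multiple layers. A minimal sketch of the zeroing idea, assuming the attention output lays head slices side by side along the hidden dimension (function name and shapes here are illustrative, not the real `execute_forward_pass_with_multi_layer_head_ablation`):

```python
import numpy as np

def ablate_heads(attn_output, selected_heads, layer, n_heads):
    """Zero the hidden slice belonging to each ablated head in one layer.

    attn_output: (seq_len, hidden) attention output for `layer`
    selected_heads: list of {"layer": int, "head": int} dicts, possibly
        spanning several layers; only this layer's entries apply here.
    """
    seq_len, hidden = attn_output.shape
    head_dim = hidden // n_heads
    out = attn_output.copy()
    for sel in selected_heads:
        if sel["layer"] == layer:
            h = sel["head"]
            out[:, h * head_dim:(h + 1) * head_dim] = 0.0
    return out

# 4 tokens, hidden size 8, 2 heads of dimension 4; ablate layer 0, head 1
x = np.ones((4, 8))
y = ablate_heads(x, [{"layer": 0, "head": 1}], layer=0, n_heads=2)
```

In the real model this zeroing would be applied inside the forward pass (e.g. via hooks) for every layer that appears in the selection list, all in a single call.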
Completed: Codebase Cleanup
- Delete unused file: `components/tokenization_panel.py` (302 lines, 6 functions)
- Remove 6 unused imports from `app.py`
- Remove deprecated `_get_top_attended_tokens()` function from model_patterns.py
- Remove `top_attended_tokens` field from extract_layer_data() return values
- Remove unused `create_stage_summary()` function from pipeline.py
- Remove 7 unused utility functions from utils/: `get_check_token_probabilities`, `execute_forward_pass_with_layer_ablation`, `generate_category_bertviz_html`, `generate_head_view_with_categories`, `compute_sequence_trajectory`, `compute_layer_wise_summaries`, `compute_position_layer_matrix`
- Update `utils/__init__.py` exports
- Update README.md to remove reference to deleted file
Completed: AI Chatbot Integration
- Create `rag_docs/` folder with placeholder README for RAG documents
- Create `utils/gemini_client.py` with Gemini API wrapper (generate + embed)
- Create `utils/rag_utils.py` with document loading, chunking, and retrieval
- Create `components/chatbot.py` with UI components (icon, window, messages)
- Add chatbot CSS to `assets/style.css` (floating button, chat window, message bubbles)
- Modify `app.py` to add chat layout and callbacks
- Add `google-generativeai` to `requirements.txt`
- Test end-to-end: toggle, message send, context awareness
- Verify all 81 tests pass
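A minimal sketch of the retrieval step that `utils/rag_utils.py` performs over embedded document chunks (the real loading/chunking code is not shown here; `retrieve` and the chunk dict shape are assumptions for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, chunks, top_k=3):
    """Rank chunks ({"text": ..., "embedding": ...}) by similarity to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]),
                    reverse=True)
    return [c["text"] for c in ranked[:top_k]]
```

The top-ranked chunk texts are then prepended to the chat prompt as context before calling the generation API.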
Completed: Hugging Face Deployment Prep
- Create `.gitignore` to exclude `.env`, `__pycache__/`, etc.
- Add `load_dotenv()` to `app.py` for local development
- Create `tests/test_gemini_connection.py` to verify API key connectivity
- Tests verify: API key is set, can list models, flash model available
- Note: On Hugging Face Spaces, set `GEMINI_API_KEY` in Repository Secrets
Completed: Migrate to New Google GenAI SDK (Superseded)
- Update `requirements.txt`: `google-generativeai` → `google-genai>=1.0.0`
- Rewrite `utils/gemini_client.py` using new centralized Client architecture
utils/gemini_client.pyusing new centralized Client architecture - All 4 connection tests pass
- Verified: embeddings work (3072 dimensions), chat generation works
Completed: Migrate from Gemini to OpenRouter
- Create `utils/openrouter_client.py` with OpenAI-compatible API
  - Global model config: `DEFAULT_CHAT_MODEL` and `DEFAULT_EMBEDDING_MODEL`
  - Chat via: `POST /api/v1/chat/completions`
  - Embeddings via: `POST /api/v1/embeddings`
- Update `utils/rag_utils.py` imports to use openrouter_client
- Update `app.py` imports to use openrouter_client
- Create `tests/test_openrouter_connection.py` for API connectivity tests
- Delete old `utils/gemini_client.py` and `tests/test_gemini_connection.py`
- Update `requirements.txt`: remove `google-genai`, add `requests>=2.28.0`
- Environment variable: `GEMINI_API_KEY` → `OPENROUTER_API_KEY`
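A rough sketch of the OpenAI-compatible call shape listed above. The real `utils/openrouter_client.py` uses the `requests` package; this self-contained version uses only the standard library, and `build_chat_payload`/`chat` are illustrative names, not the client's actual API:

```python
import json
import os
import urllib.request

API_BASE = "https://openrouter.ai/api/v1"

def build_chat_payload(model, messages):
    """OpenAI-compatible request body for POST /api/v1/chat/completions."""
    return {"model": model, "messages": messages}

def chat(messages, model="google/gemini-2.5-flash-lite"):
    """Send a chat request; requires OPENROUTER_API_KEY in the environment."""
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(build_chat_payload(model, messages)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```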
Completed: Switch to Paid OpenRouter Models (Cost-Optimized)
- Evaluate OpenRouter models for chatbot use case (cost vs quality)
- Switch chat model: `google/gemini-2.5-flash-lite`
  - $0.10/$0.40 per 1M tokens (input/output)
  - 1M context window, 318 tok/s, multimodal
- Switch embedding model: `openai/text-embedding-3-small`
  - $0.02 per 1M tokens
  - 1536 dimensions, high quality
- Remove local `sentence-transformers` dependency (simpler, no TF conflicts)
- Estimated cost: ~$1.50/month for moderate usage
Completed: Enhance RAG Documents for Chatbot
- Category 1: 8 general LLM/Transformer knowledge files (what_is_an_llm.md through key_terminology.md)
- Category 2: 7 dashboard component documentation files (dashboard_overview.md through model_selector_guide.md)
- Category 3: 3 model-specific documentation files (gpt2_overview.md, llama_overview.md, opt_overview.md)
- Category 4: 6 step-by-step guided experiment files (experiment_first_analysis.md through experiment_beam_search.md)
- Category 5: 6 interpretation/troubleshooting/research files (interpreting_*.md, troubleshooting_and_faq.md, recommended_starting_points.md, mechanistic_interpretability_intro.md)
- Delete embeddings_cache.json, update rag_docs/README.md with full inventory
- Update todo.md and conductor docs
- Total: 30 RAG documents covering transformer concepts, dashboard usage, guided experiments, interpretation, troubleshooting, and research context
Completed: Resizable Chat Window
- Add drag handle div to left edge of chat window in chatbot.py
- Add CSS for `.chat-resize-handle` (cursor, hover highlight) in style.css
- Add `assets/chat_resize.js` for mousedown/mousemove/mouseup drag logic
- Default size unchanged (25vw, 320px–450px); drag overrides max-width up to 80vw
Completed: Full-Sequence Attention Visualization
- Modified `run_generation()` in app.py: for single-token generation, run forward pass on full beam text (prompt + generated token) instead of prompt only
- Enhanced `store_selected_beam()` in app.py: added `session-activation-store` output and `session-activation-store` State; re-runs `execute_forward_pass()` on the selected beam's full text when the user picks a beam
- Added 5 tests in `test_model_patterns.py` (TestFullSequenceAttentionData) verifying attention matrix dimensions match full sequence length
- Attention visualization now covers the entire chosen output (input + generated tokens), not just the input prompt
- No changes needed in `model_patterns.py`, `beam_search.py`, `pipeline.py`, or `head_detection.py`
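A sketch of the property those 5 dimension tests verify, using a hypothetical checker (names and the per-layer `(n_heads, seq, seq)` shape are assumptions, not the actual test code):

```python
import numpy as np

def check_attention_covers_full_sequence(attention_per_layer, full_token_count):
    """Assert each layer's attention is square over the FULL beam text
    (prompt + generated tokens), not just the prompt."""
    for layer_attn in attention_per_layer:
        n_heads, q_len, k_len = np.asarray(layer_attn).shape
        assert q_len == k_len == full_token_count
    return True

# Toy check: 2 layers, 2 heads, a 5-token full sequence
fake_attention = [np.zeros((2, 5, 5)), np.zeros((2, 5, 5))]
check_attention_covers_full_sequence(fake_attention, 5)
```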
Completed: Output Token Scrubber
- Add `compute_per_position_top5()` to `utils/model_patterns.py`: extracts top-5 next-token probabilities at each generated-token position from a single forward pass
- Add `original_prompt` parameter to `execute_forward_pass()`: when provided, computes per-position top-5 data and stores it in activation_data
- Export `compute_per_position_top5` in `utils/__init__.py`
- Update `run_generation()` in app.py: passes `original_prompt=prompt` for single-token generation
- Update `store_selected_beam()` in app.py: reads original prompt from session store and passes it to the forward pass
- Rewrite `create_output_content()` in `components/pipeline.py`: scrubber mode with `dcc.Slider`, token display, and top-5 chart; falls back to static mode when no per-position data
- Add `_build_token_display()` and `_build_top5_chart()` helpers in pipeline.py
- Add `update_output_scrubber()` callback in app.py: responds to slider changes, updates token highlight and chart
- Update `update_pipeline_content()` in app.py: extracts per-position data and passes it to output content
- Add 10 tests for `compute_per_position_top5` in `test_model_patterns.py`
- Fix `conftest.py` to set `USE_TF=0` for test import compatibility
- All 100 tests pass
- Scrubber shows prompt context (gray) + highlighted token (cyan) + top-5 bar chart at each slider position
- Pre-beam-selection falls back to static output display; scrubber activates after beam selection or single-token generation
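The scrubber's data all comes from one forward pass over the full text. A minimal sketch of the per-position top-5 extraction idea (signature, shapes, and indexing here are assumptions, not the actual `compute_per_position_top5()`):

```python
import numpy as np

def compute_per_position_top5(logits, prompt_len, k=5):
    """Top-k next-token candidates at each generated-token position.

    logits: (seq_len, vocab) from one forward pass over prompt + output.
    Logits at position i predict token i+1, so the first generated token
    is predicted at index prompt_len - 1.
    Returns, per generated position, a list of (token_id, probability).
    """
    results = []
    for pos in range(prompt_len - 1, logits.shape[0] - 1):
        row = logits[pos]
        probs = np.exp(row - row.max())  # numerically stable softmax
        probs /= probs.sum()
        top = np.argsort(probs)[::-1][:k]
        results.append([(int(t), float(probs[t])) for t in top])
    return results
```

Each slider position in the scrubber then just indexes into this precomputed list to drive the token highlight and the top-5 bar chart.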