# Implementation Tracker: Glass-Box Dashboard

**Project:** PhD Study - Making Architecture Transparent for Code Generation
**Timeline:** 8 weeks (November 2025 - December 2025)
**Status:** Week 1 - In Progress
**Last Updated:** 2025-11-01

---

## Overview

This document tracks progress through the 8-week implementation plan outlined in the PhD Study Specification. Each week has specific deliverables, acceptance criteria, and links to relevant code/files.

---

## Week 1-2: Core Model Instrumentation

**Goal:** Implement PyTorch hooks, tokenizer instrumentation, zarr storage, and a minimal API endpoint.

**Status:** 🟡 In Progress

### Tasks

#### 1.1 PyTorch Hooks for Attention & Residuals

- [ ] Add forward hooks to capture attention tensors `A[L,H,T,T]`
- [ ] Capture residual norms `||x_l||` per layer
- [ ] Capture logits, logprobs, entropy per token
- [ ] Record timing per layer (latency profiling)
- [ ] Optional: FFN activations for future SAE integration

**Files:** `/backend/model_service.py`, `/backend/instrumentation.py` (new)

**Acceptance Criteria:**

- Attention tensors stored with shape (num_layers, num_heads, seq_len, seq_len)
- Residual norms array with shape (num_layers, seq_len)
- Per-token metadata includes logprob, entropy, timing
- Hook overhead < 10ms per layer on average

**Notes:**

---

#### 1.2 Tokenizer Instrumentation

- [ ] Capture BPE/SentencePiece subword splits
- [ ] Record byte length per token
- [ ] Store token IDs and text
- [ ] Identify multi-split identifiers (≥3 subwords)

**Files:** `/backend/tokenizer_utils.py` (new)

**Acceptance Criteria:**

- Each token has `bpe: [subword1, subword2, ...]` field
- Byte length calculated correctly (matches `len(token.encode('utf-8'))`)
- Multi-split identifiers flagged with `multi_split: true`

**Notes:**

---

#### 1.3 Zarr/Memmap Storage Layer

- [ ] Implement zarr writer with chunking strategy `(layer, head)`
- [ ] Create directory structure: `runs/{run_id}/tensors/`
- [ ] Store attention, residuals, logits as zarr arrays
- [ ] Implement lazy loading for frontend access

**Files:** `/backend/storage.py` (new), `/backend/zarr_utils.py` (new)

**Acceptance Criteria:**

- Zarr arrays created with correct chunking
- File size reasonable (< 500MB for a 512-token generation with 32 layers)
- Load time < 50ms for single layer/head slice
- Compression ratio > 3x (use Blosc)

**Notes:**

---

#### 1.4 Minimal API Endpoint `/analyze/study`

- [ ] Create POST endpoint accepting prompt + generation params
- [ ] Generate Run ID (format: `R{date}-{time}-{hash}`)
- [ ] Implement deterministic generation (fixed seed)
- [ ] Return minimal data contract JSON
- [ ] Store telemetry (JSONL format)

**Files:** `/backend/model_service.py`

**API Contract:**

```json
POST /analyze/study
{
  "prompt": "def factorial(n):",
  "max_tokens": 50,
  "seed": 42,
  "temperature": 0.0,
  "instrumentation": ["attention", "residuals", "tokenizer"]
}

Response:
{
  "run_id": "R2025-11-01-1430-a7f3",
  "tokens": [...],  // minimal data contract
  "tensor_path": "runs/R2025-11-01-1430-a7f3/tensors/",
  "telemetry_path": "runs/R2025-11-01-1430-a7f3/telemetry.jsonl"
}
```

**Acceptance Criteria:**

- Endpoint returns in < 5s for 50-token generation
- Run ID is unique; output is reproducible with the same seed
- Telemetry JSONL created with `run.start` and `run.end` events
- Tensors stored in zarr format

**Notes:**

---

#### 1.5 Attention Rollout & Head Ranking

- [ ] Implement attention rollout algorithm (Abnar & Zuidema-style)
- [ ] Rank heads by rollout contribution (top-k = 20)
- [ ] Store head rankings in run metadata

**Files:** `/backend/attention_analysis.py` (new)

**Acceptance Criteria:**

- Rollout matrix computed efficiently (< 100ms for 512 tokens)
- Top-20 heads identified by max rollout weight
- Rankings stored in `runs/{run_id}/metadata.json`

**Notes:**

---

### Week 1-2 Acceptance Criteria (Overall)

- [ ] All 5 tasks completed
- [ ] Latency < 250ms for ≤512 tokens (measured end-to-end)
- [ ] Zarr storage working correctly (can reload tensors)
- [ ] API endpoint functional (manual test via curl/Postman)
- [ ] Run reproducibility verified (same seed → same output)

### Blockers

- **None yet**

### Decisions Made

- **2025-11-01:** Using zarr instead of HDF5 for better chunking and parallel access.

---

## Week 3: Attention Visualization

**Goal:** Build interactive attention heatmap, head grid, and rollout toggle.

**Status:** 🔴 Not Started

### Tasks

#### 3.1 Frontend: Attention Heatmap (WebGL)

- [ ] Create `/components/study/AttentionVisualization.tsx`
- [ ] Implement WebGL-based heatmap for performance
- [ ] Add hover tooltips showing exact attention weights
- [ ] Support aggregated (all heads) and per-head views

**Files:** `/components/study/AttentionVisualization.tsx`

**Acceptance Criteria:**

- Renders 512x512 heatmap in < 100ms
- Hover shows source token, target token, weight
- Toggle between aggregated and per-head

**Notes:**

---

#### 3.2 Frontend: Head Grid (Layer × Head Matrix)

- [ ] Display Layer × Head grid with mini-sparklines
- [ ] Show mean attention to token classes (identifiers, operators, etc.)
- [ ] Click head → overlay on main heatmap

**Files:** `/components/study/HeadGrid.tsx`

**Acceptance Criteria:**

- Grid renders 32×32 cells in < 50ms
- Sparklines show attention distribution
- Click interaction works smoothly

**Notes:**

---

#### 3.3 Attention Rollout Toggle

- [ ] Add toggle button: Raw Attention vs Rollout
- [ ] Fetch rollout data from backend
- [ ] Update heatmap dynamically

**Files:** `/components/study/AttentionVisualization.tsx`

**Acceptance Criteria:**

- Toggle switches view in < 100ms
- Rollout data fetched lazily (not on initial load)

**Notes:**

---

#### 3.4 Interactions: Brush & Pin

- [ ] Implement brush selection on context tokens
- [ ] Highlight downstream tokens impacted by selection
- [ ] Add "pin" button to save source→target pair for ablation

**Files:** `/components/study/AttentionVisualization.tsx`

**Acceptance Criteria:**

- Brush selection responsive (< 50ms)
- Pinned pairs visible in sidebar
- Pin data passed to Ablation pane

**Notes:**

---

#### 3.5 Disclaimer & Warnings

- [ ] Add text: "Attention is descriptive; causal claims require ablation"
- [ ] Warn if temperature > 1.2 or top-k sampling active

**Files:** `/components/study/AttentionVisualization.tsx`

**Acceptance Criteria:**

- Disclaimer visible at top of pane
- Warnings shown contextually

**Notes:**

---

### Week 3 Acceptance Criteria (Overall)

- [ ] Attention visualization fully functional
- [ ] Interactive latency < 150ms for all operations
- [ ] Cross-links to Ablation pane working
- [ ] Manual test with Code Llama 7B (50-token generation)

### Blockers

### Decisions Made

---

## Week 4: Token Size & Confidence Visualization

**Goal:** Build token chip bar, entropy sparkline, and risk hotspot flags.
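
As a rough illustration of the signals behind this goal, the sketch below computes per-token entropy (in nats) and `exp(logprob)` confidence from raw logits, then applies a hotspot rule. It assumes the tracker's initial entropy threshold of 1.5 nats and its ≥3-subword rule for multi-split identifiers. All function and constant names are hypothetical and are not taken from the planned `/backend/risk_analysis.py`. It uses only the standard library, not PyTorch, to stay self-contained.

```python
import math
from typing import List, Sequence

# Assumed defaults: the tracker lists 1.5 nats as the initial, tunable threshold
# and flags identifiers that split into 3 or more subwords.
TAU_H = 1.5
MIN_SPLITS = 3


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (nats) of the next-token distribution."""
    logprobs = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in logprobs)


def token_confidence(logits: Sequence[float], chosen_id: int) -> float:
    """Confidence as exp(logprob) of the token that was actually emitted."""
    return math.exp(log_softmax(logits)[chosen_id])


def is_risk_hotspot(num_subwords: int, entropy: float, tau_h: float = TAU_H) -> bool:
    """Flag a token that is both a multi-split identifier and an entropy peak."""
    return num_subwords >= MIN_SPLITS and entropy >= tau_h


if __name__ == "__main__":
    uniform = [0.0] * 8  # maximally uncertain over 8 candidate tokens
    h = token_entropy(uniform)
    print(f"entropy={h:.3f} nats, confidence={token_confidence(uniform, 0):.3f}")
    print("hotspot:", is_risk_hotspot(num_subwords=3, entropy=h))
```
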

**Status:** 🔴 Not Started

### Tasks

#### 4.1 Frontend: Token Chip Bar

- [ ] Create `/components/study/TokenConfidenceView.tsx`
- [ ] Render tokens as chips: width = byte length, opacity = confidence
- [ ] Add click handler to show tokenization + top-k alternatives

**Files:** `/components/study/TokenConfidenceView.tsx`

**Acceptance Criteria:**

- Chips render correctly with variable widths
- Opacity maps to confidence (1 - normalized entropy, or exp(logprob))
- Click shows detailed panel

**Notes:**

---

#### 4.2 Frontend: Entropy Sparkline

- [ ] Add sparkline above/below token bar showing entropy per token
- [ ] Highlight peaks (entropy ≥ τ_H, initially 1.5 nats)
- [ ] Add calibration toggle (show thresholds for keywords/identifiers/operators)

**Files:** `/components/study/TokenConfidenceView.tsx`

**Acceptance Criteria:**

- Sparkline renders in < 50ms
- Peaks clearly visible
- Threshold adjustable via slider

**Notes:**

---

#### 4.3 Risk Hotspot Flags

- [ ] Flag identifiers split into ≥3 subwords AND entropy peak
- [ ] Display flag icon on token chips
- [ ] Compute Bug-risk AUC (requires ground truth bug locations)

**Files:** `/components/study/TokenConfidenceView.tsx`, `/backend/risk_analysis.py` (new)

**Acceptance Criteria:**

- Flags appear on relevant tokens
- AUC metric computed (requires pilot data)

**Notes:**

---

#### 4.4 Top-k Alternatives Panel

- [ ] Show top-k alternatives with probabilities on token click
- [ ] Display attention snippet (which context tokens justified each alternative)

**Files:** `/components/study/TokenConfidenceView.tsx`

**Acceptance Criteria:**

- Panel shows top-3 alternatives minimum
- Attention snippet links to Attention visualization

**Notes:**

---

#### 4.5 Cost/Latency Estimator

- [ ] Add widget showing cumulative decoding time
- [ ] Estimate API cost (tokens × price per token)

**Files:** `/components/study/TokenConfidenceView.tsx`

**Acceptance Criteria:**

- Time displayed in ms
- Cost displayed in USD (or N/A for local)

**Notes:**

---

### Week 4 Acceptance Criteria (Overall)

- [ ] Token Size & Confidence view functional
- [ ] Risk hotspots flagged correctly
- [ ] Interactive latency < 150ms
- [ ] Manual test with Code Llama 7B

### Blockers

### Decisions Made

---

## Week 5: Ablation Visualization

**Goal:** Build interactive ablation controls with head toggles, layer bypass, and diff viewer.

**Status:** 🔴 Not Started

### Tasks

#### 5.1 Backend: Ablation Engine

- [ ] Implement head masking (zero out or uniform attention)
- [ ] Implement layer bypass (skip layer, pass residual through)
- [ ] Support token constraints (force/ban specific tokens)
- [ ] Add surrogate regressor for predicted Δlog-prob

**Files:** `/backend/ablation_engine.py` (new)

**Acceptance Criteria:**

- Ablation runs in < 3s for single head mask
- Surrogate predictor accuracy > 70% (train on 100 samples)
- Queue system for background ablation execution

**Notes:**

---

#### 5.2 Frontend: Head Toggle Matrix

- [ ] Create `/components/study/AblationView.tsx`
- [ ] Display Layer × Head matrix with checkboxes
- [ ] Show only top-20 heads (from Week 1-2 ranking)

**Files:** `/components/study/AblationView.tsx`

**Acceptance Criteria:**

- Matrix renders in < 50ms
- Checkboxes responsive
- Selected heads highlighted

**Notes:**

---

#### 5.3 Frontend: Diff Viewer

- [ ] Show unified diff between baseline and ablated output
- [ ] Highlight changed tokens (color-coded: added/removed/modified)
- [ ] Display code-aware metrics (tests passed, AST parse, lints)

**Files:** `/components/study/AblationView.tsx`

**Acceptance Criteria:**

- Diff renders clearly
- Metrics displayed prominently
- Color-coding accessible (colorblind-friendly)

**Notes:**

---

#### 5.4 Frontend: Per-Token Delta Heat

- [ ] Show Δlog-prob and Δentropy per token
- [ ] Display as small multiples for most-impactful heads

**Files:** `/components/study/AblationView.tsx`

**Acceptance Criteria:**

- Delta heat visible
- Most-impactful heads identified (Δlog-prob ≥ τ_Δ)

**Notes:**

---

#### 5.5 Integration with Attention View

- [ ] Accept pinned source→target pairs from Attention view
- [ ] Auto-suggest heads to ablate based on attention weights

**Files:** `/components/study/AblationView.tsx`

**Acceptance Criteria:**

- Pinned pairs appear in Ablation pane
- Suggested heads shown with explanation

**Notes:**

---

### Week 5 Acceptance Criteria (Overall)

- [ ] Ablation view functional
- [ ] Head masking works correctly (verified with manual test)
- [ ] Diff viewer shows meaningful changes
- [ ] Code-aware metrics computed (AST, tests, lints)

### Blockers

### Decisions Made

---

## Week 6: Pipeline Visualization

**Goal:** Build swimlane timeline with residual-z, entropy shift, and layer signals.

**Status:** 🔴 Not Started

### Tasks

#### 6.1 Backend: Layer-Level Signals

- [ ] Compute residual-norm z-scores
- [ ] Compute entropy shift (pre vs post-layer)
- [ ] Compute attention-flow saturation
- [ ] Optional: router load for MoE models

**Files:** `/backend/pipeline_analysis.py` (new)

**Acceptance Criteria:**

- Signals computed in < 50ms
- Residual-z outliers flagged (> 2σ)
- Entropy shifts tracked per layer

**Notes:**

---

#### 6.2 Frontend: Swimlane Timeline

- [ ] Create `/components/study/PipelineView.tsx`
- [ ] Display lanes: Tokenizer → Embeddings → Layers → Logits → Sampler → Tests
- [ ] Rectangle length = time per stage
- [ ] Color intensity = uncertainty (entropy)

**Files:** `/components/study/PipelineView.tsx`

**Acceptance Criteria:**

- Swimlane renders in < 100ms
- Hover shows per-stage stats
- Timeline scrubber works smoothly

**Notes:**

---

#### 6.3 Layer Signal Overlays

- [ ] Add overlays for residual-z, entropy shift, attention saturation
- [ ] Toggle visibility of each signal
- [ ] Highlight bottlenecks (top-q percentile of latency/residual-z)

**Files:** `/components/study/PipelineView.tsx`

**Acceptance Criteria:**

- Overlays don't clutter visualization
- Bottlenecks clearly marked
- Toggle responsive

**Notes:**

---

#### 6.4 Layer Bypass Interaction

- [ ] Add controls to bypass ≤2 layers
- [ ] Show predicted impact (via surrogate)
- [ ] Execute queued ablation

**Files:** `/components/study/PipelineView.tsx`

**Acceptance Criteria:**

- Bypass controls accessible
- Predicted impact shown before execution
- Ablation queued in background

**Notes:**

---

#### 6.5 Cross-Links to Other Views

- [ ] Click token → highlight in Attention and Token Confidence views
- [ ] Integrated telemetry (track hover/click events)

**Files:** `/components/study/PipelineView.tsx`

**Acceptance Criteria:**

- Cross-highlighting works
- Telemetry logged

**Notes:**

---

### Week 6 Acceptance Criteria (Overall)

- [ ] Pipeline view functional
- [ ] Layer signals computed correctly
- [ ] Interactive latency < 150ms
- [ ] Manual test with Code Llama 7B

### Blockers

### Decisions Made

---

## Week 7: Pilot Study (n=3)

**Goal:** Run pilot with 3 participants; tune thresholds; validate latency; gather feedback.

**Status:** 🔴 Not Started

### Tasks

#### 7.1 Recruit Pilot Participants

- [ ] Identify 3 software engineers (varied experience levels)
- [ ] Schedule 90-minute sessions

**Acceptance Criteria:**

- 3 participants confirmed
- Availability scheduled

**Notes:**

---

#### 7.2 Prepare Study Materials

- [ ] Task T1: Code completion (sanitize_sql_like)
- [ ] Task T2: Bug fix (reverse_string)
- [ ] Pre-survey (demographics, LLM familiarity)
- [ ] Post-task mini-survey (SCS, Trust, NASA-TLX)
- [ ] Interview questions

**Files:** `/docs/pilot-study-materials.md` (new)

**Acceptance Criteria:**

- Materials ready to distribute
- Survey forms created (Google Forms or similar)

**Notes:**

---

#### 7.3 Run Pilot Sessions

- [ ] Session 1: Participant P01
- [ ] Session 2: Participant P02
- [ ] Session 3: Participant P03

**Acceptance Criteria:**

- All 3 sessions completed
- Telemetry logged
- Surveys completed

**Notes:**

---

#### 7.4 Analyze Pilot Data & Tune Thresholds

- [ ] Compute latency statistics (mean, p95)
- [ ] Tune τ_H (entropy threshold) for ~90% specificity
- [ ] Tune τ_Δ (log-prob delta) for ablation sensitivity
- [ ] Tune τ_z (residual-norm outlier)

**Files:** `/docs/pilot-analysis.md` (new)

**Acceptance Criteria:**

- Thresholds tuned based on pilot data
- Latency < 250ms (if not, optimize)
- Survey completion rate ≥ 90%

**Notes:**

---

#### 7.5 Iterate on UX

- [ ] Add tooltips/warnings based on pilot feedback
- [ ] Fix any UX issues (confusing interactions, unclear labels)
- [ ] Update documentation

**Acceptance Criteria:**

- At least 2 UX improvements implemented
- Pilot participants' feedback documented

**Notes:**

---

### Week 7 Acceptance Criteria (Overall)

- [ ] Pilot study completed successfully
- [ ] Thresholds tuned
- [ ] Latency validated (< 250ms)
- [ ] UX improvements identified and implemented

### Blockers

### Decisions Made

---

## Week 8: Main Study Preparation

**Goal:** Finalize study tooling, prepare OSF pre-registration, and set up participant recruitment.

**Status:** 🔴 Not Started

### Tasks

#### 8.1 Survey Integration

- [ ] Integrate SUS, NASA-TLX, SCS scales into dashboard
- [ ] Add pre-survey and post-task mini-surveys
- [ ] Export survey data to CSV

**Files:** `/components/study/SurveyModal.tsx` (new)

**Acceptance Criteria:**

- Surveys embedded in dashboard
- Data exported correctly

**Notes:**

---

#### 8.2 Latin Square Counterbalancing

- [ ] Implement Latin square assignment for task order
- [ ] Randomize condition order (Baseline vs Dashboard)

**Files:** `/lib/study-randomization.ts` (new)

**Acceptance Criteria:**

- Counterbalancing correct (verified manually)
- Participant assigned random ID (P01-P24)

**Notes:**

---

#### 8.3 OSF Pre-Registration

- [ ] Complete OSF template (Appendix D from spec)
- [ ] Upload task stimuli, exclusion criteria
- [ ] Submit pre-registration

**Files:** `/docs/osf-preregistration.md` (copy of Appendix D)

**Acceptance Criteria:**

- Pre-registration submitted before main study
- DOI obtained

**Notes:**

---

#### 8.4 Export Artifact Bundle

- [ ] Create script to package Run ID, tensors, telemetry
- [ ] Generate `run_pack_P01.zip` for each participant
- [ ] Test import into OSF

**Files:** `/scripts/export_artifact.py` (new)

**Acceptance Criteria:**

- Export script functional
- Bundle includes all necessary files
- Bundle < 100MB per participant

**Notes:**

---

#### 8.5 Participant Recruitment

- [ ] Prepare recruitment email
- [ ] Post to developer communities (Reddit, Hacker News, university mailing lists)
- [ ] Target n=18-24 participants

**Acceptance Criteria:**

- Recruitment materials ready
- At least 10 participants confirmed

**Notes:**

---

### Week 8 Acceptance Criteria (Overall)

- [ ] Study tooling finalized
- [ ] OSF pre-registration submitted
- [ ] Participant recruitment underway
- [ ] Ready to begin main study (Week 9-10)

### Blockers

### Decisions Made

---

## Progress Summary

| Week | Status | Completion Date | Notes |
|------|--------|----------------|-------|
| Week 1-2: Instrumentation | 🟡 In Progress | - | Started 2025-11-01 |
| Week 3: Attention Viz | 🔴 Not Started | - | - |
| Week 4: Token Confidence Viz | 🔴 Not Started | - | - |
| Week 5: Ablation Viz | 🔴 Not Started | - | - |
| Week 6: Pipeline Viz | 🔴 Not Started | - | - |
| Week 7: Pilot Study | 🔴 Not Started | - | - |
| Week 8: Main Study Prep | 🔴 Not Started | - | - |

**Legend:**

- 🟢 Completed
- 🟡 In Progress
- 🔴 Not Started
- 🔵 Blocked

---

## Global Blockers

*None currently*

---

## Key Metrics (Target vs Actual)

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Initial render latency (≤512 tokens) | < 250ms | - | - |
| Interactive update latency | < 150ms | - | - |
| Zarr file size (512 tokens, 32 layers) | < 500MB | - | - |
| Zarr load time (single layer/head) | < 50ms | - | - |
| Attention rollout computation | < 100ms | - | - |
| Ablation execution time | < 3s | - | - |

---

## Notes & Decisions Log

### 2025-11-01

- **Decision:** Using zarr instead of HDF5 for tensor storage due to better chunking and parallel access.
- **Decision:** Targeting top-k=20 heads for ablation UI (performance constraint).
- **Note:** Started Week 1-2 instrumentation tasks.

---

**End of Implementation Tracker**