Implementation Tracker: Glass-Box Dashboard
Project: PhD Study - Making Architecture Transparent for Code Generation
Timeline: 8 weeks (November 2025 - December 2025)
Status: Week 1 - In Progress
Last Updated: 2025-11-01
Overview
This document tracks progress through the 8-week implementation plan outlined in the PhD Study Specification. Each week has specific deliverables, acceptance criteria, and links to relevant code/files.
Week 1-2: Core Model Instrumentation
Goal: Implement PyTorch hooks, tokenizer instrumentation, zarr storage, and minimal API endpoint.
Status: 🟡 In Progress
Tasks
1.1 PyTorch Hooks for Attention & Residuals
- Add forward hooks to capture attention tensors `A[L,H,T,T]`
- Capture residual norms `||x_l||` per layer
- Capture logits, logprobs, entropy per token
- Record timing per layer (latency profiling)
- Optional: FFN activations for future SAE integration
Files: /backend/model_service.py, /backend/instrumentation.py (new)
Acceptance Criteria:
- Attention tensors stored with shape (num_layers, num_heads, seq_len, seq_len)
- Residual norms array with shape (num_layers, seq_len)
- Per-token metadata includes logprob, entropy, timing
- Per-layer hook overhead < 10ms on average
Notes:
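The hook pattern for task 1.1 can be sketched as follows. This is a minimal sketch using a toy single-head attention module; the real model's module paths (e.g. per-layer `self_attn` submodules) and head dimension are assumptions and will differ.

```python
import torch
import torch.nn as nn

# Toy stand-in for one attention layer; the real model's module names
# (e.g. model.model.layers[i].self_attn) will differ.
class TinyAttn(nn.Module):
    def forward(self, x):
        # x: (T, d) -> output (T, d), attention weights (T, T)
        scores = x @ x.T / x.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return attn @ x, attn

captured = {"attn": [], "resid_norm": []}

def attn_hook(module, inputs, output):
    out, attn = output
    captured["attn"].append(attn.detach())                    # one (T, T) map per layer
    captured["resid_norm"].append(out.detach().norm(dim=-1))  # ||x_l|| per token

layers = [TinyAttn() for _ in range(4)]
handles = [l.register_forward_hook(attn_hook) for l in layers]

x = torch.randn(8, 16)
for l in layers:
    x, _ = l(x)

for h in handles:
    h.remove()

# Stack into the tracker's target shapes: (num_layers, T, T) here;
# with H heads the real capture is (num_layers, num_heads, T, T).
A = torch.stack(captured["attn"])
norms = torch.stack(captured["resid_norm"])
```

Detaching inside the hook keeps captured tensors off the autograd graph, which matters for the latency budget above.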
1.2 Tokenizer Instrumentation
- Capture BPE/SentencePiece subword splits
- Record byte length per token
- Store token IDs and text
- Identify multi-split identifiers (≥3 subwords)
Files: /backend/tokenizer_utils.py (new)
Acceptance Criteria:
- Each token has a `bpe: [subword1, subword2, ...]` field
- Byte length calculated correctly (matches `len(token.encode('utf-8'))`)
- Multi-split identifiers flagged with `multi_split: true`
Notes:
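The per-token metadata contract above can be sketched as a small builder. Assumption: the subword splits are already produced by the real BPE/SentencePiece tokenizer and passed in here.

```python
def token_metadata(text, subwords):
    """Per-token record matching the tracker's contract.

    `subwords` would come from the real tokenizer; here it is
    passed in directly (assumption: splits already computed).
    """
    return {
        "text": text,
        "bpe": subwords,
        "byte_len": len(text.encode("utf-8")),
        "multi_split": len(subwords) >= 3,  # identifiers split into >=3 pieces
    }

meta = token_metadata("sanitize_sql_like", ["san", "itize", "_sql", "_like"])
# meta["multi_split"] is True; meta["byte_len"] == 17
```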
1.3 Zarr/Memmap Storage Layer
- Implement zarr writer with chunking strategy `(layer, head)`
- Create directory structure: `runs/{run_id}/tensors/`
- Store attention, residuals, logits as zarr arrays
- Implement lazy loading for frontend access
- Implement lazy loading for frontend access
Files: /backend/storage.py (new), /backend/zarr_utils.py (new)
Acceptance Criteria:
- Zarr arrays created with correct chunking
- File size reasonable (< 500MB for 512 token generation with 32 layers)
- Load time < 50ms for single layer/head slice
- Compression ratio > 3x (use Blosc)
Notes:
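The storage layout can be sketched with the memmap half of this task (the zarr path is analogous: chunking by `(layer, head)` corresponds to `chunks=(1, 1, T, T)` with a Blosc compressor). Shapes, file names, and the run directory here are illustrative assumptions.

```python
import numpy as np
import os
import tempfile

L, H, T = 4, 8, 16
run_dir = os.path.join(tempfile.mkdtemp(), "R2025-11-01-0000-demo", "tensors")
os.makedirs(run_dir)
path = os.path.join(run_dir, "attention.f4")

# Writer: fill one (T, T) map at a time, never the full tensor in RAM.
attn = np.memmap(path, dtype="float32", mode="w+", shape=(L, H, T, T))
for l in range(L):
    for h in range(H):
        attn[l, h] = np.random.rand(T, T).astype("float32")
attn.flush()
del attn

# Lazy reader: slicing a memmap touches only the bytes for that (layer, head),
# which is what makes the < 50ms single-slice load target plausible.
ro = np.memmap(path, dtype="float32", mode="r", shape=(L, H, T, T))
head_slice = np.asarray(ro[2, 5])
```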
1.4 Minimal API Endpoint /analyze/study
- Create POST endpoint accepting prompt + generation params
- Generate Run ID (format: `R{date}-{time}-{hash}`)
- Implement deterministic generation (fixed seed)
- Return minimal data contract JSON
- Store telemetry (JSONL format)
Files: /backend/model_service.py
API Contract:
POST /analyze/study

```json
{
  "prompt": "def factorial(n):",
  "max_tokens": 50,
  "seed": 42,
  "temperature": 0.0,
  "instrumentation": ["attention", "residuals", "tokenizer"]
}
```

Response:

```json
{
  "run_id": "R2025-11-01-1430-a7f3",
  "tokens": [...],              // minimal data contract
  "tensor_path": "runs/R2025-11-01-1430-a7f3/tensors/",
  "telemetry_path": "runs/R2025-11-01-1430-a7f3/telemetry.jsonl"
}
```
Acceptance Criteria:
- Endpoint returns in < 5s for 50-token generation
- Run ID is unique and reproducible with same seed
- Telemetry JSONL created with `run.start` and `run.end` events
- Tensors stored in zarr format
Notes:
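Run ID generation per the `R{date}-{time}-{hash}` format can be sketched as below. Hashing `(prompt, seed)` makes the suffix reproducible for identical requests, while the timestamp keeps distinct runs unique; the exact hash inputs are an assumption, not a fixed contract.

```python
import hashlib
from datetime import datetime

def make_run_id(prompt, seed, now=None):
    """Run ID in R{date}-{time}-{hash} format.

    Same (prompt, seed) -> same 4-char hash suffix; the timestamp
    distinguishes repeated runs of the same request.
    """
    now = now or datetime.now()
    digest = hashlib.sha256(f"{prompt}|{seed}".encode()).hexdigest()[:4]
    return f"R{now:%Y-%m-%d}-{now:%H%M}-{digest}"

rid = make_run_id("def factorial(n):", 42, datetime(2025, 11, 1, 14, 30))
# rid starts with "R2025-11-01-1430-" followed by 4 hex chars
```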
1.5 Attention Rollout & Head Ranking
- Implement attention rollout algorithm (Abnar & Zuidema-style)
- Rank heads by rollout contribution (top-k = 20)
- Store head rankings in Run ID metadata
Files: /backend/attention_analysis.py (new)
Acceptance Criteria:
- Rollout matrix computed efficiently (< 100ms for 512 tokens)
- Top-20 heads identified by max rollout weight
- Rankings stored in `runs/{run_id}/metadata.json`
Notes:
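The rollout computation can be sketched as below, following the standard recipe: average over heads, mix in the residual connection as `0.5*I`, re-normalize rows, and multiply layer matrices bottom to top. The head-ranking proxy at the end (max attention weight per head) is an assumption; the tracker leaves the exact ranking score open.

```python
import numpy as np

def attention_rollout(A):
    """Propagate attention through layers; A: (num_layers, num_heads, T, T)."""
    L, H, T, _ = A.shape
    rollout = np.eye(T)
    for l in range(L):
        layer = A[l].mean(axis=0)              # head-averaged (T, T)
        layer = 0.5 * layer + 0.5 * np.eye(T)  # account for the residual stream
        layer = layer / layer.sum(axis=-1, keepdims=True)
        rollout = layer @ rollout
    return rollout

A = np.random.rand(4, 8, 16, 16)
A = A / A.sum(axis=-1, keepdims=True)   # make each head row-stochastic
R = attention_rollout(A)

# Toy top-k head ranking (proxy score: max attention weight per head).
head_scores = A.max(axis=(2, 3))                       # (L, H)
flat = np.argsort(head_scores, axis=None)[::-1][:20]
top_heads = [divmod(int(i), head_scores.shape[1]) for i in flat]  # (layer, head)
```

Since each layer matrix is row-stochastic, rows of `R` stay probability distributions over source tokens.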
Week 1-2 Acceptance Criteria (Overall)
- All 5 tasks completed
- Latency < 250ms for ≤512 tokens (measured end-to-end)
- Zarr storage working correctly (can reload tensors)
- API endpoint functional (manual test via curl/Postman)
- Run ID reproducibility verified (same seed → same output)
Blockers
- None yet
Decisions Made
- 2025-11-01: Using zarr instead of HDF5 for better chunking and parallel access.
Week 3: Attention Visualization
Goal: Build interactive attention heatmap, head grid, and rollout toggle.
Status: 🔴 Not Started
Tasks
3.1 Frontend: Attention Heatmap (WebGL)
- Create
/components/study/AttentionVisualization.tsx - Implement WebGL-based heatmap for performance
- Add hover tooltips showing exact attention weights
- Support aggregated (all heads) and per-head views
Files: /components/study/AttentionVisualization.tsx
Acceptance Criteria:
- Renders 512x512 heatmap in < 100ms
- Hover shows source token, target token, weight
- Toggle between aggregated and per-head
Notes:
3.2 Frontend: Head Grid (Layer × Head Matrix)
- Display Layer × Head grid with mini-sparklines
- Show mean attention to token classes (identifiers, operators, etc.)
- Click head → overlay on main heatmap
Files: /components/study/HeadGrid.tsx
Acceptance Criteria:
- Grid renders 32×32 cells in < 50ms
- Sparklines show attention distribution
- Click interaction works smoothly
Notes:
3.3 Attention Rollout Toggle
- Add toggle button: Raw Attention vs Rollout
- Fetch rollout data from backend
- Update heatmap dynamically
Files: /components/study/AttentionVisualization.tsx
Acceptance Criteria:
- Toggle switches view in < 100ms
- Rollout data fetched lazily (not on initial load)
Notes:
3.4 Interactions: Brush & Pin
- Implement brush selection on context tokens
- Highlight downstream tokens impacted by selection
- Add "pin" button to save source→target pair for ablation
Files: /components/study/AttentionVisualization.tsx
Acceptance Criteria:
- Brush selection responsive (< 50ms)
- Pinned pairs visible in sidebar
- Pin data passed to Ablation pane
Notes:
3.5 Disclaimer & Warnings
- Add text: "Attention is descriptive; causal claims require ablation"
- Warn if temperature > 1.2 or top-k sampling active
Files: /components/study/AttentionVisualization.tsx
Acceptance Criteria:
- Disclaimer visible at top of pane
- Warnings shown contextually
Notes:
Week 3 Acceptance Criteria (Overall)
- Attention visualization fully functional
- Interactive latency < 150ms for all operations
- Cross-links to Ablation pane working
- Manual test with Code Llama 7B (50-token generation)
Blockers
Decisions Made
Week 4: Token Size & Confidence Visualization
Goal: Build token chip bar, entropy sparkline, and risk hotspot flags.
Status: 🔴 Not Started
Tasks
4.1 Frontend: Token Chip Bar
- Create
/components/study/TokenConfidenceView.tsx - Render tokens as chips: width = byte length, opacity = confidence
- Add click handler to show tokenization + top-k alternatives
Files: /components/study/TokenConfidenceView.tsx
Acceptance Criteria:
- Chips render correctly with variable widths
- Opacity maps to confidence (1 - entropy or exp(logprob))
- Click shows detailed panel
Notes:
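The opacity mapping can be sketched with both options named above. Assumption: entropy is normalized by `log(k)` to keep the value in [0, 1], since raw nats can exceed 1.

```python
import math

def token_confidence(topk_probs, mode="logprob"):
    """Map a token's top-k distribution to a [0, 1] opacity.

    "logprob": exp(logprob) of the sampled token, i.e. its probability
    (assumed to be first in the list). "entropy": 1 - normalized entropy.
    """
    if mode == "logprob":
        return topk_probs[0]
    entropy = -sum(p * math.log(p) for p in topk_probs if p > 0)
    return 1.0 - entropy / math.log(len(topk_probs))

confident = token_confidence([0.9, 0.05, 0.05])            # high opacity
uncertain = token_confidence([0.34, 0.33, 0.33], mode="entropy")  # near zero
```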
4.2 Frontend: Entropy Sparkline
- Add sparkline above/below token bar showing entropy per token
- Highlight peaks (entropy ≥ τ_H, initially 1.5 nats)
- Add calibration toggle (show thresholds for keywords/identifiers/operators)
Files: /components/study/TokenConfidenceView.tsx
Acceptance Criteria:
- Sparkline renders in < 50ms
- Peaks clearly visible
- Threshold adjustable via slider
Notes:
4.3 Risk Hotspot Flags
- Flag identifiers split into ≥3 subwords AND entropy peak
- Display flag icon on token chips
- Compute Bug-risk AUC (requires ground truth bug locations)
Files: /components/study/TokenConfidenceView.tsx, /backend/risk_analysis.py (new)
Acceptance Criteria:
- Flags appear on relevant tokens
- AUC metric computed (requires pilot data)
Notes:
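The hotspot predicate combines both signals above (token dict fields follow the tracker's contract; the sample tokens are made up for illustration):

```python
TAU_H = 1.5  # entropy threshold in nats (initial value from the tracker)

def is_risk_hotspot(token):
    """Flag: identifier split into >=3 subwords AND an entropy peak."""
    return len(token["bpe"]) >= 3 and token["entropy"] >= TAU_H

flags = [is_risk_hotspot(t) for t in [
    {"bpe": ["san", "itize", "_sql"], "entropy": 2.1},  # split + peak -> flag
    {"bpe": ["def"], "entropy": 2.1},                   # peak only -> no flag
    {"bpe": ["rev", "erse", "_str"], "entropy": 0.3},   # split only -> no flag
]]
# flags == [True, False, False]
```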
4.4 Top-k Alternatives Panel
- Show top-k alternatives with probabilities on token click
- Display attention snippet (which context tokens justified each alternative)
Files: /components/study/TokenConfidenceView.tsx
Acceptance Criteria:
- Panel shows top-3 alternatives minimum
- Attention snippet links to Attention visualization
Notes:
4.5 Cost/Latency Estimator
- Add widget showing cumulative decoding time
- Estimate API cost (tokens × price per token)
Files: /components/study/TokenConfidenceView.tsx
Acceptance Criteria:
- Time displayed in ms
- Cost displayed in USD (or N/A for local)
Notes:
Week 4 Acceptance Criteria (Overall)
- Token Size & Confidence view functional
- Risk hotspots flagged correctly
- Interactive latency < 150ms
- Manual test with Code Llama 7B
Blockers
Decisions Made
Week 5: Ablation Visualization
Goal: Build interactive ablation controls with head toggles, layer bypass, and diff viewer.
Status: 🔴 Not Started
Tasks
5.1 Backend: Ablation Engine
- Implement head masking (zero out or uniform attention)
- Implement layer bypass (skip layer, pass residual through)
- Support token constraints (force/ban specific tokens)
- Add surrogate regressor for predicted Δlog-prob
Files: /backend/ablation_engine.py (new)
Acceptance Criteria:
- Ablation runs in < 3s for single head mask
- Surrogate predictor accuracy > 70% (train on 100 samples)
- Queue system for background ablation execution
Notes:
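The two masking modes can be sketched directly on a captured attention tensor. This only edits the weights; how the real engine re-runs the forward pass with the masked head is out of scope for the sketch.

```python
import numpy as np

def mask_head(attn, layer, head, mode="zero"):
    """Ablate one head in a copy of A (L, H, T, T).

    "zero" removes the head's mixing entirely; "uniform" replaces it
    with a flat distribution over source tokens.
    """
    out = attn.copy()
    T = attn.shape[-1]
    out[layer, head] = 0.0 if mode == "zero" else 1.0 / T
    return out

A = np.random.rand(4, 8, 16, 16)
A = A / A.sum(axis=-1, keepdims=True)
ablated = mask_head(A, layer=2, head=5, mode="uniform")
# ablated[2, 5] is uniform; all other heads are untouched
```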
5.2 Frontend: Head Toggle Matrix
- Create
/components/study/AblationView.tsx - Display Layer × Head matrix with checkboxes
- Show only top-20 heads (from Week 1-2 ranking)
Files: /components/study/AblationView.tsx
Acceptance Criteria:
- Matrix renders in < 50ms
- Checkboxes responsive
- Selected heads highlighted
Notes:
5.3 Frontend: Diff Viewer
- Show unified diff between baseline and ablated output
- Highlight changed tokens (color-coded: added/removed/modified)
- Display code-aware metrics (tests passed, AST parse, lints)
Files: /components/study/AblationView.tsx
Acceptance Criteria:
- Diff renders clearly
- Metrics displayed prominently
- Color-coding accessible (colorblind-friendly)
Notes:
5.4 Frontend: Per-Token Delta Heat
- Show Δlog-prob and Δentropy per token
- Display as small multiples for most-impactful heads
Files: /components/study/AblationView.tsx
Acceptance Criteria:
- Delta heat visible
- Most-impactful heads identified (Δlog-prob ≥ τ_Δ)
Notes:
5.5 Integration with Attention View
- Accept pinned source→target pairs from Attention view
- Auto-suggest heads to ablate based on attention weights
Files: /components/study/AblationView.tsx
Acceptance Criteria:
- Pinned pairs appear in Ablation pane
- Suggested heads shown with explanation
Notes:
Week 5 Acceptance Criteria (Overall)
- Ablation view functional
- Head masking works correctly (verified with manual test)
- Diff viewer shows meaningful changes
- Code-aware metrics computed (AST, tests, lints)
Blockers
Decisions Made
Week 6: Pipeline Visualization
Goal: Build swimlane timeline with residual-z, entropy shift, and layer signals.
Status: 🔴 Not Started
Tasks
6.1 Backend: Layer-Level Signals
- Compute residual-norm z-scores
- Compute entropy shift (pre vs post-layer)
- Compute attention-flow saturation
- Optional: router load for MoE models
Files: /backend/pipeline_analysis.py (new)
Acceptance Criteria:
- Signals computed in < 50ms
- Residual-z outliers flagged (> 2σ)
- Entropy shifts tracked per layer
Notes:
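The first two signals can be sketched as below. Assumption on input layout: `resid_norms` is `(num_layers, T)` and `entropies` holds mean entropy before each layer plus after the last one, so `np.diff` gives one shift per layer.

```python
import numpy as np

def layer_signals(resid_norms, entropies):
    """Residual-norm z-scores and per-layer entropy shift."""
    mu, sigma = resid_norms.mean(), resid_norms.std()
    z = (resid_norms - mu) / sigma     # flag outliers where |z| > 2
    shift = np.diff(entropies)         # entropy change across each layer
    return z, shift

norms = np.abs(np.random.randn(4, 16)) + 1.0
ent = np.array([2.0, 1.8, 1.1, 1.0, 0.4])
z, shift = layer_signals(norms, ent)
outliers = np.argwhere(np.abs(z) > 2.0)   # (layer, token) positions to flag
```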
6.2 Frontend: Swimlane Timeline
- Create
/components/study/PipelineView.tsx - Display lanes: Tokenizer → Embeddings → Layers → Logits → Sampler → Tests
- Rectangle length = time per stage
- Color intensity = uncertainty (entropy)
Files: /components/study/PipelineView.tsx
Acceptance Criteria:
- Swimlane renders in < 100ms
- Hover shows per-stage stats
- Timeline scrubber works smoothly
Notes:
6.3 Layer Signal Overlays
- Add overlays for residual-z, entropy shift, attention saturation
- Toggle visibility of each signal
- Highlight bottlenecks (top-q percentile of latency/residual-z)
Files: /components/study/PipelineView.tsx
Acceptance Criteria:
- Overlays don't clutter visualization
- Bottlenecks clearly marked
- Toggle responsive
Notes:
6.4 Layer Bypass Interaction
- Add controls to bypass ≤2 layers
- Show predicted impact (via surrogate)
- Execute queued ablation
Files: /components/study/PipelineView.tsx
Acceptance Criteria:
- Bypass controls accessible
- Predicted impact shown before execution
- Ablation queued in background
Notes:
6.5 Cross-Links to Other Views
- Click token → highlight in Attention and Token Confidence views
- Integrated telemetry (track hover/click events)
Files: /components/study/PipelineView.tsx
Acceptance Criteria:
- Cross-highlighting works
- Telemetry logged
Notes:
Week 6 Acceptance Criteria (Overall)
- Pipeline view functional
- Layer signals computed correctly
- Interactive latency < 150ms
- Manual test with Code Llama 7B
Blockers
Decisions Made
Week 7: Pilot Study (n=3)
Goal: Run pilot with 3 participants; tune thresholds; validate latency; gather feedback.
Status: 🔴 Not Started
Tasks
7.1 Recruit Pilot Participants
- Identify 3 software engineers (varied experience levels)
- Schedule 90-minute sessions
Acceptance Criteria:
- 3 participants confirmed
- Availability scheduled
Notes:
7.2 Prepare Study Materials
- Task T1: Code completion (sanitize_sql_like)
- Task T2: Bug fix (reverse_string)
- Pre-survey (demographics, LLM familiarity)
- Post-task mini-survey (SCS, Trust, NASA-TLX)
- Interview questions
Files: /docs/pilot-study-materials.md (new)
Acceptance Criteria:
- Materials ready to distribute
- Survey forms created (Google Forms or similar)
Notes:
7.3 Run Pilot Sessions
- Session 1: Participant P01
- Session 2: Participant P02
- Session 3: Participant P03
Acceptance Criteria:
- All 3 sessions completed
- Telemetry logged
- Surveys completed
Notes:
7.4 Analyze Pilot Data & Tune Thresholds
- Compute latency statistics (mean, p95)
- Tune τ_H (entropy threshold) for ~90% specificity
- Tune τ_Δ (log-prob delta) for ablation sensitivity
- Tune τ_z (residual-norm outlier)
Files: /docs/pilot-analysis.md (new)
Acceptance Criteria:
- Thresholds tuned based on pilot data
- Latency < 250ms (if not, optimize)
- Survey completion rate ≥ 90%
Notes:
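The τ_H tuning step can be sketched as a percentile over clean (non-bug) tokens: specificity is the fraction of clean tokens NOT flagged, so ~90% specificity puts τ_H at the 90th percentile of their entropy. The gamma-distributed stand-in data is an assumption; the real input is pilot telemetry plus ground-truth bug labels.

```python
import numpy as np

def tune_tau_h(entropy_clean, specificity=0.90):
    """Pick tau_H so ~90% of non-bug tokens fall below it."""
    return float(np.percentile(entropy_clean, specificity * 100))

clean = np.random.default_rng(0).gamma(2.0, 0.5, size=1000)  # stand-in pilot data
tau_h = tune_tau_h(clean)
flagged = (clean >= tau_h).mean()   # ~0.10 of clean tokens flagged
```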
7.5 Iterate on UX
- Add tooltips/warnings based on pilot feedback
- Fix any UX issues (confusing interactions, unclear labels)
- Update documentation
Acceptance Criteria:
- At least 2 UX improvements implemented
- Pilot participants' feedback documented
Notes:
Week 7 Acceptance Criteria (Overall)
- Pilot study completed successfully
- Thresholds tuned
- Latency validated (< 250ms)
- UX improvements identified and implemented
Blockers
Decisions Made
Week 8: Main Study Preparation
Goal: Finalize study tooling, prepare OSF pre-registration, and set up participant recruitment.
Status: 🔴 Not Started
Tasks
8.1 Survey Integration
- Integrate SUS, NASA-TLX, SCS scales into dashboard
- Add pre-survey and post-task mini-surveys
- Export survey data to CSV
Files: /components/study/SurveyModal.tsx (new)
Acceptance Criteria:
- Surveys embedded in dashboard
- Data exported correctly
Notes:
8.2 Latin Square Counterbalancing
- Implement Latin square assignment for task order
- Randomize condition order (Baseline vs Dashboard)
Files: /lib/study-randomization.ts (new)
Acceptance Criteria:
- Counterbalancing correct (verified manually)
- Participant assigned random ID (P01-P24)
Notes:
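A cyclic Latin square covers the counterbalancing requirement: each ordering index appears exactly once per row and per column, so task/condition order is balanced across participants (rows = participants mod n). Mapping of indices to concrete task×condition orderings is left to the study script.

```python
def latin_square(n):
    """n x n cyclic Latin square: row i is 0..n-1 rotated by i."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]

# 2 tasks x 2 conditions -> 4 orderings; participant p gets row p % 4.
square = latin_square(4)
```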
8.3 OSF Pre-Registration
- Complete OSF template (Appendix D from spec)
- Upload task stimuli, exclusion criteria
- Submit pre-registration
Files: /docs/osf-preregistration.md (copy of Appendix D)
Acceptance Criteria:
- Pre-registration submitted before main study
- DOI obtained
Notes:
8.4 Export Artifact Bundle
- Create script to package Run ID, tensors, telemetry
- Generate
run_pack_P01.zipfor each participant - Test import into OSF
Files: /scripts/export_artifact.py (new)
Acceptance Criteria:
- Export script functional
- Bundle includes all necessary files
- Bundle < 100MB per participant
Notes:
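The packaging script can be sketched with the stdlib `zipfile` module. The directory layout follows the tracker (`runs/{run_id}/...`); the file names inside the toy run are illustrative assumptions.

```python
import json
import os
import tempfile
import zipfile

def export_run_pack(run_dir, pid, out_dir):
    """Zip one participant's run directory into run_pack_{pid}.zip."""
    bundle = os.path.join(out_dir, f"run_pack_{pid}.zip")
    with zipfile.ZipFile(bundle, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(run_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the run root for clean import.
                zf.write(full, arcname=os.path.relpath(full, run_dir))
    return bundle

# Build a toy run directory and pack it.
tmp = tempfile.mkdtemp()
run = os.path.join(tmp, "R2025-11-01-1430-a7f3")
os.makedirs(os.path.join(run, "tensors"))
with open(os.path.join(run, "telemetry.jsonl"), "w") as f:
    f.write(json.dumps({"event": "run.start"}) + "\n")
with open(os.path.join(run, "tensors", "attention.f4"), "wb") as f:
    f.write(b"\x00" * 64)

pack = export_run_pack(run, "P01", tmp)
names = sorted(zipfile.ZipFile(pack).namelist())
```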
8.5 Participant Recruitment
- Prepare recruitment email
- Post to developer communities (Reddit, HackerNews, university mailing lists)
- Target n=18-24 participants
Acceptance Criteria:
- Recruitment materials ready
- At least 10 participants confirmed
Notes:
Week 8 Acceptance Criteria (Overall)
- Study tooling finalized
- OSF pre-registration submitted
- Participant recruitment underway
- Ready to begin main study (Week 9-10)
Blockers
Decisions Made
Progress Summary
| Week | Status | Completion Date | Notes |
|---|---|---|---|
| Week 1-2: Instrumentation | 🟡 In Progress | - | Started 2025-11-01 |
| Week 3: Attention Viz | 🔴 Not Started | - | - |
| Week 4: Token Confidence Viz | 🔴 Not Started | - | - |
| Week 5: Ablation Viz | 🔴 Not Started | - | - |
| Week 6: Pipeline Viz | 🔴 Not Started | - | - |
| Week 7: Pilot Study | 🔴 Not Started | - | - |
| Week 8: Main Study Prep | 🔴 Not Started | - | - |
Legend:
- 🟢 Completed
- 🟡 In Progress
- 🔴 Not Started
- 🔵 Blocked
Global Blockers
None currently
Key Metrics (Target vs Actual)
| Metric | Target | Actual | Status |
|---|---|---|---|
| Initial render latency (≤512 tokens) | < 250ms | - | - |
| Interactive update latency | < 150ms | - | - |
| Zarr file size (512 tokens, 32 layers) | < 500MB | - | - |
| Zarr load time (single layer/head) | < 50ms | - | - |
| Attention rollout computation | < 100ms | - | - |
| Ablation execution time | < 3s | - | - |
Notes & Decisions Log
2025-11-01
- Decision: Using zarr instead of HDF5 for tensor storage due to better chunking and parallel access.
- Decision: Targeting top-k=20 heads for ablation UI (performance constraint).
- Note: Started Week 1-2 instrumentation tasks.
End of Implementation Tracker