# Implementation Tracker: Glass-Box Dashboard
**Project:** PhD Study - Making Architecture Transparent for Code Generation
**Timeline:** 8 weeks (November 2025 - December 2025)
**Status:** Week 1 - In Progress
**Last Updated:** 2025-11-01
---
## Overview
This document tracks progress through the 8-week implementation plan outlined in the PhD Study Specification. Each week has specific deliverables, acceptance criteria, and links to relevant code/files.
---
## Week 1-2: Core Model Instrumentation
**Goal:** Implement PyTorch hooks, tokenizer instrumentation, zarr storage, and minimal API endpoint.
**Status:** 🟡 In Progress
### Tasks
#### 1.1 PyTorch Hooks for Attention & Residuals
- [ ] Add forward hooks to capture attention tensors `A[L,H,T,T]`
- [ ] Capture residual norms `||x_l||` per layer
- [ ] Capture logits, logprobs, entropy per token
- [ ] Record timing per layer (latency profiling)
- [ ] Optional: FFN activations for future SAE integration
**Files:** `/backend/model_service.py`, `/backend/instrumentation.py` (new)
**Acceptance Criteria:**
- Attention tensors stored with shape (num_layers, num_heads, seq_len, seq_len)
- Residual norms array with shape (num_layers, seq_len)
- Per-token metadata includes logprob, entropy, timing
- Per-layer hook overhead < 10ms on average
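A minimal sketch of the hook side, assuming a Hugging Face LLaMA-style decoder run with `output_attentions=True`; the module paths (`model.model.layers[i].self_attn`) are assumptions, not something the spec pins down. Per-token logprobs/entropy come from the final logits rather than hooks, so they are omitted here.

```python
# Sketch: forward hooks capturing attention weights and residual norms.
# Assumes a LLaMA-style layout and that attention weights appear in the
# attention module's output tuple when output_attentions=True.
class InstrumentationRecorder:
    def __init__(self):
        self.attentions = []      # per layer: (num_heads, T, T)
        self.residual_norms = []  # per layer: (T,)
        self._handles = []

    def attach(self, model):
        for layer in model.model.layers:  # module path is an assumption
            self._handles.append(
                layer.self_attn.register_forward_hook(self._attn_hook))
            self._handles.append(
                layer.register_forward_hook(self._residual_hook))

    def _attn_hook(self, module, inputs, output):
        attn = output[1]  # (B, H, T, T) when output_attentions=True
        if attn is not None:
            self.attentions.append(attn.detach().squeeze(0).cpu())

    def _residual_hook(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # ||x_l|| per token: L2 norm over the hidden dimension
        self.residual_norms.append(
            hidden.detach().norm(dim=-1).squeeze(0).cpu())

    def detach(self):
        for h in self._handles:
            h.remove()
        self._handles.clear()
```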
**Notes:**
---
#### 1.2 Tokenizer Instrumentation
- [ ] Capture BPE/SentencePiece subword splits
- [ ] Record byte length per token
- [ ] Store token IDs and text
- [ ] Identify multi-split identifiers (≥3 subwords)
**Files:** `/backend/tokenizer_utils.py` (new)
**Acceptance Criteria:**
- Each token has `bpe: [subword1, subword2, ...]` field
- Byte length calculated correctly (matches `len(token.encode('utf-8'))`)
- Multi-split identifiers flagged with `multi_split: true`
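A sketch of the per-token records and the multi-split flag, assuming a Hugging Face tokenizer; byte lengths are computed on the decoded piece, which is an approximation for byte-level BPE marker characters.

```python
import re

def token_records(tokenizer, code: str):
    """One record per token: id, text, and UTF-8 byte length."""
    ids = tokenizer(code, add_special_tokens=False)["input_ids"]
    return [
        {"id": i,
         "text": tokenizer.decode([i]),
         "byte_len": len(tokenizer.decode([i]).encode("utf-8"))}
        for i in ids
    ]

def identifier_splits(tokenizer, code: str):
    """Per-identifier BPE splits; multi_split flags >= 3 subwords."""
    out = []
    for m in re.finditer(r"[A-Za-z_]\w*", code):
        sub_ids = tokenizer(m.group(), add_special_tokens=False)["input_ids"]
        out.append({
            "identifier": m.group(),
            "span": [m.start(), m.end()],
            "bpe": [tokenizer.decode([i]) for i in sub_ids],
            "multi_split": len(sub_ids) >= 3,
        })
    return out
```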
**Notes:**
---
#### 1.3 Zarr/Memmap Storage Layer
- [ ] Implement zarr writer with chunking strategy `(layer, head)`
- [ ] Create directory structure: `runs/{run_id}/tensors/`
- [ ] Store attention, residuals, logits as zarr arrays
- [ ] Implement lazy loading for frontend access
**Files:** `/backend/storage.py` (new), `/backend/zarr_utils.py` (new)
**Acceptance Criteria:**
- Zarr arrays created with correct chunking
- File size reasonable (< 500MB for 512 token generation with 32 layers)
- Load time < 50ms for single layer/head slice
- Compression ratio > 3x (use Blosc)
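A sketch of the writer, assuming the zarr v2 API plus numcodecs for Blosc; array names and the directory layout follow the structure above.

```python
import zarr
from numcodecs import Blosc

def write_run_tensors(run_id, attention, residual_norms, logits):
    """attention: (L, H, T, T); residual_norms: (L, T); logits: (T, V)."""
    store = zarr.DirectoryStore(f"runs/{run_id}/tensors")
    root = zarr.group(store=store, overwrite=True)
    compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.BITSHUFFLE)
    T = attention.shape[-1]
    root.create_dataset("attention", data=attention,
                        chunks=(1, 1, T, T),  # one chunk per (layer, head)
                        compressor=compressor)
    root.create_dataset("residual_norms", data=residual_norms,
                        chunks=(1, T), compressor=compressor)
    root.create_dataset("logits", data=logits,
                        chunks=(T, logits.shape[-1]), compressor=compressor)
```

Lazy loading then falls out of the chunking: `zarr.open_group(f"runs/{run_id}/tensors")["attention"][layer, head]` reads a single chunk from disk.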
**Notes:**
---
#### 1.4 Minimal API Endpoint `/analyze/study`
- [ ] Create POST endpoint accepting prompt + generation params
- [ ] Generate Run ID (format: `R{date}-{time}-{hash}`)
- [ ] Implement deterministic generation (fixed seed)
- [ ] Return minimal data contract JSON
- [ ] Store telemetry (JSONL format)
**Files:** `/backend/model_service.py`
**API Contract:**
```json
POST /analyze/study

Request:
{
  "prompt": "def factorial(n):",
  "max_tokens": 50,
  "seed": 42,
  "temperature": 0.0,
  "instrumentation": ["attention", "residuals", "tokenizer"]
}

Response:
{
  "run_id": "R2025-11-01-1430-a7f3",
  "tokens": [...],  // minimal data contract
  "tensor_path": "runs/R2025-11-01-1430-a7f3/tensors/",
  "telemetry_path": "runs/R2025-11-01-1430-a7f3/telemetry.jsonl"
}
```
**Acceptance Criteria:**
- Endpoint returns in < 5s for 50-token generation
- Run ID is unique and reproducible with same seed
- Telemetry JSONL created with `run.start` and `run.end` events
- Tensors stored in zarr format
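A sketch of the endpoint, assuming FastAPI (the spec does not name a framework). Generation is elided; the run-ID scheme hashes the request so the suffix is reproducible for identical inputs while the timestamp keeps IDs unique.

```python
import hashlib, json, os, time
from datetime import datetime
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StudyRequest(BaseModel):
    prompt: str
    max_tokens: int = 50
    seed: int = 42
    temperature: float = 0.0
    instrumentation: list[str] = ["attention", "residuals", "tokenizer"]

def make_run_id(req: StudyRequest) -> str:
    # Hash of the generation parameters -> reproducible 4-char suffix
    key = f"{req.prompt}|{req.max_tokens}|{req.seed}|{req.temperature}"
    digest = hashlib.sha1(key.encode()).hexdigest()[:4]
    return datetime.now().strftime("R%Y-%m-%d-%H%M") + f"-{digest}"

def log_event(path: str, event: str, **fields):
    with open(path, "a") as f:
        f.write(json.dumps({"event": event, "ts": time.time(), **fields}) + "\n")

@app.post("/analyze/study")
def analyze_study(req: StudyRequest):
    run_id = make_run_id(req)
    os.makedirs(f"runs/{run_id}/tensors", exist_ok=True)
    telemetry = f"runs/{run_id}/telemetry.jsonl"
    log_event(telemetry, "run.start", run_id=run_id)
    tokens = []  # deterministic generation + instrumentation happen here
    log_event(telemetry, "run.end", run_id=run_id)
    return {"run_id": run_id, "tokens": tokens,
            "tensor_path": f"runs/{run_id}/tensors/",
            "telemetry_path": telemetry}
```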
**Notes:**
---
#### 1.5 Attention Rollout & Head Ranking
- [ ] Implement attention rollout algorithm (Abnar & Zuidema style)
- [ ] Rank heads by rollout contribution (top-k = 20)
- [ ] Store head rankings in Run ID metadata
**Files:** `/backend/attention_analysis.py` (new)
**Acceptance Criteria:**
- Rollout matrix computed efficiently (< 100ms for 512 tokens)
- Top-20 heads identified by max rollout weight
- Rankings stored in `runs/{run_id}/metadata.json`
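A sketch of the rollout computation plus one simple head-ranking proxy; "rollout contribution" could equally be computed leave-one-head-out, so treat the scoring choice as an assumption.

```python
import numpy as np

def attention_rollout(attn):
    """attn: (L, H, T, T) -> (T, T) rollout; [i, j] = flow from j to i."""
    T = attn.shape[-1]
    rollout = np.eye(T)
    for layer in attn:
        a = layer.mean(axis=0)                 # average over heads
        a = a + np.eye(T)                      # add the residual connection
        a = a / a.sum(axis=-1, keepdims=True)  # renormalize rows
        rollout = a @ rollout                  # compose with earlier layers
    return rollout

def top_heads(attn, k=20):
    """Rank (layer, head) pairs by each head's max attention weight."""
    scores = attn.max(axis=(2, 3))             # (L, H)
    order = np.argsort(scores, axis=None)[::-1][:k]
    return [tuple(divmod(int(i), attn.shape[1])) for i in order]
```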
**Notes:**
---
### Week 1-2 Acceptance Criteria (Overall)
- [ ] All 5 tasks completed
- [ ] Latency < 250ms for ≤512 tokens (measured end-to-end)
- [ ] Zarr storage working correctly (can reload tensors)
- [ ] API endpoint functional (manual test via curl/Postman)
- [ ] Run ID reproducibility verified (same seed → same output)
### Blockers
- **None yet**
### Decisions Made
- **2025-11-01:** Using zarr instead of HDF5 for better chunking and parallel access.
---
## Week 3: Attention Visualization
**Goal:** Build interactive attention heatmap, head grid, and rollout toggle.
**Status:** 🔴 Not Started
### Tasks
#### 3.1 Frontend: Attention Heatmap (WebGL)
- [ ] Create `/components/study/AttentionVisualization.tsx`
- [ ] Implement WebGL-based heatmap for performance
- [ ] Add hover tooltips showing exact attention weights
- [ ] Support aggregated (all heads) and per-head views
**Files:** `/components/study/AttentionVisualization.tsx`
**Acceptance Criteria:**
- Renders 512x512 heatmap in < 100ms
- Hover shows source token, target token, weight
- Toggle between aggregated and per-head
**Notes:**
---
#### 3.2 Frontend: Head Grid (Layer × Head Matrix)
- [ ] Display Layer × Head grid with mini-sparklines
- [ ] Show mean attention to token classes (identifiers, operators, etc.)
- [ ] Click head → overlay on main heatmap
**Files:** `/components/study/HeadGrid.tsx`
**Acceptance Criteria:**
- Grid renders 32×32 cells in < 50ms
- Sparklines show attention distribution
- Click interaction works smoothly
**Notes:**
---
#### 3.3 Attention Rollout Toggle
- [ ] Add toggle button: Raw Attention vs Rollout
- [ ] Fetch rollout data from backend
- [ ] Update heatmap dynamically
**Files:** `/components/study/AttentionVisualization.tsx`
**Acceptance Criteria:**
- Toggle switches view in < 100ms
- Rollout data fetched lazily (not on initial load)
**Notes:**
---
#### 3.4 Interactions: Brush & Pin
- [ ] Implement brush selection on context tokens
- [ ] Highlight downstream tokens impacted by selection
- [ ] Add "pin" button to save source→target pair for ablation
**Files:** `/components/study/AttentionVisualization.tsx`
**Acceptance Criteria:**
- Brush selection responsive (< 50ms)
- Pinned pairs visible in sidebar
- Pin data passed to Ablation pane
**Notes:**
---
#### 3.5 Disclaimer & Warnings
- [ ] Add text: "Attention is descriptive; causal claims require ablation"
- [ ] Warn if temperature > 1.2 or top-k sampling active
**Files:** `/components/study/AttentionVisualization.tsx`
**Acceptance Criteria:**
- Disclaimer visible at top of pane
- Warnings shown contextually
**Notes:**
---
### Week 3 Acceptance Criteria (Overall)
- [ ] Attention visualization fully functional
- [ ] Interactive latency < 150ms for all operations
- [ ] Cross-links to Ablation pane working
- [ ] Manual test with Code Llama 7B (50-token generation)
### Blockers
### Decisions Made
---
## Week 4: Token Size & Confidence Visualization
**Goal:** Build token chip bar, entropy sparkline, and risk hotspot flags.
**Status:** 🔴 Not Started
### Tasks
#### 4.1 Frontend: Token Chip Bar
- [ ] Create `/components/study/TokenConfidenceView.tsx`
- [ ] Render tokens as chips: width = byte length, opacity = confidence
- [ ] Add click handler to show tokenization + top-k alternatives
**Files:** `/components/study/TokenConfidenceView.tsx`
**Acceptance Criteria:**
- Chips render correctly with variable widths
- Opacity maps to confidence (exp(logprob), or 1 - normalized entropy)
- Click shows detailed panel
**Notes:**
---
#### 4.2 Frontend: Entropy Sparkline
- [ ] Add sparkline above/below token bar showing entropy per token
- [ ] Highlight peaks (entropy ≥ τ_H, initially 1.5 nats)
- [ ] Add calibration toggle (show thresholds for keywords/identifiers/operators)
**Files:** `/components/study/TokenConfidenceView.tsx`
**Acceptance Criteria:**
- Sparkline renders in < 50ms
- Peaks clearly visible
- Threshold adjustable via slider
**Notes:**
---
#### 4.3 Risk Hotspot Flags
- [ ] Flag identifiers split into ≥3 subwords AND entropy peak
- [ ] Display flag icon on token chips
- [ ] Compute Bug-risk AUC (requires ground truth bug locations)
**Files:** `/components/study/TokenConfidenceView.tsx`, `/backend/risk_analysis.py` (new)
**Acceptance Criteria:**
- Flags appear on relevant tokens
- AUC metric computed (requires pilot data)
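A sketch of the hotspot rule and the AUC check for `/backend/risk_analysis.py`, assuming per-token records carry `entropy` and `multi_split` fields and, for the AUC, a ground-truth `is_bug` label from the pilot data.

```python
from sklearn.metrics import roc_auc_score

TAU_H = 1.5  # initial entropy threshold in nats (tuned in Week 7)

def flag_hotspots(tokens, tau_h=TAU_H):
    for t in tokens:
        t["risk_flag"] = bool(t.get("multi_split")) and t["entropy"] >= tau_h
    return tokens

def bug_risk_auc(tokens):
    # Use entropy as the continuous risk score; the binary flag alone
    # would give a nearly degenerate ROC curve.
    y_true = [int(t["is_bug"]) for t in tokens]
    y_score = [t["entropy"] for t in tokens]
    return roc_auc_score(y_true, y_score)
```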
**Notes:**
---
#### 4.4 Top-k Alternatives Panel
- [ ] Show top-k alternatives with probabilities on token click
- [ ] Display attention snippet (which context tokens justified each alternative)
**Files:** `/components/study/TokenConfidenceView.tsx`
**Acceptance Criteria:**
- Panel shows top-3 alternatives minimum
- Attention snippet links to Attention visualization
**Notes:**
---
#### 4.5 Cost/Latency Estimator
- [ ] Add widget showing cumulative decoding time
- [ ] Estimate API cost (tokens × price per token)
**Files:** `/components/study/TokenConfidenceView.tsx`
**Acceptance Criteria:**
- Time displayed in ms
- Cost displayed in USD (or N/A for local)
**Notes:**
---
### Week 4 Acceptance Criteria (Overall)
- [ ] Token Size & Confidence view functional
- [ ] Risk hotspots flagged correctly
- [ ] Interactive latency < 150ms
- [ ] Manual test with Code Llama 7B
### Blockers
### Decisions Made
---
## Week 5: Ablation Visualization
**Goal:** Build interactive ablation controls with head toggles, layer bypass, and diff viewer.
**Status:** 🔴 Not Started
### Tasks
#### 5.1 Backend: Ablation Engine
- [ ] Implement head masking (zero out or uniform attention)
- [ ] Implement layer bypass (skip layer, pass residual through)
- [ ] Support token constraints (force/ban specific tokens)
- [ ] Add surrogate regressor for predicted Δlog-prob
**Files:** `/backend/ablation_engine.py` (new)
**Acceptance Criteria:**
- Ablation runs in < 3s for single head mask
- Surrogate predictor accuracy > 70% (train on 100 samples)
- Queue system for background ablation execution
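A sketch of zero-style head masking, done as a pre-hook on the attention output projection; it assumes a LLaMA-style layout where `o_proj`'s input concatenates per-head outputs. Uniform-attention masking would instead patch the attention weights themselves.

```python
def mask_head(model, layer: int, head: int):
    """Zero one head's contribution; returns a removable hook handle."""
    attn = model.model.layers[layer].self_attn  # module path is an assumption
    d = attn.head_dim

    def pre_hook(module, args):
        (hidden,) = args                        # (B, T, num_heads * d)
        hidden = hidden.clone()                 # don't mutate the caller's tensor
        hidden[..., head * d:(head + 1) * d] = 0.0
        return (hidden,)

    return attn.o_proj.register_forward_pre_hook(pre_hook)

# Usage: handle = mask_head(model, 12, 3); re-generate; handle.remove()
```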
**Notes:**
---
#### 5.2 Frontend: Head Toggle Matrix
- [ ] Create `/components/study/AblationView.tsx`
- [ ] Display Layer × Head matrix with checkboxes
- [ ] Show only top-20 heads (from Week 1-2 ranking)
**Files:** `/components/study/AblationView.tsx`
**Acceptance Criteria:**
- Matrix renders in < 50ms
- Checkboxes responsive
- Selected heads highlighted
**Notes:**
---
#### 5.3 Frontend: Diff Viewer
- [ ] Show unified diff between baseline and ablated output
- [ ] Highlight changed tokens (color-coded: added/removed/modified)
- [ ] Display code-aware metrics (tests passed, AST parse, lints)
**Files:** `/components/study/AblationView.tsx`
**Acceptance Criteria:**
- Diff renders clearly
- Metrics displayed prominently
- Color-coding accessible (colorblind-friendly)
**Notes:**
---
#### 5.4 Frontend: Per-Token Delta Heat
- [ ] Show Δlog-prob and Δentropy per token
- [ ] Display as small multiples for most-impactful heads
**Files:** `/components/study/AblationView.tsx`
**Acceptance Criteria:**
- Delta heat visible
- Most-impactful heads identified (Δlog-prob ≥ τ_Δ)
**Notes:**
---
#### 5.5 Integration with Attention View
- [ ] Accept pinned source→target pairs from Attention view
- [ ] Auto-suggest heads to ablate based on attention weights
**Files:** `/components/study/AblationView.tsx`
**Acceptance Criteria:**
- Pinned pairs appear in Ablation pane
- Suggested heads shown with explanation
**Notes:**
---
### Week 5 Acceptance Criteria (Overall)
- [ ] Ablation view functional
- [ ] Head masking works correctly (verified with manual test)
- [ ] Diff viewer shows meaningful changes
- [ ] Code-aware metrics computed (AST, tests, lints)
### Blockers
### Decisions Made
---
## Week 6: Pipeline Visualization
**Goal:** Build swimlane timeline with residual-z, entropy shift, and layer signals.
**Status:** 🔴 Not Started
### Tasks
#### 6.1 Backend: Layer-Level Signals
- [ ] Compute residual-norm z-scores
- [ ] Compute entropy shift (pre vs post-layer)
- [ ] Compute attention-flow saturation
- [ ] Optional: router load for MoE models
**Files:** `/backend/pipeline_analysis.py` (new)
**Acceptance Criteria:**
- Signals computed in < 50ms
- Residual-z outliers flagged (> 2σ)
- Entropy shifts tracked per layer
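A sketch of the signal computations; the entropy shift assumes per-layer token distributions from a logit-lens readout, which is an assumption the spec does not mandate.

```python
import numpy as np

def residual_z(norms):
    """norms: (L, T) -> z-scores across layers; flag |z| > 2 as outliers."""
    mu = norms.mean(axis=0, keepdims=True)
    sd = norms.std(axis=0, keepdims=True) + 1e-8
    return (norms - mu) / sd

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_shift(layer_probs):
    """layer_probs: (L, T, V) -> (L-1, T), post-minus-pre per layer."""
    return np.diff(entropy(layer_probs), axis=0)
```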
**Notes:**
---
#### 6.2 Frontend: Swimlane Timeline
- [ ] Create `/components/study/PipelineView.tsx`
- [ ] Display lanes: Tokenizer → Embeddings → Layers → Logits → Sampler → Tests
- [ ] Rectangle length = time per stage
- [ ] Color intensity = uncertainty (entropy)
**Files:** `/components/study/PipelineView.tsx`
**Acceptance Criteria:**
- Swimlane renders in < 100ms
- Hover shows per-stage stats
- Timeline scrubber works smoothly
**Notes:**
---
#### 6.3 Layer Signal Overlays
- [ ] Add overlays for residual-z, entropy shift, attention saturation
- [ ] Toggle visibility of each signal
- [ ] Highlight bottlenecks (top-q percentile of latency/residual-z)
**Files:** `/components/study/PipelineView.tsx`
**Acceptance Criteria:**
- Overlays don't clutter visualization
- Bottlenecks clearly marked
- Toggle responsive
**Notes:**
---
#### 6.4 Layer Bypass Interaction
- [ ] Add controls to bypass ≤2 layers
- [ ] Show predicted impact (via surrogate)
- [ ] Execute queued ablation
**Files:** `/components/study/PipelineView.tsx`
**Acceptance Criteria:**
- Bypass controls accessible
- Predicted impact shown before execution
- Ablation queued in background
**Notes:**
---
#### 6.5 Cross-Links to Other Views
- [ ] Click token → highlight in Attention and Token Confidence views
- [ ] Integrated telemetry (track hover/click events)
**Files:** `/components/study/PipelineView.tsx`
**Acceptance Criteria:**
- Cross-highlighting works
- Telemetry logged
**Notes:**
---
### Week 6 Acceptance Criteria (Overall)
- [ ] Pipeline view functional
- [ ] Layer signals computed correctly
- [ ] Interactive latency < 150ms
- [ ] Manual test with Code Llama 7B
### Blockers
### Decisions Made
---
## Week 7: Pilot Study (n=3)
**Goal:** Run pilot with 3 participants; tune thresholds; validate latency; gather feedback.
**Status:** 🔴 Not Started
### Tasks
#### 7.1 Recruit Pilot Participants
- [ ] Identify 3 software engineers (varied experience levels)
- [ ] Schedule 90-minute sessions
**Acceptance Criteria:**
- 3 participants confirmed
- Availability scheduled
**Notes:**
---
#### 7.2 Prepare Study Materials
- [ ] Task T1: Code completion (sanitize_sql_like)
- [ ] Task T2: Bug fix (reverse_string)
- [ ] Pre-survey (demographics, LLM familiarity)
- [ ] Post-task mini-survey (SCS, Trust, NASA-TLX)
- [ ] Interview questions
**Files:** `/docs/pilot-study-materials.md` (new)
**Acceptance Criteria:**
- Materials ready to distribute
- Survey forms created (Google Forms or similar)
**Notes:**
---
#### 7.3 Run Pilot Sessions
- [ ] Session 1: Participant P01
- [ ] Session 2: Participant P02
- [ ] Session 3: Participant P03
**Acceptance Criteria:**
- All 3 sessions completed
- Telemetry logged
- Surveys completed
**Notes:**
---
#### 7.4 Analyze Pilot Data & Tune Thresholds
- [ ] Compute latency statistics (mean, p95)
- [ ] Tune τ_H (entropy threshold) for ~90% specificity
- [ ] Tune τ_Δ (log-prob delta) for ablation sensitivity
- [ ] Tune τ_z (residual-norm outlier)
**Files:** `/docs/pilot-analysis.md` (new)
**Acceptance Criteria:**
- Thresholds tuned based on pilot data
- Latency < 250ms (if not, optimize)
- Survey completion rate ≥ 90%
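A sketch of the τ_H tuning rule, assuming the pilot yields per-token entropies with bug/no-bug labels: place the cutoff at the 90th percentile of the clean tokens, so ~90% of them fall below it and are not flagged (~90% specificity).

```python
import numpy as np

def tune_tau_h(entropies, is_bug, specificity=0.90):
    clean = np.asarray(entropies)[~np.asarray(is_bug, dtype=bool)]
    return float(np.quantile(clean, specificity))
```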
**Notes:**
---
#### 7.5 Iterate on UX
- [ ] Add tooltips/warnings based on pilot feedback
- [ ] Fix any UX issues (confusing interactions, unclear labels)
- [ ] Update documentation
**Acceptance Criteria:**
- At least 2 UX improvements implemented
- Pilot participants' feedback documented
**Notes:**
---
### Week 7 Acceptance Criteria (Overall)
- [ ] Pilot study completed successfully
- [ ] Thresholds tuned
- [ ] Latency validated (< 250ms)
- [ ] UX improvements identified and implemented
### Blockers
### Decisions Made
---
## Week 8: Main Study Preparation
**Goal:** Finalize study tooling, prepare OSF pre-registration, and set up participant recruitment.
**Status:** 🔴 Not Started
### Tasks
#### 8.1 Survey Integration
- [ ] Integrate SUS, NASA-TLX, SCS scales into dashboard
- [ ] Add pre-survey and post-task mini-surveys
- [ ] Export survey data to CSV
**Files:** `/components/study/SurveyModal.tsx` (new)
**Acceptance Criteria:**
- Surveys embedded in dashboard
- Data exported correctly
**Notes:**
---
#### 8.2 Latin Square Counterbalancing
- [ ] Implement Latin square assignment for task order
- [ ] Randomize condition order (Baseline vs Dashboard)
**Files:** `/lib/study-randomization.ts` (new)
**Acceptance Criteria:**
- Counterbalancing correct (verified manually)
- Participant assigned random ID (P01-P24)
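The assignment logic, sketched in Python for brevity (the deliverable lives in `/lib/study-randomization.ts` and would mirror this; the exact task/condition crossing is an assumption).

```python
def latin_square(n):
    # Row i is a cyclic shift of 0..n-1, so each entry appears exactly
    # once per row and once per column.
    return [[(i + j) % n for j in range(n)] for i in range(n)]

def assign(index, tasks=("T1", "T2"), conditions=("Baseline", "Dashboard")):
    order = latin_square(len(tasks))[index % len(tasks)]
    return {
        "pid": f"P{index + 1:02d}",
        "task_order": [tasks[i] for i in order],
        "first_condition": conditions[index % len(conditions)],
    }
```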
**Notes:**
---
#### 8.3 OSF Pre-Registration
- [ ] Complete OSF template (Appendix D from spec)
- [ ] Upload task stimuli, exclusion criteria
- [ ] Submit pre-registration
**Files:** `/docs/osf-preregistration.md` (copy of Appendix D)
**Acceptance Criteria:**
- Pre-registration submitted before main study
- DOI obtained
**Notes:**
---
#### 8.4 Export Artifact Bundle
- [ ] Create script to package Run ID, tensors, telemetry
- [ ] Generate `run_pack_P01.zip` for each participant
- [ ] Test import into OSF
**Files:** `/scripts/export_artifact.py` (new)
**Acceptance Criteria:**
- Export script functional
- Bundle includes all necessary files
- Bundle < 100MB per participant
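A sketch for `/scripts/export_artifact.py`; paths follow the `runs/{run_id}/` layout from Week 1-2, and the participant-to-run mapping is assumed to be handled by the caller.

```python
import zipfile
from pathlib import Path

def export_run_pack(run_id: str, participant: str, out_dir: str = "exports"):
    run_dir = Path("runs") / run_id
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pack = out / f"run_pack_{participant}.zip"
    with zipfile.ZipFile(pack, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in run_dir.rglob("*"):  # tensors/, telemetry.jsonl, metadata.json
            if path.is_file():
                zf.write(path, path.relative_to(run_dir.parent))
    return pack

# Usage: export_run_pack("R2025-11-01-1430-a7f3", "P01")
```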
**Notes:**
---
#### 8.5 Participant Recruitment
- [ ] Prepare recruitment email
- [ ] Post to developer communities (Reddit, HackerNews, university mailing lists)
- [ ] Target n=18-24 participants
**Acceptance Criteria:**
- Recruitment materials ready
- At least 10 participants confirmed
**Notes:**
---
### Week 8 Acceptance Criteria (Overall)
- [ ] Study tooling finalized
- [ ] OSF pre-registration submitted
- [ ] Participant recruitment underway
- [ ] Ready to begin main study (Week 9-10)
### Blockers
### Decisions Made
---
## Progress Summary
| Week | Status | Completion Date | Notes |
|------|--------|----------------|-------|
| Week 1-2: Instrumentation | 🟡 In Progress | - | Started 2025-11-01 |
| Week 3: Attention Viz | 🔴 Not Started | - | - |
| Week 4: Token Confidence Viz | 🔴 Not Started | - | - |
| Week 5: Ablation Viz | 🔴 Not Started | - | - |
| Week 6: Pipeline Viz | 🔴 Not Started | - | - |
| Week 7: Pilot Study | 🔴 Not Started | - | - |
| Week 8: Main Study Prep | 🔴 Not Started | - | - |
**Legend:**
- 🟢 Completed
- 🟡 In Progress
- 🔴 Not Started
- 🔵 Blocked
---
## Global Blockers
*None currently*
---
## Key Metrics (Target vs Actual)
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Initial render latency (≤512 tokens) | < 250ms | - | - |
| Interactive update latency | < 150ms | - | - |
| Zarr file size (512 tokens, 32 layers) | < 500MB | - | - |
| Zarr load time (single layer/head) | < 50ms | - | - |
| Attention rollout computation | < 100ms | - | - |
| Ablation execution time | < 3s | - | - |
---
## Notes & Decisions Log
### 2025-11-01
- **Decision:** Using zarr instead of HDF5 for tensor storage due to better chunking and parallel access.
- **Decision:** Targeting top-k=20 heads for ablation UI (performance constraint).
- **Note:** Started Week 1-2 instrumentation tasks.
---
**End of Implementation Tracker**