
Implementation Tracker: Glass-Box Dashboard

Project: PhD Study - Making Architecture Transparent for Code Generation
Timeline: 8 weeks (November 2025 - December 2025)
Status: Week 1 - In Progress
Last Updated: 2025-11-01


Overview

This document tracks progress through the 8-week implementation plan outlined in the PhD Study Specification. Each week has specific deliverables, acceptance criteria, and links to relevant code/files.


Week 1-2: Core Model Instrumentation

Goal: Implement PyTorch hooks, tokenizer instrumentation, zarr storage, and minimal API endpoint.

Status: 🟡 In Progress

Tasks

1.1 PyTorch Hooks for Attention & Residuals

  • Add forward hooks to capture attention tensors A[L,H,T,T]
  • Capture residual norms ||x_l|| per layer
  • Capture logits, logprobs, entropy per token
  • Record timing per layer (latency profiling)
  • Optional: FFN activations for future SAE integration

Files: /backend/model_service.py, /backend/instrumentation.py (new)

Acceptance Criteria:

  • Attention tensors stored with shape (num_layers, num_heads, seq_len, seq_len)
  • Residual norms array with shape (num_layers, seq_len)
  • Per-token metadata includes logprob, entropy, timing
  • Per-layer hook overhead < 10ms on average (see the sketch below)
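
A minimal sketch of the hook wiring, assuming a Llama-style HuggingFace model called with output_attentions=True; the module names model.model.layers and .self_attn are assumptions to be checked against the actual checkpoint:

import time

def attach_instrumentation(model, store: dict):
    """Register forward hooks capturing attention weights, residual norms,
    and per-layer latency into `store`. Detach with handle.remove()."""
    handles = []

    def make_pre_hook(i):
        def pre_hook(module, args):
            store.setdefault("t0", {})[i] = time.perf_counter()
        return pre_hook

    def make_attn_hook(i):
        def hook(module, args, output):
            # Assumes output_attentions=True, so output[1] holds the
            # attention weights of shape (B, H, T, T).
            if isinstance(output, tuple) and output[1] is not None:
                store.setdefault("attention", {})[i] = output[1].detach().cpu()
        return hook

    def make_block_hook(i):
        def hook(module, args, output):
            hidden = output[0] if isinstance(output, tuple) else output
            # Residual norm ||x_l|| per token position: (B, T)
            store.setdefault("resid_norm", {})[i] = hidden.detach().norm(dim=-1).cpu()
            store.setdefault("layer_ms", {})[i] = (time.perf_counter() - store["t0"][i]) * 1e3
        return hook

    for i, block in enumerate(model.model.layers):  # Llama-style names, an assumption
        handles.append(block.register_forward_pre_hook(make_pre_hook(i)))
        handles.append(block.self_attn.register_forward_hook(make_attn_hook(i)))
        handles.append(block.register_forward_hook(make_block_hook(i)))
    return handles

Per-token logprob and entropy are simpler to take directly from the generation loop's logits than from hooks.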

Notes:


1.2 Tokenizer Instrumentation

  • Capture BPE/SentencePiece subword splits
  • Record byte length per token
  • Store token IDs and text
  • Identify multi-split identifiers (≥3 subwords)

Files: /backend/tokenizer_utils.py (new)

Acceptance Criteria:

  • Each token has a bpe: [subword1, subword2, ...] field
  • Byte length calculated correctly (matches len(token.encode('utf-8')))
  • Multi-split identifiers flagged with multi_split: true (sketch below)
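
A sketch of the per-identifier record, assuming a HuggingFace tokenizer:

from transformers import AutoTokenizer

def identifier_split(tokenizer, identifier: str, threshold: int = 3) -> dict:
    """Record for one identifier: subword pieces, byte length, and the
    multi_split flag from the data contract."""
    pieces = tokenizer.tokenize(identifier)
    return {
        "text": identifier,
        "bpe": pieces,
        "ids": tokenizer.convert_tokens_to_ids(pieces),
        "bytes": len(identifier.encode("utf-8")),
        "multi_split": len(pieces) >= threshold,
    }

# tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
# identifier_split(tok, "sanitize_sql_like")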

Notes:


1.3 Zarr/Memmap Storage Layer

  • Implement zarr writer with chunking strategy (layer, head)
  • Create directory structure: runs/{run_id}/tensors/
  • Store attention, residuals, logits as zarr arrays
  • Implement lazy loading for frontend access

Files: /backend/storage.py (new), /backend/zarr_utils.py (new)

Acceptance Criteria:

  • Zarr arrays created with correct chunking
  • File size reasonable (< 500MB for a 512-token generation with 32 layers)
  • Load time < 50ms for a single layer/head slice
  • Compression ratio > 3× with Blosc (see the sketch below)
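
A sketch of the writer, assuming the zarr v2 / numcodecs API (zarr v3 changed how compressors are configured):

import zarr
from numcodecs import Blosc

def write_attention(run_id: str, attn):
    """Store attention (num_layers, num_heads, T, T) with one chunk per
    (layer, head) so a single head's map can be loaded lazily.
    `attn` is a numpy array; float16 roughly halves the footprint."""
    T = attn.shape[-1]
    z = zarr.open(
        f"runs/{run_id}/tensors/attention.zarr",
        mode="w",
        shape=attn.shape,
        chunks=(1, 1, T, T),
        dtype="f2",
        compressor=Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE),
    )
    z[:] = attn
    return z

# Lazy slice -- only the (5, 12) chunk is read and decompressed:
# head_map = zarr.open(f"runs/{run_id}/tensors/attention.zarr", mode="r")[5, 12]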

Notes:


1.4 Minimal API Endpoint /analyze/study

  • Create POST endpoint accepting prompt + generation params
  • Generate Run ID (format: R{date}-{time}-{hash})
  • Implement deterministic generation (fixed seed)
  • Return minimal data contract JSON
  • Store telemetry (JSONL format)

Files: /backend/model_service.py

API Contract:

POST /analyze/study
{
  "prompt": "def factorial(n):",
  "max_tokens": 50,
  "seed": 42,
  "temperature": 0.0,
  "instrumentation": ["attention", "residuals", "tokenizer"]
}

Response:
{
  "run_id": "R2025-11-01-1430-a7f3",
  "tokens": [...],  // minimal data contract
  "tensor_path": "runs/R2025-11-01-1430-a7f3/tensors/",
  "telemetry_path": "runs/R2025-11-01-1430-a7f3/telemetry.jsonl"
}
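
A minimal endpoint sketch matching the contract above, assuming FastAPI and pydantic v2; run_generation is a hypothetical placeholder for the instrumented generation loop:

import hashlib
from datetime import datetime, timezone

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StudyRequest(BaseModel):
    prompt: str
    max_tokens: int = 50
    seed: int = 42
    temperature: float = 0.0
    instrumentation: list[str] = ["attention", "residuals", "tokenizer"]

def make_run_id(req: StudyRequest) -> str:
    """R{date}-{time}-{hash}: timestamp prefix for uniqueness, request
    hash for a stable suffix given identical request bodies."""
    stamp = datetime.now(timezone.utc).strftime("R%Y-%m-%d-%H%M")
    digest = hashlib.sha1(req.model_dump_json().encode()).hexdigest()[:4]
    return f"{stamp}-{digest}"

@app.post("/analyze/study")
def analyze_study(req: StudyRequest):
    run_id = make_run_id(req)
    tokens = run_generation(req, run_id)  # hypothetical: seeds torch, runs hooks, writes zarr + JSONL
    return {
        "run_id": run_id,
        "tokens": tokens,
        "tensor_path": f"runs/{run_id}/tensors/",
        "telemetry_path": f"runs/{run_id}/telemetry.jsonl",
    }

Hashing the request body keeps the hash suffix reproducible for identical requests, while the timestamp prefix keeps Run IDs unique across runs.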

Acceptance Criteria:

  • Endpoint returns in < 5s for 50-token generation
  • Run ID is unique and reproducible with same seed
  • Telemetry JSONL created with run.start and run.end events
  • Tensors stored in zarr format

Notes:


1.5 Attention Rollout & Head Ranking

  • Implement the attention rollout algorithm (Abnar & Zuidema, 2020; sketched below)
  • Rank heads by rollout contribution (top-k = 20)
  • Store head rankings in Run ID metadata

Files: /backend/attention_analysis.py (new)

Acceptance Criteria:

  • Rollout matrix computed efficiently (< 100ms for 512 tokens)
  • Top-20 heads identified by max rollout weight
  • Rankings stored in runs/{run_id}/metadata.json
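
A sketch of the rollout computation: average attention over heads, mix in the residual connection as 0.5·I, renormalize, and multiply across layers:

import numpy as np

def attention_rollout(attn: np.ndarray) -> np.ndarray:
    """Attention rollout (Abnar & Zuidema, 2020).

    attn: (num_layers, num_heads, T, T). Returns a (T, T) matrix where
    rollout[i, j] estimates information flow from input token j to
    position i at the top layer."""
    num_layers, _, T, _ = attn.shape
    rollout = np.eye(T)
    for layer in range(num_layers):
        a = attn[layer].mean(axis=0)           # (T, T), head-averaged
        a = 0.5 * a + 0.5 * np.eye(T)          # residual connection
        a = a / a.sum(axis=-1, keepdims=True)  # re-normalize rows
        rollout = a @ rollout
    return rollout

Head ranking then scores each (layer, head) against this matrix, per the acceptance criterion, by its maximum rollout weight.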

Notes:


Week 1-2 Acceptance Criteria (Overall)

  • All 5 tasks completed
  • Latency < 250ms for ≤512 tokens (measured end-to-end)
  • Zarr storage working correctly (can reload tensors)
  • API endpoint functional (manual test via curl/Postman)
  • Run ID reproducibility verified (same seed → same output)

Blockers

  • None yet

Decisions Made

  • 2025-11-01: Using zarr instead of HDF5 for better chunking and parallel access.

Week 3: Attention Visualization

Goal: Build interactive attention heatmap, head grid, and rollout toggle.

Status: 🔴 Not Started

Tasks

3.1 Frontend: Attention Heatmap (WebGL)

  • Create /components/study/AttentionVisualization.tsx
  • Implement WebGL-based heatmap for performance
  • Add hover tooltips showing exact attention weights
  • Support aggregated (all heads) and per-head views

Files: /components/study/AttentionVisualization.tsx

Acceptance Criteria:

  • Renders a 512×512 heatmap in < 100ms
  • Hover shows source token, target token, weight
  • Toggle between aggregated and per-head

Notes:


3.2 Frontend: Head Grid (Layer × Head Matrix)

  • Display Layer × Head grid with mini-sparklines
  • Show mean attention to token classes (identifiers, operators, etc.)
  • Click head → overlay on main heatmap

Files: /components/study/HeadGrid.tsx

Acceptance Criteria:

  • Grid renders 32×32 cells in < 50ms
  • Sparklines show attention distribution
  • Click interaction works smoothly

Notes:


3.3 Attention Rollout Toggle

  • Add toggle button: Raw Attention vs Rollout
  • Fetch rollout data from backend
  • Update heatmap dynamically

Files: /components/study/AttentionVisualization.tsx

Acceptance Criteria:

  • Toggle switches view in < 100ms
  • Rollout data fetched lazily (not on initial load)

Notes:


3.4 Interactions: Brush & Pin

  • Implement brush selection on context tokens
  • Highlight downstream tokens impacted by selection
  • Add "pin" button to save source→target pair for ablation

Files: /components/study/AttentionVisualization.tsx

Acceptance Criteria:

  • Brush selection responsive (< 50ms)
  • Pinned pairs visible in sidebar
  • Pin data passed to Ablation pane

Notes:


3.5 Disclaimer & Warnings

  • Add text: "Attention is descriptive; causal claims require ablation"
  • Warn if temperature > 1.2 or top-k sampling active

Files: /components/study/AttentionVisualization.tsx

Acceptance Criteria:

  • Disclaimer visible at top of pane
  • Warnings shown contextually

Notes:


Week 3 Acceptance Criteria (Overall)

  • Attention visualization fully functional
  • Interactive latency < 150ms for all operations
  • Cross-links to Ablation pane working
  • Manual test with Code Llama 7B (50-token generation)

Blockers

Decisions Made


Week 4: Token Size & Confidence Visualization

Goal: Build token chip bar, entropy sparkline, and risk hotspot flags.

Status: 🔴 Not Started

Tasks

4.1 Frontend: Token Chip Bar

  • Create /components/study/TokenConfidenceView.tsx
  • Render tokens as chips: width = byte length, opacity = confidence
  • Add click handler to show tokenization + top-k alternatives

Files: /components/study/TokenConfidenceView.tsx

Acceptance Criteria:

  • Chips render correctly with variable widths
  • Opacity maps to confidence (1 - entropy or exp(logprob))
  • Click shows detailed panel

Notes:


4.2 Frontend: Entropy Sparkline

  • Add sparkline above/below token bar showing entropy per token
  • Highlight peaks (entropy ≥ τ_H, initially 1.5 nats)
  • Add calibration toggle (show thresholds for keywords/identifiers/operators)

Files: /components/study/TokenConfidenceView.tsx

Acceptance Criteria:

  • Sparkline renders in < 50ms
  • Peaks clearly visible
  • Threshold adjustable via slider

Notes:


4.3 Risk Hotspot Flags

  • Flag identifiers split into ≥3 subwords AND sitting on an entropy peak (see the sketch below)
  • Display a flag icon on token chips
  • Compute bug-risk AUC (requires ground-truth bug locations)

Files: /components/study/TokenConfidenceView.tsx, /backend/risk_analysis.py (new)

Acceptance Criteria:

  • Flags appear on relevant tokens
  • AUC metric computed (requires pilot data)
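
A sketch of the flagging rule and AUC computation, assuming the Week 1-2 per-token fields (multi_split, entropy) and sklearn; the scalar risk score is a placeholder to be tuned on pilot data:

import numpy as np
from sklearn.metrics import roc_auc_score

def flag_hotspots(tokens: list[dict], tau_h: float = 1.5) -> list[bool]:
    """Risk flag per token: identifier split into >=3 subwords
    (multi_split) AND entropy at or above the tau_H threshold (nats)."""
    return [t["multi_split"] and t["entropy"] >= tau_h for t in tokens]

def bug_risk_auc(tokens: list[dict], is_bug: list[bool]) -> float:
    """AUC of a scalar risk score against labeled bug locations.
    The score (entropy, doubled for multi-split tokens) is a placeholder;
    the real scoring rule is to be tuned on pilot data."""
    score = [t["entropy"] * (2.0 if t["multi_split"] else 1.0) for t in tokens]
    return roc_auc_score(np.asarray(is_bug, dtype=int), score)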

Notes:


4.4 Top-k Alternatives Panel

  • Show top-k alternatives with probabilities on token click
  • Display attention snippet (which context tokens justified each alternative)

Files: /components/study/TokenConfidenceView.tsx

Acceptance Criteria:

  • Panel shows top-3 alternatives minimum
  • Attention snippet links to Attention visualization

Notes:


4.5 Cost/Latency Estimator

  • Add widget showing cumulative decoding time
  • Estimate API cost (tokens × price per token)

Files: /components/study/TokenConfidenceView.tsx

Acceptance Criteria:

  • Time displayed in ms
  • Cost displayed in USD (or N/A for local)

Notes:


Week 4 Acceptance Criteria (Overall)

  • Token Size & Confidence view functional
  • Risk hotspots flagged correctly
  • Interactive latency < 150ms
  • Manual test with Code Llama 7B

Blockers

Decisions Made


Week 5: Ablation Visualization

Goal: Build interactive ablation controls with head toggles, layer bypass, and diff viewer.

Status: 🔴 Not Started

Tasks

5.1 Backend: Ablation Engine

  • Implement head masking (zero out or uniform attention; see the sketch below)
  • Implement layer bypass (skip layer, pass residual through)
  • Support token constraints (force/ban specific tokens)
  • Add surrogate regressor for predicted Δlog-prob

Files: /backend/ablation_engine.py (new)

Acceptance Criteria:

  • Ablation runs in < 3s for single head mask
  • Surrogate predictor accuracy > 70% (train on 100 samples)
  • Queue system for background ablation execution
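
A sketch of "zero" head masking, assuming a Llama-style layout where the input to o_proj is (B, T, num_heads × head_dim) with heads contiguous; zeroing a head's slice removes its contribution ("uniform attention" mode would instead need to intervene on the attention weights themselves):

def ablate_head(model, layer_idx: int, head_idx: int, head_dim: int):
    """'Zero' mode: mask one head's slice of the input to the attention
    output projection, removing that head's contribution to the block."""
    lo, hi = head_idx * head_dim, (head_idx + 1) * head_dim

    def pre_hook(module, args):
        (x,) = args              # (B, T, num_heads * head_dim), layout assumed
        x = x.clone()
        x[..., lo:hi] = 0.0
        return (x,)

    proj = model.model.layers[layer_idx].self_attn.o_proj  # Llama-style name, an assumption
    return proj.register_forward_pre_hook(pre_hook)

# handle = ablate_head(model, layer_idx=10, head_idx=7, head_dim=128)
# ... re-generate with the same seed and diff against the baseline ...
# handle.remove()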

Notes:


5.2 Frontend: Head Toggle Matrix

  • Create /components/study/AblationView.tsx
  • Display Layer × Head matrix with checkboxes
  • Show only top-20 heads (from Week 1-2 ranking)

Files: /components/study/AblationView.tsx

Acceptance Criteria:

  • Matrix renders in < 50ms
  • Checkboxes responsive
  • Selected heads highlighted

Notes:


5.3 Frontend: Diff Viewer

  • Show unified diff between baseline and ablated output
  • Highlight changed tokens (color-coded: added/removed/modified)
  • Display code-aware metrics (tests passed, AST parse, lints)

Files: /components/study/AblationView.tsx

Acceptance Criteria:

  • Diff renders clearly
  • Metrics displayed prominently
  • Color-coding accessible (colorblind-friendly)

Notes:


5.4 Frontend: Per-Token Delta Heat

  • Show Δlog-prob and Δentropy per token
  • Display as small multiples for most-impactful heads

Files: /components/study/AblationView.tsx

Acceptance Criteria:

  • Delta heat visible
  • Most-impactful heads identified (Δlog-prob ≥ τ_Δ)

Notes:


5.5 Integration with Attention View

  • Accept pinned source→target pairs from Attention view
  • Auto-suggest heads to ablate based on attention weights

Files: /components/study/AblationView.tsx

Acceptance Criteria:

  • Pinned pairs appear in Ablation pane
  • Suggested heads shown with explanation

Notes:


Week 5 Acceptance Criteria (Overall)

  • Ablation view functional
  • Head masking works correctly (verified with manual test)
  • Diff viewer shows meaningful changes
  • Code-aware metrics computed (AST, tests, lints)

Blockers

Decisions Made


Week 6: Pipeline Visualization

Goal: Build swimlane timeline with residual-z, entropy shift, and layer signals.

Status: 🔴 Not Started

Tasks

6.1 Backend: Layer-Level Signals

  • Compute residual-norm z-scores
  • Compute entropy shift (pre vs post-layer)
  • Compute attention-flow saturation
  • Optional: router load for MoE models

Files: /backend/pipeline_analysis.py (new)

Acceptance Criteria:

  • Signals computed in < 50ms
  • Residual-z outliers flagged (> 2σ)
  • Entropy shifts tracked per layer
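
A sketch of the first two signals, assuming the stored residual norms from Week 1-2 and, for the entropy shift, per-layer logit-lens readouts (an assumption; the spec does not fix how intermediate distributions are obtained):

import numpy as np

def residual_z(resid_norms: np.ndarray, tau_z: float = 2.0):
    """resid_norms: (num_layers, seq_len) array of ||x_l||.
    Z-score each layer's mean norm across layers; flag |z| > tau_z."""
    per_layer = resid_norms.mean(axis=1)
    z = (per_layer - per_layer.mean()) / (per_layer.std() + 1e-9)
    return z, np.abs(z) > tau_z

def entropy_shift(layer_logits: np.ndarray) -> np.ndarray:
    """layer_logits: (num_layers, seq_len, vocab) logit-lens readouts.
    Returns the per-layer change in mean next-token entropy (nats)."""
    p = np.exp(layer_logits - layer_logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    h = -(p * np.log(p + 1e-12)).sum(axis=-1).mean(axis=-1)  # (num_layers,)
    return np.diff(h, prepend=h[0])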

Notes:


6.2 Frontend: Swimlane Timeline

  • Create /components/study/PipelineView.tsx
  • Display lanes: Tokenizer → Embeddings → Layers → Logits → Sampler → Tests
  • Rectangle length = time per stage
  • Color intensity = uncertainty (entropy)

Files: /components/study/PipelineView.tsx

Acceptance Criteria:

  • Swimlane renders in < 100ms
  • Hover shows per-stage stats
  • Timeline scrubber works smoothly

Notes:


6.3 Layer Signal Overlays

  • Add overlays for residual-z, entropy shift, attention saturation
  • Toggle visibility of each signal
  • Highlight bottlenecks (top-q percentile of latency/residual-z)

Files: /components/study/PipelineView.tsx

Acceptance Criteria:

  • Overlays don't clutter visualization
  • Bottlenecks clearly marked
  • Toggle responsive

Notes:


6.4 Layer Bypass Interaction

  • Add controls to bypass ≤2 layers
  • Show predicted impact (via surrogate)
  • Execute queued ablation

Files: /components/study/PipelineView.tsx

Acceptance Criteria:

  • Bypass controls accessible
  • Predicted impact shown before execution
  • Ablation queued in background

Notes:


6.5 Cross-Links to Other Views

  • Click token → highlight in Attention and Token Confidence views
  • Integrated telemetry (track hover/click events)

Files: /components/study/PipelineView.tsx

Acceptance Criteria:

  • Cross-highlighting works
  • Telemetry logged

Notes:


Week 6 Acceptance Criteria (Overall)

  • Pipeline view functional
  • Layer signals computed correctly
  • Interactive latency < 150ms
  • Manual test with Code Llama 7B

Blockers

Decisions Made


Week 7: Pilot Study (n=3)

Goal: Run pilot with 3 participants; tune thresholds; validate latency; gather feedback.

Status: 🔴 Not Started

Tasks

7.1 Recruit Pilot Participants

  • Identify 3 software engineers (varied experience levels)
  • Schedule 90-minute sessions

Acceptance Criteria:

  • 3 participants confirmed
  • Availability scheduled

Notes:


7.2 Prepare Study Materials

  • Task T1: Code completion (sanitize_sql_like)
  • Task T2: Bug fix (reverse_string)
  • Pre-survey (demographics, LLM familiarity)
  • Post-task mini-survey (SCS, Trust, NASA-TLX)
  • Interview questions

Files: /docs/pilot-study-materials.md (new)

Acceptance Criteria:

  • Materials ready to distribute
  • Survey forms created (Google Forms or similar)

Notes:


7.3 Run Pilot Sessions

  • Session 1: Participant P01
  • Session 2: Participant P02
  • Session 3: Participant P03

Acceptance Criteria:

  • All 3 sessions completed
  • Telemetry logged
  • Surveys completed

Notes:


7.4 Analyze Pilot Data & Tune Thresholds

  • Compute latency statistics (mean, p95)
  • Tune τ_H (entropy threshold) for ~90% specificity
  • Tune τ_Δ (log-prob delta) for ablation sensitivity
  • Tune τ_z (residual-norm outlier)

Files: /docs/pilot-analysis.md (new)

Acceptance Criteria:

  • Thresholds tuned based on pilot data
  • Latency < 250ms (if not, optimize)
  • Survey completion rate ≥ 90%

Notes:


7.5 Iterate on UX

  • Add tooltips/warnings based on pilot feedback
  • Fix any UX issues (confusing interactions, unclear labels)
  • Update documentation

Acceptance Criteria:

  • At least 2 UX improvements implemented
  • Pilot participants' feedback documented

Notes:


Week 7 Acceptance Criteria (Overall)

  • Pilot study completed successfully
  • Thresholds tuned
  • Latency validated (< 250ms)
  • UX improvements identified and implemented

Blockers

Decisions Made


Week 8: Main Study Preparation

Goal: Finalize study tooling, prepare OSF pre-registration, and set up participant recruitment.

Status: 🔴 Not Started

Tasks

8.1 Survey Integration

  • Integrate SUS, NASA-TLX, SCS scales into dashboard
  • Add pre-survey and post-task mini-surveys
  • Export survey data to CSV

Files: /components/study/SurveyModal.tsx (new)

Acceptance Criteria:

  • Surveys embedded in dashboard
  • Data exported correctly

Notes:


8.2 Latin Square Counterbalancing

  • Implement Latin square assignment for task order (sketched below)
  • Randomize condition order (Baseline vs Dashboard)

Files: /lib/study-randomization.ts (new)

Acceptance Criteria:

  • Counterbalancing correct (verified manually)
  • Each participant assigned a randomized ID (P01-P24)
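
The deliverable is TypeScript (/lib/study-randomization.ts), but the assignment logic is sketched here in Python for brevity, using a cyclic Latin square:

def latin_square(n: int) -> list[list[int]]:
    """Cyclic Latin square: each condition appears exactly once per
    row (participant) and once per column (position)."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]

def assign(participant_idx: int,
           tasks=("T1", "T2"),
           conditions=("Baseline", "Dashboard")) -> dict:
    """Counterbalance task order via the Latin square and alternate
    condition order between successive participants."""
    row = latin_square(len(tasks))[participant_idx % len(tasks)]
    cond = list(conditions) if participant_idx % 2 == 0 else list(reversed(conditions))
    return {
        "id": f"P{participant_idx + 1:02d}",
        "task_order": [tasks[k] for k in row],
        "condition_order": cond,
    }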

Notes:


8.3 OSF Pre-Registration

  • Complete OSF template (Appendix D from spec)
  • Upload task stimuli, exclusion criteria
  • Submit pre-registration

Files: /docs/osf-preregistration.md (copy of Appendix D)

Acceptance Criteria:

  • Pre-registration submitted before main study
  • DOI obtained

Notes:


8.4 Export Artifact Bundle

  • Create a script to package the Run ID, tensors, and telemetry (sketched below)
  • Generate a run_pack_{participant}.zip bundle per participant (e.g., run_pack_P01.zip)
  • Test import into OSF

Files: /scripts/export_artifact.py (new)

Acceptance Criteria:

  • Export script functional
  • Bundle includes all necessary files
  • Bundle < 100MB per participant
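
A sketch of the export script; paths mirror the runs/{run_id}/ layout from Week 1-2:

import zipfile
from pathlib import Path

def export_run_pack(run_id: str, participant: str, out_dir: str = "exports") -> Path:
    """Bundle a run's tensors, telemetry, and metadata into
    run_pack_{participant}.zip for OSF upload."""
    run_dir = Path("runs") / run_id
    out = Path(out_dir) / f"run_pack_{participant}.zip"
    out.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(run_dir.rglob("*")):
            if path.is_file():
                zf.write(path, arcname=path.relative_to(run_dir.parent))
    return out

# export_run_pack("R2025-11-01-1430-a7f3", "P01")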

Notes:


8.5 Participant Recruitment

  • Prepare recruitment email
  • Post to developer communities (Reddit, Hacker News, university mailing lists)
  • Target n=18-24 participants

Acceptance Criteria:

  • Recruitment materials ready
  • At least 10 participants confirmed

Notes:


Week 8 Acceptance Criteria (Overall)

  • Study tooling finalized
  • OSF pre-registration submitted
  • Participant recruitment underway
  • Ready to begin main study (Week 9-10)

Blockers

Decisions Made


Progress Summary

| Week | Status | Completion Date | Notes |
| --- | --- | --- | --- |
| Week 1-2: Instrumentation | 🟡 In Progress | - | Started 2025-11-01 |
| Week 3: Attention Viz | 🔴 Not Started | - | - |
| Week 4: Token Confidence Viz | 🔴 Not Started | - | - |
| Week 5: Ablation Viz | 🔴 Not Started | - | - |
| Week 6: Pipeline Viz | 🔴 Not Started | - | - |
| Week 7: Pilot Study | 🔴 Not Started | - | - |
| Week 8: Main Study Prep | 🔴 Not Started | - | - |

Legend:

  • 🟢 Completed
  • 🟡 In Progress
  • 🔴 Not Started
  • 🔵 Blocked

Global Blockers

None currently


Key Metrics (Target vs Actual)

| Metric | Target | Actual | Status |
| --- | --- | --- | --- |
| Initial render latency (≤512 tokens) | < 250ms | - | - |
| Interactive update latency | < 150ms | - | - |
| Zarr file size (512 tokens, 32 layers) | < 500MB | - | - |
| Zarr load time (single layer/head slice) | < 50ms | - | - |
| Attention rollout computation | < 100ms | - | - |
| Ablation execution time | < 3s | - | - |

Notes & Decisions Log

2025-11-01

  • Decision: Using zarr instead of HDF5 for tensor storage due to better chunking and parallel access.
  • Decision: Targeting top-k=20 heads for ablation UI (performance constraint).
  • Note: Started Week 1-2 instrumentation tasks.

End of Implementation Tracker