# Implementation Tracker: Glass-Box Dashboard
**Project:** PhD Study - Making Architecture Transparent for Code Generation
**Timeline:** 8 weeks (November 2025 - December 2025)
**Status:** Week 1 - In Progress
**Last Updated:** 2025-11-01
---
## Overview
This document tracks progress through the 8-week implementation plan outlined in the PhD Study Specification. Each week has specific deliverables, acceptance criteria, and links to relevant code/files.
---
## Week 1-2: Core Model Instrumentation
**Goal:** Implement PyTorch hooks, tokenizer instrumentation, zarr storage, and minimal API endpoint.
**Status:** 🟡 In Progress
### Tasks
#### 1.1 PyTorch Hooks for Attention & Residuals
- [ ] Add forward hooks to capture attention tensors `A[L,H,T,T]`
- [ ] Capture residual norms `||x_l||` per layer
- [ ] Capture logits, logprobs, entropy per token
- [ ] Record timing per layer (latency profiling)
- [ ] Optional: FFN activations for future SAE integration
**Files:** `/backend/model_service.py`, `/backend/instrumentation.py` (new)
**Acceptance Criteria:**
- Attention tensors stored with shape (num_layers, num_heads, seq_len, seq_len)
- Residual norms array with shape (num_layers, seq_len)
- Per-token metadata includes logprob, entropy, timing
- Per-layer hook overhead < 10ms on average
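A minimal sketch of the hook side, assuming a Hugging Face LLaMA-style decoder run with `output_attentions=True`; the module paths (`model.model.layers[i].self_attn`) are assumptions, not something the spec pins down. Per-token logprobs/entropy come from the final logits rather than hooks, so they are omitted here.

```python
# Sketch: forward hooks capturing attention weights and residual norms.
# Assumes a LLaMA-style layout and that attention weights appear in the
# attention module's output tuple when output_attentions=True.
class InstrumentationRecorder:
    def __init__(self):
        self.attentions = []      # per layer: (num_heads, T, T)
        self.residual_norms = []  # per layer: (T,)
        self._handles = []

    def attach(self, model):
        for layer in model.model.layers:  # module path is an assumption
            self._handles.append(
                layer.self_attn.register_forward_hook(self._attn_hook))
            self._handles.append(
                layer.register_forward_hook(self._residual_hook))

    def _attn_hook(self, module, inputs, output):
        attn = output[1]  # (B, H, T, T) when output_attentions=True
        if attn is not None:
            self.attentions.append(attn.detach().squeeze(0).cpu())

    def _residual_hook(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # ||x_l|| per token: L2 norm over the hidden dimension
        self.residual_norms.append(
            hidden.detach().norm(dim=-1).squeeze(0).cpu())

    def detach(self):
        for h in self._handles:
            h.remove()
        self._handles.clear()
```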
**Notes:**
---
#### 1.2 Tokenizer Instrumentation
- [ ] Capture BPE/SentencePiece subword splits
- [ ] Record byte length per token
- [ ] Store token IDs and text
- [ ] Identify multi-split identifiers (≥3 subwords)
**Files:** `/backend/tokenizer_utils.py` (new)
**Acceptance Criteria:**
- Each token has `bpe: [subword1, subword2, ...]` field
- Byte length calculated correctly (matches `len(token.encode('utf-8'))`)
- Multi-split identifiers flagged with `multi_split: true`
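A sketch of the per-token records and the multi-split flag, assuming a Hugging Face tokenizer; byte lengths are computed on the decoded piece, which is an approximation for byte-level BPE marker characters.

```python
import re

def token_records(tokenizer, code: str):
    """One record per token: id, text, and UTF-8 byte length."""
    ids = tokenizer(code, add_special_tokens=False)["input_ids"]
    return [
        {"id": i,
         "text": tokenizer.decode([i]),
         "byte_len": len(tokenizer.decode([i]).encode("utf-8"))}
        for i in ids
    ]

def identifier_splits(tokenizer, code: str):
    """Per-identifier BPE splits; multi_split flags >= 3 subwords."""
    out = []
    for m in re.finditer(r"[A-Za-z_]\w*", code):
        sub_ids = tokenizer(m.group(), add_special_tokens=False)["input_ids"]
        out.append({
            "identifier": m.group(),
            "span": [m.start(), m.end()],
            "bpe": [tokenizer.decode([i]) for i in sub_ids],
            "multi_split": len(sub_ids) >= 3,
        })
    return out
```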
**Notes:**
---
#### 1.3 Zarr/Memmap Storage Layer
- [ ] Implement zarr writer with chunking strategy `(layer, head)`
- [ ] Create directory structure: `runs/{run_id}/tensors/`
- [ ] Store attention, residuals, logits as zarr arrays
- [ ] Implement lazy loading for frontend access
**Files:** `/backend/storage.py` (new), `/backend/zarr_utils.py` (new)
**Acceptance Criteria:**
- Zarr arrays created with correct chunking
- File size reasonable (< 500MB for 512 token generation with 32 layers)
- Load time < 50ms for single layer/head slice
- Compression ratio > 3x (use Blosc)
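A sketch of the writer, assuming the zarr v2 API plus numcodecs for Blosc; array names and the directory layout follow the structure above.

```python
import zarr
from numcodecs import Blosc

def write_run_tensors(run_id, attention, residual_norms, logits):
    """attention: (L, H, T, T); residual_norms: (L, T); logits: (T, V)."""
    store = zarr.DirectoryStore(f"runs/{run_id}/tensors")
    root = zarr.group(store=store, overwrite=True)
    compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.BITSHUFFLE)
    T = attention.shape[-1]
    root.create_dataset("attention", data=attention,
                        chunks=(1, 1, T, T),  # one chunk per (layer, head)
                        compressor=compressor)
    root.create_dataset("residual_norms", data=residual_norms,
                        chunks=(1, T), compressor=compressor)
    root.create_dataset("logits", data=logits,
                        chunks=(T, logits.shape[-1]), compressor=compressor)
```

Lazy loading then falls out of the chunking: `zarr.open_group(f"runs/{run_id}/tensors")["attention"][layer, head]` reads a single chunk from disk.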
**Notes:**
---
#### 1.4 Minimal API Endpoint `/analyze/study`
- [ ] Create POST endpoint accepting prompt + generation params
- [ ] Generate Run ID (format: `R{date}-{time}-{hash}`)
- [ ] Implement deterministic generation (fixed seed)
- [ ] Return minimal data contract JSON
- [ ] Store telemetry (JSONL format)
**Files:** `/backend/model_service.py`
**API Contract:**
```json
POST /analyze/study

Request:
{
  "prompt": "def factorial(n):",
  "max_tokens": 50,
  "seed": 42,
  "temperature": 0.0,
  "instrumentation": ["attention", "residuals", "tokenizer"]
}

Response:
{
  "run_id": "R2025-11-01-1430-a7f3",
  "tokens": [...],  // minimal data contract
  "tensor_path": "runs/R2025-11-01-1430-a7f3/tensors/",
  "telemetry_path": "runs/R2025-11-01-1430-a7f3/telemetry.jsonl"
}
```
**Acceptance Criteria:**
- Endpoint returns in < 5s for 50-token generation
- Run ID is unique and reproducible with same seed
- Telemetry JSONL created with `run.start` and `run.end` events
- Tensors stored in zarr format
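A sketch of the endpoint, assuming FastAPI (the spec does not name a framework). Generation is elided; the run-ID scheme hashes the request so the suffix is reproducible for identical inputs while the timestamp keeps IDs unique.

```python
import hashlib, json, os, time
from datetime import datetime
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StudyRequest(BaseModel):
    prompt: str
    max_tokens: int = 50
    seed: int = 42
    temperature: float = 0.0
    instrumentation: list[str] = ["attention", "residuals", "tokenizer"]

def make_run_id(req: StudyRequest) -> str:
    # Hash of the generation parameters -> reproducible 4-char suffix
    key = f"{req.prompt}|{req.max_tokens}|{req.seed}|{req.temperature}"
    digest = hashlib.sha1(key.encode()).hexdigest()[:4]
    return datetime.now().strftime("R%Y-%m-%d-%H%M") + f"-{digest}"

def log_event(path: str, event: str, **fields):
    with open(path, "a") as f:
        f.write(json.dumps({"event": event, "ts": time.time(), **fields}) + "\n")

@app.post("/analyze/study")
def analyze_study(req: StudyRequest):
    run_id = make_run_id(req)
    os.makedirs(f"runs/{run_id}/tensors", exist_ok=True)
    telemetry = f"runs/{run_id}/telemetry.jsonl"
    log_event(telemetry, "run.start", run_id=run_id)
    tokens = []  # deterministic generation + instrumentation happen here
    log_event(telemetry, "run.end", run_id=run_id)
    return {"run_id": run_id, "tokens": tokens,
            "tensor_path": f"runs/{run_id}/tensors/",
            "telemetry_path": telemetry}
```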
**Notes:**
---
#### 1.5 Attention Rollout & Head Ranking
- [ ] Implement attention rollout algorithm (Abnar & Zuidema style)
- [ ] Rank heads by rollout contribution (top-k = 20)
- [ ] Store head rankings in Run ID metadata
**Files:** `/backend/attention_analysis.py` (new)
**Acceptance Criteria:**
- Rollout matrix computed efficiently (< 100ms for 512 tokens)
- Top-20 heads identified by max rollout weight
- Rankings stored in `runs/{run_id}/metadata.json`
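A sketch of the rollout computation plus one simple head-ranking proxy; "rollout contribution" could equally be computed leave-one-head-out, so treat the scoring choice as an assumption.

```python
import numpy as np

def attention_rollout(attn):
    """attn: (L, H, T, T) -> (T, T) rollout; [i, j] = flow from j to i."""
    T = attn.shape[-1]
    rollout = np.eye(T)
    for layer in attn:
        a = layer.mean(axis=0)                 # average over heads
        a = a + np.eye(T)                      # add the residual connection
        a = a / a.sum(axis=-1, keepdims=True)  # renormalize rows
        rollout = a @ rollout                  # compose with earlier layers
    return rollout

def top_heads(attn, k=20):
    """Rank (layer, head) pairs by each head's max attention weight."""
    scores = attn.max(axis=(2, 3))             # (L, H)
    order = np.argsort(scores, axis=None)[::-1][:k]
    return [tuple(divmod(int(i), attn.shape[1])) for i in order]
```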
**Notes:**
---
### Week 1-2 Acceptance Criteria (Overall)
- [ ] All 5 tasks completed
- [ ] Latency < 250ms for ≤512 tokens (measured end-to-end)
- [ ] Zarr storage working correctly (can reload tensors)
- [ ] API endpoint functional (manual test via curl/Postman)
- [ ] Run ID reproducibility verified (same seed → same output)
### Blockers
- **None yet**
### Decisions Made
- **2025-11-01:** Using zarr instead of HDF5 for better chunking and parallel access.
---
## Week 3: Attention Visualization
**Goal:** Build interactive attention heatmap, head grid, and rollout toggle.
**Status:** 🔴 Not Started
### Tasks
#### 3.1 Frontend: Attention Heatmap (WebGL)
- [ ] Create `/components/study/AttentionVisualization.tsx`
- [ ] Implement WebGL-based heatmap for performance
- [ ] Add hover tooltips showing exact attention weights
- [ ] Support aggregated (all heads) and per-head views
**Files:** `/components/study/AttentionVisualization.tsx`
**Acceptance Criteria:**
- Renders 512x512 heatmap in < 100ms
- Hover shows source token, target token, weight
- Toggle between aggregated and per-head
**Notes:**
---
#### 3.2 Frontend: Head Grid (Layer × Head Matrix)
- [ ] Display Layer × Head grid with mini-sparklines
- [ ] Show mean attention to token classes (identifiers, operators, etc.)
- [ ] Click head → overlay on main heatmap
**Files:** `/components/study/HeadGrid.tsx`
**Acceptance Criteria:**
- Grid renders 32×32 cells in < 50ms
- Sparklines show attention distribution
- Click interaction works smoothly
**Notes:**
---
#### 3.3 Attention Rollout Toggle
- [ ] Add toggle button: Raw Attention vs Rollout
- [ ] Fetch rollout data from backend
- [ ] Update heatmap dynamically
**Files:** `/components/study/AttentionVisualization.tsx`
**Acceptance Criteria:**
- Toggle switches view in < 100ms
- Rollout data fetched lazily (not on initial load)
**Notes:**
---
#### 3.4 Interactions: Brush & Pin
- [ ] Implement brush selection on context tokens
- [ ] Highlight downstream tokens impacted by selection
- [ ] Add "pin" button to save source→target pair for ablation
**Files:** `/components/study/AttentionVisualization.tsx`
**Acceptance Criteria:**
- Brush selection responsive (< 50ms)
- Pinned pairs visible in sidebar
- Pin data passed to Ablation pane
**Notes:**
---
#### 3.5 Disclaimer & Warnings
- [ ] Add text: "Attention is descriptive; causal claims require ablation"
- [ ] Warn if temperature > 1.2 or top-k sampling active
**Files:** `/components/study/AttentionVisualization.tsx`
**Acceptance Criteria:**
- Disclaimer visible at top of pane
- Warnings shown contextually
**Notes:**
---
### Week 3 Acceptance Criteria (Overall)
- [ ] Attention visualization fully functional
- [ ] Interactive latency < 150ms for all operations
- [ ] Cross-links to Ablation pane working
- [ ] Manual test with Code Llama 7B (50-token generation)
### Blockers
### Decisions Made
---
## Week 4: Token Size & Confidence Visualization
**Goal:** Build token chip bar, entropy sparkline, and risk hotspot flags.
**Status:** 🔴 Not Started
### Tasks
#### 4.1 Frontend: Token Chip Bar
- [ ] Create `/components/study/TokenConfidenceView.tsx`
- [ ] Render tokens as chips: width = byte length, opacity = confidence
- [ ] Add click handler to show tokenization + top-k alternatives
**Files:** `/components/study/TokenConfidenceView.tsx`
**Acceptance Criteria:**
- Chips render correctly with variable widths
- Opacity maps to confidence (exp(logprob), or 1 - normalized entropy)
- Click shows detailed panel
**Notes:**
---
#### 4.2 Frontend: Entropy Sparkline
- [ ] Add sparkline above/below token bar showing entropy per token
- [ ] Highlight peaks (entropy ≥ τ_H, initially 1.5 nats)
- [ ] Add calibration toggle (show thresholds for keywords/identifiers/operators)
**Files:** `/components/study/TokenConfidenceView.tsx`
**Acceptance Criteria:**
- Sparkline renders in < 50ms
- Peaks clearly visible
- Threshold adjustable via slider
**Notes:**
---
#### 4.3 Risk Hotspot Flags
- [ ] Flag identifiers split into ≥3 subwords AND entropy peak
- [ ] Display flag icon on token chips
- [ ] Compute Bug-risk AUC (requires ground truth bug locations)
**Files:** `/components/study/TokenConfidenceView.tsx`, `/backend/risk_analysis.py` (new)
**Acceptance Criteria:**
- Flags appear on relevant tokens
- AUC metric computed (requires pilot data)
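A sketch of the hotspot rule and the AUC check for `/backend/risk_analysis.py`, assuming per-token records carry `entropy` and `multi_split` fields and, for the AUC, a ground-truth `is_bug` label from the pilot data.

```python
from sklearn.metrics import roc_auc_score

TAU_H = 1.5  # initial entropy threshold in nats (tuned in Week 7)

def flag_hotspots(tokens, tau_h=TAU_H):
    for t in tokens:
        t["risk_flag"] = bool(t.get("multi_split")) and t["entropy"] >= tau_h
    return tokens

def bug_risk_auc(tokens):
    # Use entropy as the continuous risk score; the binary flag alone
    # would give a nearly degenerate ROC curve.
    y_true = [int(t["is_bug"]) for t in tokens]
    y_score = [t["entropy"] for t in tokens]
    return roc_auc_score(y_true, y_score)
```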
**Notes:**
---
#### 4.4 Top-k Alternatives Panel
- [ ] Show top-k alternatives with probabilities on token click
- [ ] Display attention snippet (which context tokens justified each alternative)
**Files:** `/components/study/TokenConfidenceView.tsx`
**Acceptance Criteria:**
- Panel shows top-3 alternatives minimum
- Attention snippet links to Attention visualization
**Notes:**
---
#### 4.5 Cost/Latency Estimator
- [ ] Add widget showing cumulative decoding time
- [ ] Estimate API cost (tokens × price per token)
**Files:** `/components/study/TokenConfidenceView.tsx`
**Acceptance Criteria:**
- Time displayed in ms
- Cost displayed in USD (or N/A for local)
**Notes:**
---
### Week 4 Acceptance Criteria (Overall)
- [ ] Token Size & Confidence view functional
- [ ] Risk hotspots flagged correctly
- [ ] Interactive latency < 150ms
- [ ] Manual test with Code Llama 7B
### Blockers
### Decisions Made
---
## Week 5: Ablation Visualization
**Goal:** Build interactive ablation controls with head toggles, layer bypass, and diff viewer.
**Status:** 🔴 Not Started
### Tasks
#### 5.1 Backend: Ablation Engine
- [ ] Implement head masking (zero out or uniform attention)
- [ ] Implement layer bypass (skip layer, pass residual through)
- [ ] Support token constraints (force/ban specific tokens)
- [ ] Add surrogate regressor for predicted Δlog-prob
**Files:** `/backend/ablation_engine.py` (new)
**Acceptance Criteria:**
- Ablation runs in < 3s for single head mask
- Surrogate predictor accuracy > 70% (train on 100 samples)
- Queue system for background ablation execution
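A sketch of zero-style head masking, done as a pre-hook on the attention output projection; it assumes a LLaMA-style layout where `o_proj`'s input concatenates per-head outputs. Uniform-attention masking would instead patch the attention weights themselves.

```python
def mask_head(model, layer: int, head: int):
    """Zero one head's contribution; returns a removable hook handle."""
    attn = model.model.layers[layer].self_attn  # module path is an assumption
    d = attn.head_dim

    def pre_hook(module, args):
        (hidden,) = args                        # (B, T, num_heads * d)
        hidden = hidden.clone()                 # don't mutate the caller's tensor
        hidden[..., head * d:(head + 1) * d] = 0.0
        return (hidden,)

    return attn.o_proj.register_forward_pre_hook(pre_hook)

# Usage: handle = mask_head(model, 12, 3); re-generate; handle.remove()
```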
**Notes:**
---
#### 5.2 Frontend: Head Toggle Matrix
- [ ] Create `/components/study/AblationView.tsx`
- [ ] Display Layer × Head matrix with checkboxes
- [ ] Show only top-20 heads (from Week 1-2 ranking)
**Files:** `/components/study/AblationView.tsx`
**Acceptance Criteria:**
- Matrix renders in < 50ms
- Checkboxes responsive
- Selected heads highlighted
**Notes:**
---
#### 5.3 Frontend: Diff Viewer
- [ ] Show unified diff between baseline and ablated output
- [ ] Highlight changed tokens (color-coded: added/removed/modified)
- [ ] Display code-aware metrics (tests passed, AST parse, lints)
**Files:** `/components/study/AblationView.tsx`
**Acceptance Criteria:**
- Diff renders clearly
- Metrics displayed prominently
- Color-coding accessible (colorblind-friendly)
**Notes:**
---
#### 5.4 Frontend: Per-Token Delta Heat
- [ ] Show Δlog-prob and Δentropy per token
- [ ] Display as small multiples for most-impactful heads
**Files:** `/components/study/AblationView.tsx`
**Acceptance Criteria:**
- Delta heat visible
- Most-impactful heads identified (Δlog-prob ≥ τ_Δ)
**Notes:**
---
#### 5.5 Integration with Attention View
- [ ] Accept pinned source→target pairs from Attention view
- [ ] Auto-suggest heads to ablate based on attention weights
**Files:** `/components/study/AblationView.tsx`
**Acceptance Criteria:**
- Pinned pairs appear in Ablation pane
- Suggested heads shown with explanation
**Notes:**
---
### Week 5 Acceptance Criteria (Overall)
- [ ] Ablation view functional
- [ ] Head masking works correctly (verified with manual test)
- [ ] Diff viewer shows meaningful changes
- [ ] Code-aware metrics computed (AST, tests, lints)
### Blockers
### Decisions Made
---
## Week 6: Pipeline Visualization
**Goal:** Build swimlane timeline with residual-z, entropy shift, and layer signals.
**Status:** 🔴 Not Started
### Tasks
#### 6.1 Backend: Layer-Level Signals
- [ ] Compute residual-norm z-scores
- [ ] Compute entropy shift (pre vs post-layer)
- [ ] Compute attention-flow saturation
- [ ] Optional: router load for MoE models
**Files:** `/backend/pipeline_analysis.py` (new)
**Acceptance Criteria:**
- Signals computed in < 50ms
- Residual-z outliers flagged (> 2σ)
- Entropy shifts tracked per layer
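A sketch of the signal computations; the entropy shift assumes per-layer token distributions from a logit-lens readout, which is an assumption the spec does not mandate.

```python
import numpy as np

def residual_z(norms):
    """norms: (L, T) -> z-scores across layers; flag |z| > 2 as outliers."""
    mu = norms.mean(axis=0, keepdims=True)
    sd = norms.std(axis=0, keepdims=True) + 1e-8
    return (norms - mu) / sd

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_shift(layer_probs):
    """layer_probs: (L, T, V) -> (L-1, T), post-minus-pre per layer."""
    return np.diff(entropy(layer_probs), axis=0)
```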
**Notes:**
---
#### 6.2 Frontend: Swimlane Timeline
- [ ] Create `/components/study/PipelineView.tsx`
- [ ] Display lanes: Tokenizer → Embeddings → Layers → Logits → Sampler → Tests
- [ ] Rectangle length = time per stage
- [ ] Color intensity = uncertainty (entropy)
**Files:** `/components/study/PipelineView.tsx`
**Acceptance Criteria:**
- Swimlane renders in < 100ms
- Hover shows per-stage stats
- Timeline scrubber works smoothly
**Notes:**
---
#### 6.3 Layer Signal Overlays
- [ ] Add overlays for residual-z, entropy shift, attention saturation
- [ ] Toggle visibility of each signal
- [ ] Highlight bottlenecks (top-q percentile of latency/residual-z)
**Files:** `/components/study/PipelineView.tsx`
**Acceptance Criteria:**
- Overlays don't clutter visualization
- Bottlenecks clearly marked
- Toggle responsive
**Notes:**
---
#### 6.4 Layer Bypass Interaction
- [ ] Add controls to bypass ≤2 layers
- [ ] Show predicted impact (via surrogate)
- [ ] Execute queued ablation
**Files:** `/components/study/PipelineView.tsx`
**Acceptance Criteria:**
- Bypass controls accessible
- Predicted impact shown before execution
- Ablation queued in background
**Notes:**
---
#### 6.5 Cross-Links to Other Views
- [ ] Click token → highlight in Attention and Token Confidence views
- [ ] Integrated telemetry (track hover/click events)
**Files:** `/components/study/PipelineView.tsx`
**Acceptance Criteria:**
- Cross-highlighting works
- Telemetry logged
**Notes:**
---
### Week 6 Acceptance Criteria (Overall)
- [ ] Pipeline view functional
- [ ] Layer signals computed correctly
- [ ] Interactive latency < 150ms
- [ ] Manual test with Code Llama 7B
### Blockers
### Decisions Made
---
## Week 7: Pilot Study (n=3)
**Goal:** Run pilot with 3 participants; tune thresholds; validate latency; gather feedback.
**Status:** 🔴 Not Started
### Tasks
#### 7.1 Recruit Pilot Participants
- [ ] Identify 3 software engineers (varied experience levels)
- [ ] Schedule 90-minute sessions
**Acceptance Criteria:**
- 3 participants confirmed
- Availability scheduled
**Notes:**
---
#### 7.2 Prepare Study Materials
- [ ] Task T1: Code completion (sanitize_sql_like)
- [ ] Task T2: Bug fix (reverse_string)
- [ ] Pre-survey (demographics, LLM familiarity)
- [ ] Post-task mini-survey (SCS, Trust, NASA-TLX)
- [ ] Interview questions
**Files:** `/docs/pilot-study-materials.md` (new)
**Acceptance Criteria:**
- Materials ready to distribute
- Survey forms created (Google Forms or similar)
**Notes:**
---
#### 7.3 Run Pilot Sessions
- [ ] Session 1: Participant P01
- [ ] Session 2: Participant P02
- [ ] Session 3: Participant P03
**Acceptance Criteria:**
- All 3 sessions completed
- Telemetry logged
- Surveys completed
**Notes:**
---
#### 7.4 Analyze Pilot Data & Tune Thresholds
- [ ] Compute latency statistics (mean, p95)
- [ ] Tune τ_H (entropy threshold) for ~90% specificity
- [ ] Tune τ_Δ (log-prob delta) for ablation sensitivity
- [ ] Tune τ_z (residual-norm outlier)
**Files:** `/docs/pilot-analysis.md` (new)
**Acceptance Criteria:**
- Thresholds tuned based on pilot data
- Latency < 250ms (if not, optimize)
- Survey completion rate ≥ 90%
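A sketch of the τ_H tuning rule, assuming the pilot yields per-token entropies with bug/no-bug labels: place the cutoff at the 90th percentile of the clean tokens, so ~90% of them fall below it and are not flagged (~90% specificity).

```python
import numpy as np

def tune_tau_h(entropies, is_bug, specificity=0.90):
    clean = np.asarray(entropies)[~np.asarray(is_bug, dtype=bool)]
    return float(np.quantile(clean, specificity))
```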
**Notes:**
---
#### 7.5 Iterate on UX
- [ ] Add tooltips/warnings based on pilot feedback
- [ ] Fix any UX issues (confusing interactions, unclear labels)
- [ ] Update documentation
**Acceptance Criteria:**
- At least 2 UX improvements implemented
- Pilot participants' feedback documented
**Notes:**
---
### Week 7 Acceptance Criteria (Overall)
- [ ] Pilot study completed successfully
- [ ] Thresholds tuned
- [ ] Latency validated (< 250ms)
- [ ] UX improvements identified and implemented
### Blockers
### Decisions Made
---
## Week 8: Main Study Preparation
**Goal:** Finalize study tooling, prepare OSF pre-registration, and set up participant recruitment.
**Status:** 🔴 Not Started
### Tasks
#### 8.1 Survey Integration
- [ ] Integrate SUS, NASA-TLX, SCS scales into dashboard
- [ ] Add pre-survey and post-task mini-surveys
- [ ] Export survey data to CSV
**Files:** `/components/study/SurveyModal.tsx` (new)
**Acceptance Criteria:**
- Surveys embedded in dashboard
- Data exported correctly
**Notes:**
---
#### 8.2 Latin Square Counterbalancing
- [ ] Implement Latin square assignment for task order
- [ ] Randomize condition order (Baseline vs Dashboard)
**Files:** `/lib/study-randomization.ts` (new)
**Acceptance Criteria:**
- Counterbalancing correct (verified manually)
- Participant assigned random ID (P01-P24)
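The assignment logic, sketched in Python for brevity (the deliverable lives in `/lib/study-randomization.ts` and would mirror this; the exact task/condition crossing is an assumption).

```python
def latin_square(n):
    # Row i is a cyclic shift of 0..n-1, so each entry appears exactly
    # once per row and once per column.
    return [[(i + j) % n for j in range(n)] for i in range(n)]

def assign(index, tasks=("T1", "T2"), conditions=("Baseline", "Dashboard")):
    order = latin_square(len(tasks))[index % len(tasks)]
    return {
        "pid": f"P{index + 1:02d}",
        "task_order": [tasks[i] for i in order],
        "first_condition": conditions[index % len(conditions)],
    }
```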
**Notes:**
---
#### 8.3 OSF Pre-Registration
- [ ] Complete OSF template (Appendix D from spec)
- [ ] Upload task stimuli, exclusion criteria
- [ ] Submit pre-registration
**Files:** `/docs/osf-preregistration.md` (copy of Appendix D)
**Acceptance Criteria:**
- Pre-registration submitted before main study
- DOI obtained
**Notes:**
---
#### 8.4 Export Artifact Bundle
- [ ] Create script to package Run ID, tensors, telemetry
- [ ] Generate `run_pack_P01.zip` for each participant
- [ ] Test import into OSF
**Files:** `/scripts/export_artifact.py` (new)
**Acceptance Criteria:**
- Export script functional
- Bundle includes all necessary files
- Bundle < 100MB per participant
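A sketch for `/scripts/export_artifact.py`; paths follow the `runs/{run_id}/` layout from Week 1-2, and the participant-to-run mapping is assumed to be handled by the caller.

```python
import zipfile
from pathlib import Path

def export_run_pack(run_id: str, participant: str, out_dir: str = "exports"):
    run_dir = Path("runs") / run_id
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pack = out / f"run_pack_{participant}.zip"
    with zipfile.ZipFile(pack, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in run_dir.rglob("*"):  # tensors/, telemetry.jsonl, metadata.json
            if path.is_file():
                zf.write(path, path.relative_to(run_dir.parent))
    return pack

# Usage: export_run_pack("R2025-11-01-1430-a7f3", "P01")
```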
**Notes:**
---
#### 8.5 Participant Recruitment
- [ ] Prepare recruitment email
- [ ] Post to developer communities (Reddit, HackerNews, university mailing lists)
- [ ] Target n=18-24 participants
**Acceptance Criteria:**
- Recruitment materials ready
- At least 10 participants confirmed
**Notes:**
---
### Week 8 Acceptance Criteria (Overall)
- [ ] Study tooling finalized
- [ ] OSF pre-registration submitted
- [ ] Participant recruitment underway
- [ ] Ready to begin main study (Week 9-10)
### Blockers
### Decisions Made
---
## Progress Summary
| Week | Status | Completion Date | Notes |
|------|--------|----------------|-------|
| Week 1-2: Instrumentation | 🟡 In Progress | - | Started 2025-11-01 |
| Week 3: Attention Viz | 🔴 Not Started | - | - |
| Week 4: Token Confidence Viz | 🔴 Not Started | - | - |
| Week 5: Ablation Viz | 🔴 Not Started | - | - |
| Week 6: Pipeline Viz | 🔴 Not Started | - | - |
| Week 7: Pilot Study | 🔴 Not Started | - | - |
| Week 8: Main Study Prep | 🔴 Not Started | - | - |
**Legend:**
- 🟢 Completed
- 🟡 In Progress
- 🔴 Not Started
- 🔵 Blocked
---
## Global Blockers
*None currently*
---
## Key Metrics (Target vs Actual)
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Initial render latency (≤512 tokens) | < 250ms | - | - |
| Interactive update latency | < 150ms | - | - |
| Zarr file size (512 tokens, 32 layers) | < 500MB | - | - |
| Zarr load time (single layer/head) | < 50ms | - | - |
| Attention rollout computation | < 100ms | - | - |
| Ablation execution time | < 3s | - | - |
---
## Notes & Decisions Log
### 2025-11-01
- **Decision:** Using zarr instead of HDF5 for tensor storage due to better chunking and parallel access.
- **Decision:** Targeting top-k=20 heads for ablation UI (performance constraint).
- **Note:** Started Week 1-2 instrumentation tasks.
---
**End of Implementation Tracker**