Implementation Tracker: Glass-Box Dashboard
Project: PhD Study - Making Architecture Transparent for Code Generation
Timeline: 8 weeks (November 2025 - December 2025)
Status: Week 1 - In Progress
Last Updated: 2025-11-01
Overview
This document tracks progress through the 8-week implementation plan outlined in the PhD Study Specification. Each week has specific deliverables, acceptance criteria, and links to relevant code/files.
Week 1-2: Core Model Instrumentation
Goal: Implement PyTorch hooks, tokenizer instrumentation, zarr storage, and minimal API endpoint.
Status: 🟡 In Progress
Tasks
1.1 PyTorch Hooks for Attention & Residuals
- Add forward hooks to capture attention tensors `A[L,H,T,T]`
- Capture residual norms `||x_l||` per layer
- Capture logits, logprobs, entropy per token
- Record timing per layer (latency profiling)
- Optional: FFN activations for future SAE integration
Files: /backend/model_service.py, /backend/instrumentation.py (new)
Acceptance Criteria:
- Attention tensors stored with shape (num_layers, num_heads, seq_len, seq_len)
- Residual norms array with shape (num_layers, seq_len)
- Per-token metadata includes logprob, entropy, timing
- Per-layer hook overhead < 10ms on average
Notes:
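The hook pattern for task 1.1 can be sketched as follows. This is a minimal sketch using a toy single-head attention module; the real model's module paths (e.g. per-layer `self_attn` submodules) and head dimension are assumptions and will differ.

```python
import torch
import torch.nn as nn

# Toy stand-in for one attention layer; the real model's module names
# (e.g. model.model.layers[i].self_attn) will differ.
class TinyAttn(nn.Module):
    def forward(self, x):
        # x: (T, d) -> output (T, d), attention weights (T, T)
        scores = x @ x.T / x.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return attn @ x, attn

captured = {"attn": [], "resid_norm": []}

def attn_hook(module, inputs, output):
    out, attn = output
    captured["attn"].append(attn.detach())                    # one (T, T) map per layer
    captured["resid_norm"].append(out.detach().norm(dim=-1))  # ||x_l|| per token

layers = [TinyAttn() for _ in range(4)]
handles = [l.register_forward_hook(attn_hook) for l in layers]

x = torch.randn(8, 16)
for l in layers:
    x, _ = l(x)

for h in handles:
    h.remove()

# Stack into the tracker's target shapes: (num_layers, T, T) here;
# with H heads the real capture is (num_layers, num_heads, T, T).
A = torch.stack(captured["attn"])
norms = torch.stack(captured["resid_norm"])
```

Detaching inside the hook keeps captured tensors off the autograd graph, which matters for the latency budget above.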
1.2 Tokenizer Instrumentation
- Capture BPE/SentencePiece subword splits
- Record byte length per token
- Store token IDs and text
- Identify multi-split identifiers (≥3 subwords)
Files: /backend/tokenizer_utils.py (new)
Acceptance Criteria:
- Each token has a `bpe: [subword1, subword2, ...]` field
- Byte length calculated correctly (matches `len(token.encode('utf-8'))`)
- Multi-split identifiers flagged with `multi_split: true`
Notes:
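The per-token metadata contract above can be sketched as a small builder. Assumption: the subword splits are already produced by the real BPE/SentencePiece tokenizer and passed in here.

```python
def token_metadata(text, subwords):
    """Per-token record matching the tracker's contract.

    `subwords` would come from the real tokenizer; here it is
    passed in directly (assumption: splits already computed).
    """
    return {
        "text": text,
        "bpe": subwords,
        "byte_len": len(text.encode("utf-8")),
        "multi_split": len(subwords) >= 3,  # identifiers split into >=3 pieces
    }

meta = token_metadata("sanitize_sql_like", ["san", "itize", "_sql", "_like"])
# meta["multi_split"] is True; meta["byte_len"] == 17
```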
1.3 Zarr/Memmap Storage Layer
- Implement zarr writer with chunking strategy `(layer, head)`
- Create directory structure: `runs/{run_id}/tensors/`
- Store attention, residuals, logits as zarr arrays
- Implement lazy loading for frontend access
- Implement lazy loading for frontend access
Files: /backend/storage.py (new), /backend/zarr_utils.py (new)
Acceptance Criteria:
- Zarr arrays created with correct chunking
- File size reasonable (< 500MB for 512 token generation with 32 layers)
- Load time < 50ms for single layer/head slice
- Compression ratio > 3x (use Blosc)
Notes:
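The storage layout can be sketched with the memmap half of this task (the zarr path is analogous: chunking by `(layer, head)` corresponds to `chunks=(1, 1, T, T)` with a Blosc compressor). Shapes, file names, and the run directory here are illustrative assumptions.

```python
import numpy as np
import os
import tempfile

L, H, T = 4, 8, 16
run_dir = os.path.join(tempfile.mkdtemp(), "R2025-11-01-0000-demo", "tensors")
os.makedirs(run_dir)
path = os.path.join(run_dir, "attention.f4")

# Writer: fill one (T, T) map at a time, never the full tensor in RAM.
attn = np.memmap(path, dtype="float32", mode="w+", shape=(L, H, T, T))
for l in range(L):
    for h in range(H):
        attn[l, h] = np.random.rand(T, T).astype("float32")
attn.flush()
del attn

# Lazy reader: slicing a memmap touches only the bytes for that (layer, head),
# which is what makes the < 50ms single-slice load target plausible.
ro = np.memmap(path, dtype="float32", mode="r", shape=(L, H, T, T))
head_slice = np.asarray(ro[2, 5])
```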
1.4 Minimal API Endpoint /analyze/study
- Create POST endpoint accepting prompt + generation params
- Generate Run ID (format: `R{date}-{time}-{hash}`)
- Implement deterministic generation (fixed seed)
- Return minimal data contract JSON
- Store telemetry (JSONL format)
Files: /backend/model_service.py
API Contract:
POST /analyze/study

```json
{
  "prompt": "def factorial(n):",
  "max_tokens": 50,
  "seed": 42,
  "temperature": 0.0,
  "instrumentation": ["attention", "residuals", "tokenizer"]
}
```

Response:

```json
{
  "run_id": "R2025-11-01-1430-a7f3",
  "tokens": [...],              // minimal data contract
  "tensor_path": "runs/R2025-11-01-1430-a7f3/tensors/",
  "telemetry_path": "runs/R2025-11-01-1430-a7f3/telemetry.jsonl"
}
```
Acceptance Criteria:
- Endpoint returns in < 5s for 50-token generation
- Run ID is unique and reproducible with same seed
- Telemetry JSONL created with `run.start` and `run.end` events
- Tensors stored in zarr format
Notes:
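Run ID generation per the `R{date}-{time}-{hash}` format can be sketched as below. Hashing `(prompt, seed)` makes the suffix reproducible for identical requests, while the timestamp keeps distinct runs unique; the exact hash inputs are an assumption, not a fixed contract.

```python
import hashlib
from datetime import datetime

def make_run_id(prompt, seed, now=None):
    """Run ID in R{date}-{time}-{hash} format.

    Same (prompt, seed) -> same 4-char hash suffix; the timestamp
    distinguishes repeated runs of the same request.
    """
    now = now or datetime.now()
    digest = hashlib.sha256(f"{prompt}|{seed}".encode()).hexdigest()[:4]
    return f"R{now:%Y-%m-%d}-{now:%H%M}-{digest}"

rid = make_run_id("def factorial(n):", 42, datetime(2025, 11, 1, 14, 30))
# rid starts with "R2025-11-01-1430-" followed by 4 hex chars
```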
1.5 Attention Rollout & Head Ranking
- Implement attention rollout algorithm (Abnar & Zuidema-style)
- Rank heads by rollout contribution (top-k = 20)
- Store head rankings in Run ID metadata
Files: /backend/attention_analysis.py (new)
Acceptance Criteria:
- Rollout matrix computed efficiently (< 100ms for 512 tokens)
- Top-20 heads identified by max rollout weight
- Rankings stored in `runs/{run_id}/metadata.json`
Notes:
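The rollout computation can be sketched as below, following the standard recipe: average over heads, mix in the residual connection as `0.5*I`, re-normalize rows, and multiply layer matrices bottom to top. The head-ranking proxy at the end (max attention weight per head) is an assumption; the tracker leaves the exact ranking score open.

```python
import numpy as np

def attention_rollout(A):
    """Propagate attention through layers; A: (num_layers, num_heads, T, T)."""
    L, H, T, _ = A.shape
    rollout = np.eye(T)
    for l in range(L):
        layer = A[l].mean(axis=0)              # head-averaged (T, T)
        layer = 0.5 * layer + 0.5 * np.eye(T)  # account for the residual stream
        layer = layer / layer.sum(axis=-1, keepdims=True)
        rollout = layer @ rollout
    return rollout

A = np.random.rand(4, 8, 16, 16)
A = A / A.sum(axis=-1, keepdims=True)   # make each head row-stochastic
R = attention_rollout(A)

# Toy top-k head ranking (proxy score: max attention weight per head).
head_scores = A.max(axis=(2, 3))                       # (L, H)
flat = np.argsort(head_scores, axis=None)[::-1][:20]
top_heads = [divmod(int(i), head_scores.shape[1]) for i in flat]  # (layer, head)
```

Since each layer matrix is row-stochastic, rows of `R` stay probability distributions over source tokens.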
Week 1-2 Acceptance Criteria (Overall)
- All 5 tasks completed
- Latency < 250ms for ≤512 tokens (measured end-to-end)
- Zarr storage working correctly (can reload tensors)
- API endpoint functional (manual test via curl/Postman)
- Run ID reproducibility verified (same seed → same output)
Blockers
- None yet
Decisions Made
- 2025-11-01: Using zarr instead of HDF5 for better chunking and parallel access.
Week 3: Attention Visualization
Goal: Build interactive attention heatmap, head grid, and rollout toggle.
Status: 🔴 Not Started
Tasks
3.1 Frontend: Attention Heatmap (WebGL)
- Create
/components/study/AttentionVisualization.tsx - Implement WebGL-based heatmap for performance
- Add hover tooltips showing exact attention weights
- Support aggregated (all heads) and per-head views
Files: /components/study/AttentionVisualization.tsx
Acceptance Criteria:
- Renders 512x512 heatmap in < 100ms
- Hover shows source token, target token, weight
- Toggle between aggregated and per-head
Notes:
3.2 Frontend: Head Grid (Layer × Head Matrix)
- Display Layer × Head grid with mini-sparklines
- Show mean attention to token classes (identifiers, operators, etc.)
- Click head → overlay on main heatmap
Files: /components/study/HeadGrid.tsx
Acceptance Criteria:
- Grid renders 32×32 cells in < 50ms
- Sparklines show attention distribution
- Click interaction works smoothly
Notes:
3.3 Attention Rollout Toggle
- Add toggle button: Raw Attention vs Rollout
- Fetch rollout data from backend
- Update heatmap dynamically
Files: /components/study/AttentionVisualization.tsx
Acceptance Criteria:
- Toggle switches view in < 100ms
- Rollout data fetched lazily (not on initial load)
Notes:
3.4 Interactions: Brush & Pin
- Implement brush selection on context tokens
- Highlight downstream tokens impacted by selection
- Add "pin" button to save source→target pair for ablation
Files: /components/study/AttentionVisualization.tsx
Acceptance Criteria:
- Brush selection responsive (< 50ms)
- Pinned pairs visible in sidebar
- Pin data passed to Ablation pane
Notes:
3.5 Disclaimer & Warnings
- Add text: "Attention is descriptive; causal claims require ablation"
- Warn if temperature > 1.2 or top-k sampling active
Files: /components/study/AttentionVisualization.tsx
Acceptance Criteria:
- Disclaimer visible at top of pane
- Warnings shown contextually
Notes:
Week 3 Acceptance Criteria (Overall)
- Attention visualization fully functional
- Interactive latency < 150ms for all operations
- Cross-links to Ablation pane working
- Manual test with Code Llama 7B (50-token generation)
Blockers
Decisions Made
Week 4: Token Size & Confidence Visualization
Goal: Build token chip bar, entropy sparkline, and risk hotspot flags.
Status: 🔴 Not Started
Tasks
4.1 Frontend: Token Chip Bar
- Create
/components/study/TokenConfidenceView.tsx - Render tokens as chips: width = byte length, opacity = confidence
- Add click handler to show tokenization + top-k alternatives
Files: /components/study/TokenConfidenceView.tsx
Acceptance Criteria:
- Chips render correctly with variable widths
- Opacity maps to confidence (1 - entropy or exp(logprob))
- Click shows detailed panel
Notes:
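The opacity mapping can be sketched with both options named above. Assumption: entropy is normalized by `log(k)` to keep the value in [0, 1], since raw nats can exceed 1.

```python
import math

def token_confidence(topk_probs, mode="logprob"):
    """Map a token's top-k distribution to a [0, 1] opacity.

    "logprob": exp(logprob) of the sampled token, i.e. its probability
    (assumed to be first in the list). "entropy": 1 - normalized entropy.
    """
    if mode == "logprob":
        return topk_probs[0]
    entropy = -sum(p * math.log(p) for p in topk_probs if p > 0)
    return 1.0 - entropy / math.log(len(topk_probs))

confident = token_confidence([0.9, 0.05, 0.05])            # high opacity
uncertain = token_confidence([0.34, 0.33, 0.33], mode="entropy")  # near zero
```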
4.2 Frontend: Entropy Sparkline
- Add sparkline above/below token bar showing entropy per token
- Highlight peaks (entropy ≥ τ_H, initially 1.5 nats)
- Add calibration toggle (show thresholds for keywords/identifiers/operators)
Files: /components/study/TokenConfidenceView.tsx
Acceptance Criteria:
- Sparkline renders in < 50ms
- Peaks clearly visible
- Threshold adjustable via slider
Notes:
4.3 Risk Hotspot Flags
- Flag identifiers split into ≥3 subwords AND entropy peak
- Display flag icon on token chips
- Compute Bug-risk AUC (requires ground truth bug locations)
Files: /components/study/TokenConfidenceView.tsx, /backend/risk_analysis.py (new)
Acceptance Criteria:
- Flags appear on relevant tokens
- AUC metric computed (requires pilot data)
Notes:
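The hotspot predicate combines both signals above (token dict fields follow the tracker's contract; the sample tokens are made up for illustration):

```python
TAU_H = 1.5  # entropy threshold in nats (initial value from the tracker)

def is_risk_hotspot(token):
    """Flag: identifier split into >=3 subwords AND an entropy peak."""
    return len(token["bpe"]) >= 3 and token["entropy"] >= TAU_H

flags = [is_risk_hotspot(t) for t in [
    {"bpe": ["san", "itize", "_sql"], "entropy": 2.1},  # split + peak -> flag
    {"bpe": ["def"], "entropy": 2.1},                   # peak only -> no flag
    {"bpe": ["rev", "erse", "_str"], "entropy": 0.3},   # split only -> no flag
]]
# flags == [True, False, False]
```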
4.4 Top-k Alternatives Panel
- Show top-k alternatives with probabilities on token click
- Display attention snippet (which context tokens justified each alternative)
Files: /components/study/TokenConfidenceView.tsx
Acceptance Criteria:
- Panel shows top-3 alternatives minimum
- Attention snippet links to Attention visualization
Notes:
4.5 Cost/Latency Estimator
- Add widget showing cumulative decoding time
- Estimate API cost (tokens × price per token)
Files: /components/study/TokenConfidenceView.tsx
Acceptance Criteria:
- Time displayed in ms
- Cost displayed in USD (or N/A for local)
Notes:
Week 4 Acceptance Criteria (Overall)
- Token Size & Confidence view functional
- Risk hotspots flagged correctly
- Interactive latency < 150ms
- Manual test with Code Llama 7B
Blockers
Decisions Made
Week 5: Ablation Visualization
Goal: Build interactive ablation controls with head toggles, layer bypass, and diff viewer.
Status: 🔴 Not Started
Tasks
5.1 Backend: Ablation Engine
- Implement head masking (zero out or uniform attention)
- Implement layer bypass (skip layer, pass residual through)
- Support token constraints (force/ban specific tokens)
- Add surrogate regressor for predicted Δlog-prob
Files: /backend/ablation_engine.py (new)
Acceptance Criteria:
- Ablation runs in < 3s for single head mask
- Surrogate predictor accuracy > 70% (train on 100 samples)
- Queue system for background ablation execution
Notes:
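The two masking modes can be sketched directly on a captured attention tensor. This only edits the weights; how the real engine re-runs the forward pass with the masked head is out of scope for the sketch.

```python
import numpy as np

def mask_head(attn, layer, head, mode="zero"):
    """Ablate one head in a copy of A (L, H, T, T).

    "zero" removes the head's mixing entirely; "uniform" replaces it
    with a flat distribution over source tokens.
    """
    out = attn.copy()
    T = attn.shape[-1]
    out[layer, head] = 0.0 if mode == "zero" else 1.0 / T
    return out

A = np.random.rand(4, 8, 16, 16)
A = A / A.sum(axis=-1, keepdims=True)
ablated = mask_head(A, layer=2, head=5, mode="uniform")
# ablated[2, 5] is uniform; all other heads are untouched
```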
5.2 Frontend: Head Toggle Matrix
- Create
/components/study/AblationView.tsx - Display Layer × Head matrix with checkboxes
- Show only top-20 heads (from Week 1-2 ranking)
Files: /components/study/AblationView.tsx
Acceptance Criteria:
- Matrix renders in < 50ms
- Checkboxes responsive
- Selected heads highlighted
Notes:
5.3 Frontend: Diff Viewer
- Show unified diff between baseline and ablated output
- Highlight changed tokens (color-coded: added/removed/modified)
- Display code-aware metrics (tests passed, AST parse, lints)
Files: /components/study/AblationView.tsx
Acceptance Criteria:
- Diff renders clearly
- Metrics displayed prominently
- Color-coding accessible (colorblind-friendly)
Notes:
5.4 Frontend: Per-Token Delta Heat
- Show Δlog-prob and Δentropy per token
- Display as small multiples for most-impactful heads
Files: /components/study/AblationView.tsx
Acceptance Criteria:
- Delta heat visible
- Most-impactful heads identified (Δlog-prob ≥ τ_Δ)
Notes:
5.5 Integration with Attention View
- Accept pinned source→target pairs from Attention view
- Auto-suggest heads to ablate based on attention weights
Files: /components/study/AblationView.tsx
Acceptance Criteria:
- Pinned pairs appear in Ablation pane
- Suggested heads shown with explanation
Notes:
Week 5 Acceptance Criteria (Overall)
- Ablation view functional
- Head masking works correctly (verified with manual test)
- Diff viewer shows meaningful changes
- Code-aware metrics computed (AST, tests, lints)
Blockers
Decisions Made
Week 6: Pipeline Visualization
Goal: Build swimlane timeline with residual-z, entropy shift, and layer signals.
Status: 🔴 Not Started
Tasks
6.1 Backend: Layer-Level Signals
- Compute residual-norm z-scores
- Compute entropy shift (pre vs post-layer)
- Compute attention-flow saturation
- Optional: router load for MoE models
Files: /backend/pipeline_analysis.py (new)
Acceptance Criteria:
- Signals computed in < 50ms
- Residual-z outliers flagged (> 2σ)
- Entropy shifts tracked per layer
Notes:
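The first two signals can be sketched as below. Assumption on input layout: `resid_norms` is `(num_layers, T)` and `entropies` holds mean entropy before each layer plus after the last one, so `np.diff` gives one shift per layer.

```python
import numpy as np

def layer_signals(resid_norms, entropies):
    """Residual-norm z-scores and per-layer entropy shift."""
    mu, sigma = resid_norms.mean(), resid_norms.std()
    z = (resid_norms - mu) / sigma     # flag outliers where |z| > 2
    shift = np.diff(entropies)         # entropy change across each layer
    return z, shift

norms = np.abs(np.random.randn(4, 16)) + 1.0
ent = np.array([2.0, 1.8, 1.1, 1.0, 0.4])
z, shift = layer_signals(norms, ent)
outliers = np.argwhere(np.abs(z) > 2.0)   # (layer, token) positions to flag
```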
6.2 Frontend: Swimlane Timeline
- Create
/components/study/PipelineView.tsx - Display lanes: Tokenizer → Embeddings → Layers → Logits → Sampler → Tests
- Rectangle length = time per stage
- Color intensity = uncertainty (entropy)
Files: /components/study/PipelineView.tsx
Acceptance Criteria:
- Swimlane renders in < 100ms
- Hover shows per-stage stats
- Timeline scrubber works smoothly
Notes:
6.3 Layer Signal Overlays
- Add overlays for residual-z, entropy shift, attention saturation
- Toggle visibility of each signal
- Highlight bottlenecks (top-q percentile of latency/residual-z)
Files: /components/study/PipelineView.tsx
Acceptance Criteria:
- Overlays don't clutter visualization
- Bottlenecks clearly marked
- Toggle responsive
Notes:
6.4 Layer Bypass Interaction
- Add controls to bypass ≤2 layers
- Show predicted impact (via surrogate)
- Execute queued ablation
Files: /components/study/PipelineView.tsx
Acceptance Criteria:
- Bypass controls accessible
- Predicted impact shown before execution
- Ablation queued in background
Notes:
6.5 Cross-Links to Other Views
- Click token → highlight in Attention and Token Confidence views
- Integrated telemetry (track hover/click events)
Files: /components/study/PipelineView.tsx
Acceptance Criteria:
- Cross-highlighting works
- Telemetry logged
Notes:
Week 6 Acceptance Criteria (Overall)
- Pipeline view functional
- Layer signals computed correctly
- Interactive latency < 150ms
- Manual test with Code Llama 7B
Blockers
Decisions Made
Week 7: Pilot Study (n=3)
Goal: Run pilot with 3 participants; tune thresholds; validate latency; gather feedback.
Status: 🔴 Not Started
Tasks
7.1 Recruit Pilot Participants
- Identify 3 software engineers (varied experience levels)
- Schedule 90-minute sessions
Acceptance Criteria:
- 3 participants confirmed
- Availability scheduled
Notes:
7.2 Prepare Study Materials
- Task T1: Code completion (sanitize_sql_like)
- Task T2: Bug fix (reverse_string)
- Pre-survey (demographics, LLM familiarity)
- Post-task mini-survey (SCS, Trust, NASA-TLX)
- Interview questions
Files: /docs/pilot-study-materials.md (new)
Acceptance Criteria:
- Materials ready to distribute
- Survey forms created (Google Forms or similar)
Notes:
7.3 Run Pilot Sessions
- Session 1: Participant P01
- Session 2: Participant P02
- Session 3: Participant P03
Acceptance Criteria:
- All 3 sessions completed
- Telemetry logged
- Surveys completed
Notes:
7.4 Analyze Pilot Data & Tune Thresholds
- Compute latency statistics (mean, p95)
- Tune τ_H (entropy threshold) for ~90% specificity
- Tune τ_Δ (log-prob delta) for ablation sensitivity
- Tune τ_z (residual-norm outlier)
Files: /docs/pilot-analysis.md (new)
Acceptance Criteria:
- Thresholds tuned based on pilot data
- Latency < 250ms (if not, optimize)
- Survey completion rate ≥ 90%
Notes:
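The τ_H tuning step can be sketched as a percentile over clean (non-bug) tokens: specificity is the fraction of clean tokens NOT flagged, so ~90% specificity puts τ_H at the 90th percentile of their entropy. The gamma-distributed stand-in data is an assumption; the real input is pilot telemetry plus ground-truth bug labels.

```python
import numpy as np

def tune_tau_h(entropy_clean, specificity=0.90):
    """Pick tau_H so ~90% of non-bug tokens fall below it."""
    return float(np.percentile(entropy_clean, specificity * 100))

clean = np.random.default_rng(0).gamma(2.0, 0.5, size=1000)  # stand-in pilot data
tau_h = tune_tau_h(clean)
flagged = (clean >= tau_h).mean()   # ~0.10 of clean tokens flagged
```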
7.5 Iterate on UX
- Add tooltips/warnings based on pilot feedback
- Fix any UX issues (confusing interactions, unclear labels)
- Update documentation
Acceptance Criteria:
- At least 2 UX improvements implemented
- Pilot participants' feedback documented
Notes:
Week 7 Acceptance Criteria (Overall)
- Pilot study completed successfully
- Thresholds tuned
- Latency validated (< 250ms)
- UX improvements identified and implemented
Blockers
Decisions Made
Week 8: Main Study Preparation
Goal: Finalize study tooling, prepare OSF pre-registration, and set up participant recruitment.
Status: 🔴 Not Started
Tasks
8.1 Survey Integration
- Integrate SUS, NASA-TLX, SCS scales into dashboard
- Add pre-survey and post-task mini-surveys
- Export survey data to CSV
Files: /components/study/SurveyModal.tsx (new)
Acceptance Criteria:
- Surveys embedded in dashboard
- Data exported correctly
Notes:
8.2 Latin Square Counterbalancing
- Implement Latin square assignment for task order
- Randomize condition order (Baseline vs Dashboard)
Files: /lib/study-randomization.ts (new)
Acceptance Criteria:
- Counterbalancing correct (verified manually)
- Participant assigned random ID (P01-P24)
Notes:
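A cyclic Latin square covers the counterbalancing requirement: each ordering index appears exactly once per row and per column, so task/condition order is balanced across participants (rows = participants mod n). Mapping of indices to concrete task×condition orderings is left to the study script.

```python
def latin_square(n):
    """n x n cyclic Latin square: row i is 0..n-1 rotated by i."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]

# 2 tasks x 2 conditions -> 4 orderings; participant p gets row p % 4.
square = latin_square(4)
```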
8.3 OSF Pre-Registration
- Complete OSF template (Appendix D from spec)
- Upload task stimuli, exclusion criteria
- Submit pre-registration
Files: /docs/osf-preregistration.md (copy of Appendix D)
Acceptance Criteria:
- Pre-registration submitted before main study
- DOI obtained
Notes:
8.4 Export Artifact Bundle
- Create script to package Run ID, tensors, telemetry
- Generate
run_pack_P01.zipfor each participant - Test import into OSF
Files: /scripts/export_artifact.py (new)
Acceptance Criteria:
- Export script functional
- Bundle includes all necessary files
- Bundle < 100MB per participant
Notes:
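The packaging script can be sketched with the stdlib `zipfile` module. The directory layout follows the tracker (`runs/{run_id}/...`); the file names inside the toy run are illustrative assumptions.

```python
import json
import os
import tempfile
import zipfile

def export_run_pack(run_dir, pid, out_dir):
    """Zip one participant's run directory into run_pack_{pid}.zip."""
    bundle = os.path.join(out_dir, f"run_pack_{pid}.zip")
    with zipfile.ZipFile(bundle, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(run_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the run root for clean import.
                zf.write(full, arcname=os.path.relpath(full, run_dir))
    return bundle

# Build a toy run directory and pack it.
tmp = tempfile.mkdtemp()
run = os.path.join(tmp, "R2025-11-01-1430-a7f3")
os.makedirs(os.path.join(run, "tensors"))
with open(os.path.join(run, "telemetry.jsonl"), "w") as f:
    f.write(json.dumps({"event": "run.start"}) + "\n")
with open(os.path.join(run, "tensors", "attention.f4"), "wb") as f:
    f.write(b"\x00" * 64)

pack = export_run_pack(run, "P01", tmp)
names = sorted(zipfile.ZipFile(pack).namelist())
```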
8.5 Participant Recruitment
- Prepare recruitment email
- Post to developer communities (Reddit, HackerNews, university mailing lists)
- Target n=18-24 participants
Acceptance Criteria:
- Recruitment materials ready
- At least 10 participants confirmed
Notes:
Week 8 Acceptance Criteria (Overall)
- Study tooling finalized
- OSF pre-registration submitted
- Participant recruitment underway
- Ready to begin main study (Week 9-10)
Blockers
Decisions Made
Progress Summary
| Week | Status | Completion Date | Notes |
|---|---|---|---|
| Week 1-2: Instrumentation | 🟡 In Progress | - | Started 2025-11-01 |
| Week 3: Attention Viz | 🔴 Not Started | - | - |
| Week 4: Token Confidence Viz | 🔴 Not Started | - | - |
| Week 5: Ablation Viz | 🔴 Not Started | - | - |
| Week 6: Pipeline Viz | 🔴 Not Started | - | - |
| Week 7: Pilot Study | 🔴 Not Started | - | - |
| Week 8: Main Study Prep | 🔴 Not Started | - | - |
Legend:
- 🟢 Completed
- 🟡 In Progress
- 🔴 Not Started
- 🔵 Blocked
Global Blockers
None currently
Key Metrics (Target vs Actual)
| Metric | Target | Actual | Status |
|---|---|---|---|
| Initial render latency (≤512 tokens) | < 250ms | - | - |
| Interactive update latency | < 150ms | - | - |
| Zarr file size (512 tokens, 32 layers) | < 500MB | - | - |
| Zarr load time (single layer/head) | < 50ms | - | - |
| Attention rollout computation | < 100ms | - | - |
| Ablation execution time | < 3s | - | - |
Notes & Decisions Log
2025-11-01
- Decision: Using zarr instead of HDF5 for tensor storage due to better chunking and parallel access.
- Decision: Targeting top-k=20 heads for ablation UI (performance constraint).
- Note: Started Week 1-2 instrumentation tasks.
End of Implementation Tracker