# Implementation Tracker: Glass-Box Dashboard

**Project:** PhD Study - Making Architecture Transparent for Code Generation
**Timeline:** 8 weeks (November 2025 - December 2025)
**Status:** Week 1 - In Progress
**Last Updated:** 2025-11-01

---

## Overview

This document tracks progress through the 8-week implementation plan outlined in the PhD Study Specification. Each week has specific deliverables, acceptance criteria, and links to relevant code/files.

---

## Week 1-2: Core Model Instrumentation

**Goal:** Implement PyTorch hooks, tokenizer instrumentation, zarr storage, and minimal API endpoint.

**Status:** 🟡 In Progress

### Tasks

#### 1.1 PyTorch Hooks for Attention & Residuals
- [ ] Add forward hooks to capture attention tensors `A[L,H,T,T]`
- [ ] Capture residual norms `||x_l||` per layer
- [ ] Capture logits, logprobs, entropy per token
- [ ] Record timing per layer (latency profiling)
- [ ] Optional: FFN activations for future SAE integration

**Files:** `/backend/model_service.py`, `/backend/instrumentation.py` (new)

**Acceptance Criteria:**
- Attention tensors stored with shape (num_layers, num_heads, seq_len, seq_len)
- Residual norms array with shape (num_layers, seq_len)
- Per-token metadata includes logprob, entropy, timing
- Instrumentation overhead per layer < 10ms on average

**Notes:**
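A minimal sketch of the per-token logprob and entropy computation listed above, assuming the hook hands over one step's raw logits as a plain list of floats. The function name `token_stats` is hypothetical; a real implementation would operate on tensors inside `/backend/instrumentation.py`.

```python
import math


def token_stats(logits, chosen_id):
    """Return (logprob of the chosen token, entropy in nats) for one decoding step."""
    m = max(logits)
    # Log-sum-exp with the max subtracted for numerical stability.
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    logprobs = [x - log_z for x in logits]
    # Shannon entropy H = -sum(p * log p), with p = exp(logprob).
    entropy = -sum(math.exp(lp) * lp for lp in logprobs)
    return logprobs[chosen_id], entropy


# Toy logits over a 3-token vocabulary (illustrative values only).
logprob, entropy = token_stats([2.0, 1.0, 0.1], chosen_id=0)
```

Entropy is reported in nats here so that it is directly comparable with the τ_H threshold used in Week 4.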

---

#### 1.2 Tokenizer Instrumentation
- [ ] Capture BPE/SentencePiece subword splits
- [ ] Record byte length per token
- [ ] Store token IDs and text
- [ ] Identify multi-split identifiers (≥3 subwords)

**Files:** `/backend/tokenizer_utils.py` (new)

**Acceptance Criteria:**
- Each token has `bpe: [subword1, subword2, ...]` field
- Byte length calculated correctly (matches `len(token.encode('utf-8'))`)
- Multi-split identifiers flagged with `multi_split: true`

**Notes:**
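A possible shape for the per-token record, sketched under the assumption that subword pieces and IDs have already been obtained from the tokenizer. The helper name `token_record`, the `text` and `ids` keys, and the identifier regex are illustrative choices; only `bpe` and `multi_split` come from the criteria above.

```python
import re

# Assumption: an "identifier" is a Python-style name.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def token_record(text, subwords, token_ids):
    """Build one record with subword splits, byte length, and the multi-split flag."""
    return {
        "text": text,
        "ids": list(token_ids),
        "bpe": list(subwords),
        "byte_length": len(text.encode("utf-8")),
        # Flag identifiers that were split into three or more subwords.
        "multi_split": bool(IDENTIFIER.match(text)) and len(subwords) >= 3,
    }


# Placeholder split and IDs, not real tokenizer output.
record = token_record("sanitize_sql_like", ["san", "itize", "_sql", "_like"], [1, 2, 3, 4])
```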

---

#### 1.3 Zarr/Memmap Storage Layer
- [ ] Implement zarr writer with chunking strategy `(layer, head)`
- [ ] Create directory structure: `runs/{run_id}/tensors/`
- [ ] Store attention, residuals, logits as zarr arrays
- [ ] Implement lazy loading for frontend access

**Files:** `/backend/storage.py` (new), `/backend/zarr_utils.py` (new)

**Acceptance Criteria:**
- Zarr arrays created with correct chunking
- File size reasonable (< 500MB for 512 token generation with 32 layers)
- Load time < 50ms for single layer/head slice
- Compression ratio > 3x (use Blosc)

**Notes:**
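A quick size estimate helps sanity-check the 500MB and 3x compression targets. The arithmetic below assumes 32 heads per layer (matching the 32×32 grid in task 3.2) and that the full `T × T` attention matrix is stored for every head; both are assumptions, not confirmed settings.

```python
def attention_bytes(num_layers, num_heads, seq_len, bytes_per_value):
    """Uncompressed size of a (layers, heads, T, T) attention tensor."""
    return num_layers * num_heads * seq_len * seq_len * bytes_per_value


MIB = 1024 * 1024

raw_fp16 = attention_bytes(32, 32, 512, 2)   # 536,870,912 bytes = 512 MiB
raw_fp32 = attention_bytes(32, 32, 512, 4)   # 1,073,741,824 bytes = 1024 MiB
chunk_fp16 = attention_bytes(1, 1, 512, 2)   # one (layer, head) chunk = 524,288 bytes

# Size after the targeted 3x compression ratio.
compressed_fp16_mib = raw_fp16 / 3 / MIB
compressed_fp32_mib = raw_fp32 / 3 / MIB
```

Under these assumptions, uncompressed float16 attention alone (about 537 MB) slightly exceeds the 500MB budget, so the target depends on achieving the compression ratio; float32 storage would need the full 3x ratio to fit (about 358 MB).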

---

#### 1.4 Minimal API Endpoint `/analyze/study`
- [ ] Create POST endpoint accepting prompt + generation params
- [ ] Generate Run ID (format: `R{date}-{time}-{hash}`)
- [ ] Implement deterministic generation (fixed seed)
- [ ] Return minimal data contract JSON
- [ ] Store telemetry (JSONL format)

**Files:** `/backend/model_service.py`

**API Contract:**

Request (`POST /analyze/study`):
```json
{
  "prompt": "def factorial(n):",
  "max_tokens": 50,
  "seed": 42,
  "temperature": 0.0,
  "instrumentation": ["attention", "residuals", "tokenizer"]
}
```

Response (`tokens` follows the minimal data contract):
```json
{
  "run_id": "R2025-11-01-1430-a7f3",
  "tokens": [...],
  "tensor_path": "runs/R2025-11-01-1430-a7f3/tensors/",
  "telemetry_path": "runs/R2025-11-01-1430-a7f3/telemetry.jsonl"
}
```

**Acceptance Criteria:**
- Endpoint returns in < 5s for 50-token generation
- Run ID is unique per run; generated output is reproducible with the same seed
- Telemetry JSONL created with `run.start` and `run.end` events
- Tensors stored in zarr format

**Notes:**
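One way the `R{date}-{time}-{hash}` format and the JSONL telemetry events could be produced is sketched below. Hashing the seed and prompt with SHA-256 and keeping four hex characters is an assumption chosen to match the example ID; the function names and the `run_id`/`event` keys are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_run_id(prompt, seed, now=None):
    """Build an ID like R2025-11-01-1430-a7f3 from a timestamp and a short hash."""
    if now is None:
        now = datetime.now(timezone.utc)
    digest = hashlib.sha256(f"{seed}:{prompt}".encode("utf-8")).hexdigest()[:4]
    return f"R{now:%Y-%m-%d}-{now:%H%M}-{digest}"


def telemetry_line(run_id, event, **fields):
    """Serialize one telemetry event as a single JSONL line (without the newline)."""
    return json.dumps({"run_id": run_id, "event": event, **fields}, sort_keys=True)


run_id = make_run_id("def factorial(n):", seed=42)
lines = [
    telemetry_line(run_id, "run.start", seed=42),
    telemetry_line(run_id, "run.end", num_tokens=50),
]
```

With minute-level timestamps, two runs of the same prompt and seed within one minute would collide in this sketch; a seconds field or a random suffix would be needed for strict uniqueness.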

---

#### 1.5 Attention Rollout & Head Ranking
- [ ] Implement attention rollout algorithm (Abnar & Zuidema-style)
- [ ] Rank heads by rollout contribution (top-k = 20)
- [ ] Store head rankings in Run ID metadata

**Files:** `/backend/attention_analysis.py` (new)

**Acceptance Criteria:**
- Rollout matrix computed efficiently (< 100ms for 512 tokens)
- Top-20 heads identified by max rollout weight
- Rankings stored in `runs/{run_id}/metadata.json`

**Notes:**
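For reference, a dependency-free sketch of attention rollout: each layer's head-averaged attention is mixed with the identity to account for the residual connection, rows are renormalized, and the result is multiplied through the layers. Plain nested lists are used only for illustration; the backend would presumably use tensors. How per-head contributions are derived for the top-20 ranking is not specified here and is not covered by this sketch.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]


def rollout(layers):
    """layers: list of [T][T] head-averaged attention matrices, first layer first."""
    t = len(layers[0])
    result = [[1.0 if i == j else 0.0 for j in range(t)] for i in range(t)]
    for attn in layers:
        # Mix in the identity to model the residual connection.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(t)]
                 for i in range(t)]
        # Renormalize so every row still sums to 1.
        mixed = [[v / sum(row) for v in row] for row in mixed]
        # Accumulate: rollout at this layer = mixed attention x rollout below.
        result = matmul(mixed, result)
    return result


# Two identical causal layers over two tokens (toy values).
layer = [[1.0, 0.0], [0.5, 0.5]]
rolled = rollout([layer, layer])
```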

---

### Week 1-2 Acceptance Criteria (Overall)

- [ ] All 5 tasks completed
- [ ] Initial render latency < 250ms for ≤512 tokens (measured end-to-end)
- [ ] Zarr storage working correctly (can reload tensors)
- [ ] API endpoint functional (manual test via curl/Postman)
- [ ] Run ID reproducibility verified (same seed → same output)

### Blockers

- **None yet**

### Decisions Made

- **2025-11-01:** Using zarr instead of HDF5 for better chunking and parallel access.

---

## Week 3: Attention Visualization

**Goal:** Build interactive attention heatmap, head grid, and rollout toggle.

**Status:** 🔴 Not Started

### Tasks

#### 3.1 Frontend: Attention Heatmap (WebGL)
- [ ] Create `/components/study/AttentionVisualization.tsx`
- [ ] Implement WebGL-based heatmap for performance
- [ ] Add hover tooltips showing exact attention weights
- [ ] Support aggregated (all heads) and per-head views

**Files:** `/components/study/AttentionVisualization.tsx`

**Acceptance Criteria:**
- Renders 512x512 heatmap in < 100ms
- Hover shows source token, target token, weight
- Toggle between aggregated and per-head

**Notes:**

---

#### 3.2 Frontend: Head Grid (Layer × Head Matrix)
- [ ] Display Layer × Head grid with mini-sparklines
- [ ] Show mean attention to token classes (identifiers, operators, etc.)
- [ ] Click head → overlay on main heatmap

**Files:** `/components/study/HeadGrid.tsx`

**Acceptance Criteria:**
- Grid renders 32×32 cells in < 50ms
- Sparklines show attention distribution
- Click interaction works smoothly

**Notes:**

---

#### 3.3 Attention Rollout Toggle
- [ ] Add toggle button: Raw Attention vs Rollout
- [ ] Fetch rollout data from backend
- [ ] Update heatmap dynamically

**Files:** `/components/study/AttentionVisualization.tsx`

**Acceptance Criteria:**
- Toggle switches view in < 100ms
- Rollout data fetched lazily (not on initial load)

**Notes:**

---

#### 3.4 Interactions: Brush & Pin
- [ ] Implement brush selection on context tokens
- [ ] Highlight downstream tokens impacted by selection
- [ ] Add "pin" button to save source→target pair for ablation

**Files:** `/components/study/AttentionVisualization.tsx`

**Acceptance Criteria:**
- Brush selection responsive (< 50ms)
- Pinned pairs visible in sidebar
- Pin data passed to Ablation pane

**Notes:**

---

#### 3.5 Disclaimer & Warnings
- [ ] Add text: "Attention is descriptive; causal claims require ablation"
- [ ] Warn if temperature > 1.2 or top-k sampling active

**Files:** `/components/study/AttentionVisualization.tsx`

**Acceptance Criteria:**
- Disclaimer visible at top of pane
- Warnings shown contextually

**Notes:**

---

### Week 3 Acceptance Criteria (Overall)

- [ ] Attention visualization fully functional
- [ ] Interactive latency < 150ms for all operations
- [ ] Cross-links to Ablation pane working
- [ ] Manual test with Code Llama 7B (50-token generation)

### Blockers

### Decisions Made

---

## Week 4: Token Size & Confidence Visualization

**Goal:** Build token chip bar, entropy sparkline, and risk hotspot flags.

**Status:** 🔴 Not Started

### Tasks

#### 4.1 Frontend: Token Chip Bar
- [ ] Create `/components/study/TokenConfidenceView.tsx`
- [ ] Render tokens as chips: width = byte length, opacity = confidence
- [ ] Add click handler to show tokenization + top-k alternatives

**Files:** `/components/study/TokenConfidenceView.tsx`

**Acceptance Criteria:**
- Chips render correctly with variable widths
- Opacity maps to confidence in [0, 1] (1 - normalized entropy, or exp(logprob))
- Click shows detailed panel

**Notes:**

---

#### 4.2 Frontend: Entropy Sparkline
- [ ] Add sparkline above/below token bar showing entropy per token
- [ ] Highlight peaks (entropy ≥ τ_H, initially 1.5 nats)
- [ ] Add calibration toggle (show thresholds for keywords/identifiers/operators)

**Files:** `/components/study/TokenConfidenceView.tsx`

**Acceptance Criteria:**
- Sparkline renders in < 50ms
- Peaks clearly visible
- Threshold adjustable via slider

**Notes:**

---

#### 4.3 Risk Hotspot Flags
- [ ] Flag identifiers split into ≥3 subwords AND entropy peak
- [ ] Display flag icon on token chips
- [ ] Compute Bug-risk AUC (requires ground truth bug locations)

**Files:** `/components/study/TokenConfidenceView.tsx`, `/backend/risk_analysis.py` (new)

**Acceptance Criteria:**
- Flags appear on relevant tokens
- AUC metric computed (requires pilot data)

**Notes:**
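A small sketch of the flagging rule and the AUC metric, assuming each token record carries the `multi_split` flag from task 1.2 plus an `entropy` value in nats (the `entropy` key is an assumed name). The AUC is computed with the pairwise (Mann-Whitney) formulation and needs ground-truth bug labels, which are not available until the pilot.

```python
TAU_H = 1.5  # nats; initial entropy threshold from task 4.2


def is_risk_hotspot(token, tau_h=TAU_H):
    """Flag a token that is a multi-split identifier AND sits on an entropy peak."""
    return bool(token["multi_split"]) and token["entropy"] >= tau_h


def auc(scores, labels):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


flagged = is_risk_hotspot({"multi_split": True, "entropy": 1.8})
```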

---

#### 4.4 Top-k Alternatives Panel
- [ ] Show top-k alternatives with probabilities on token click
- [ ] Display attention snippet (which context tokens justified each alternative)

**Files:** `/components/study/TokenConfidenceView.tsx`

**Acceptance Criteria:**
- Panel shows top-3 alternatives minimum
- Attention snippet links to Attention visualization

**Notes:**

---

#### 4.5 Cost/Latency Estimator
- [ ] Add widget showing cumulative decoding time
- [ ] Estimate API cost (tokens × price per token)

**Files:** `/components/study/TokenConfidenceView.tsx`

**Acceptance Criteria:**
- Time displayed in ms
- Cost displayed in USD (or N/A for local)

**Notes:**

---

### Week 4 Acceptance Criteria (Overall)

- [ ] Token Size & Confidence view functional
- [ ] Risk hotspots flagged correctly
- [ ] Interactive latency < 150ms
- [ ] Manual test with Code Llama 7B

### Blockers

### Decisions Made

---

## Week 5: Ablation Visualization

**Goal:** Build interactive ablation controls with head toggles, layer bypass, and diff viewer.

**Status:** 🔴 Not Started

### Tasks

#### 5.1 Backend: Ablation Engine
- [ ] Implement head masking (zero out or uniform attention)
- [ ] Implement layer bypass (skip layer, pass residual through)
- [ ] Support token constraints (force/ban specific tokens)
- [ ] Add surrogate regressor for predicted Δlog-prob

**Files:** `/backend/ablation_engine.py` (new)

**Acceptance Criteria:**
- Ablation runs in < 3s for single head mask
- Surrogate predictor accuracy > 70% (train on 100 samples)
- Queue system for background ablation execution

**Notes:**
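The two ablation primitives can be illustrated without a model, as below. This is a toy sketch on nested lists: `mask_head` rewrites one head's causal attention matrix, and `forward` treats each layer as a function returning an update that is added to the residual stream. Both names are hypothetical, and a real engine would apply the same ideas inside the model's forward pass.

```python
def mask_head(attn, mode="zero"):
    """Return an ablated copy of one head's [T][T] causal attention matrix."""
    t = len(attn)
    if mode == "zero":
        # All-zero weights make the head's value-weighted output zero.
        return [[0.0] * t for _ in range(t)]
    if mode == "uniform":
        # Query position i attends equally to positions 0..i.
        return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(t)] for i in range(t)]
    raise ValueError(f"unknown mode: {mode}")


def forward(x, layers, bypass=frozenset()):
    """Run residual layers over vector x, skipping any layer index in `bypass`."""
    for idx, layer in enumerate(layers):
        if idx in bypass:
            continue  # the residual stream passes through unchanged
        x = [xi + di for xi, di in zip(x, layer(x))]
    return x


uniform = mask_head([[1.0, 0.0], [0.3, 0.7]], mode="uniform")
```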

---

#### 5.2 Frontend: Head Toggle Matrix
- [ ] Create `/components/study/AblationView.tsx`
- [ ] Display Layer × Head matrix with checkboxes
- [ ] Show only top-20 heads (from Week 1-2 ranking)

**Files:** `/components/study/AblationView.tsx`

**Acceptance Criteria:**
- Matrix renders in < 50ms
- Checkboxes responsive
- Selected heads highlighted

**Notes:**

---

#### 5.3 Frontend: Diff Viewer
- [ ] Show unified diff between baseline and ablated output
- [ ] Highlight changed tokens (color-coded: added/removed/modified)
- [ ] Display code-aware metrics (tests passed, AST parse, lints)

**Files:** `/components/study/AblationView.tsx`

**Acceptance Criteria:**
- Diff renders clearly
- Metrics displayed prominently
- Color-coding accessible (colorblind-friendly)

**Notes:**
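On the data side, the unified diff and the AST-parse metric could be computed with the standard library, as in this sketch. It assumes the generated code is Python and that the diff is produced server-side and rendered by the frontend; the function names are illustrative.

```python
import ast
import difflib


def unified(baseline, ablated):
    """Unified diff between baseline and ablated generations."""
    return "".join(difflib.unified_diff(
        baseline.splitlines(keepends=True),
        ablated.splitlines(keepends=True),
        fromfile="baseline",
        tofile="ablated",
    ))


def parses(code):
    """Code-aware metric: does the output parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


baseline = "def f(n):\n    return n\n"
ablated = "def f(n):\n    return n +\n"
diff_text = unified(baseline, ablated)
metrics = {"baseline_parses": parses(baseline), "ablated_parses": parses(ablated)}
```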

---

#### 5.4 Frontend: Per-Token Delta Heat
- [ ] Show Δlog-prob and Δentropy per token
- [ ] Display as small multiples for most-impactful heads

**Files:** `/components/study/AblationView.tsx`

**Acceptance Criteria:**
- Delta heat visible
- Most-impactful heads identified (Δlog-prob ≥ τ_Δ)

**Notes:**

---

#### 5.5 Integration with Attention View
- [ ] Accept pinned source→target pairs from Attention view
- [ ] Auto-suggest heads to ablate based on attention weights

**Files:** `/components/study/AblationView.tsx`

**Acceptance Criteria:**
- Pinned pairs appear in Ablation pane
- Suggested heads shown with explanation

**Notes:**

---

### Week 5 Acceptance Criteria (Overall)

- [ ] Ablation view functional
- [ ] Head masking works correctly (verified with manual test)
- [ ] Diff viewer shows meaningful changes
- [ ] Code-aware metrics computed (AST, tests, lints)

### Blockers

### Decisions Made

---

## Week 6: Pipeline Visualization

**Goal:** Build swimlane timeline with residual-z, entropy shift, and layer signals.

**Status:** 🔴 Not Started

### Tasks

#### 6.1 Backend: Layer-Level Signals
- [ ] Compute residual-norm z-scores
- [ ] Compute entropy shift (pre vs post-layer)
- [ ] Compute attention-flow saturation
- [ ] Optional: router load for MoE models

**Files:** `/backend/pipeline_analysis.py` (new)

**Acceptance Criteria:**
- Signals computed in < 50ms
- Residual-z outliers flagged (|z| > 2)
- Entropy shifts tracked per layer

**Notes:**
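A minimal sketch of the residual-norm z-score and outlier flag, assuming the z-score is taken across layers for a single token position and that outliers are flagged on absolute value. Both choices are assumptions, since the normalization axis is not fixed above.

```python
import statistics


def residual_z(norms):
    """Z-score each layer's residual norm against the mean over all layers."""
    mean = statistics.fmean(norms)
    sd = statistics.pstdev(norms)
    if sd == 0:
        return [0.0] * len(norms)  # constant norms: nothing stands out
    return [(v - mean) / sd for v in norms]


def outlier_layers(norms, tau_z=2.0):
    """Indices of layers whose residual-norm z-score exceeds tau_z in magnitude."""
    return [i for i, z in enumerate(residual_z(norms)) if abs(z) > tau_z]


# Toy norms for ten layers; the last layer is an obvious spike.
flagged_layers = outlier_layers([1.0] * 9 + [10.0])
```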

---

#### 6.2 Frontend: Swimlane Timeline
- [ ] Create `/components/study/PipelineView.tsx`
- [ ] Display lanes: Tokenizer → Embeddings → Layers → Logits → Sampler → Tests
- [ ] Rectangle length = time per stage
- [ ] Color intensity = uncertainty (entropy)

**Files:** `/components/study/PipelineView.tsx`

**Acceptance Criteria:**
- Swimlane renders in < 100ms
- Hover shows per-stage stats
- Timeline scrubber works smoothly

**Notes:**

---

#### 6.3 Layer Signal Overlays
- [ ] Add overlays for residual-z, entropy shift, attention saturation
- [ ] Toggle visibility of each signal
- [ ] Highlight bottlenecks (top-q percentile of latency/residual-z)

**Files:** `/components/study/PipelineView.tsx`

**Acceptance Criteria:**
- Overlays don't clutter visualization
- Bottlenecks clearly marked
- Toggle responsive

**Notes:**

---

#### 6.4 Layer Bypass Interaction
- [ ] Add controls to bypass ≤2 layers
- [ ] Show predicted impact (via surrogate)
- [ ] Execute queued ablation

**Files:** `/components/study/PipelineView.tsx`

**Acceptance Criteria:**
- Bypass controls accessible
- Predicted impact shown before execution
- Ablation queued in background

**Notes:**

---

#### 6.5 Cross-Links to Other Views
- [ ] Click token → highlight in Attention and Token Confidence views
- [ ] Integrated telemetry (track hover/click events)

**Files:** `/components/study/PipelineView.tsx`

**Acceptance Criteria:**
- Cross-highlighting works
- Telemetry logged

**Notes:**

---

### Week 6 Acceptance Criteria (Overall)

- [ ] Pipeline view functional
- [ ] Layer signals computed correctly
- [ ] Interactive latency < 150ms
- [ ] Manual test with Code Llama 7B

### Blockers

### Decisions Made

---

## Week 7: Pilot Study (n=3)

**Goal:** Run pilot with 3 participants; tune thresholds; validate latency; gather feedback.

**Status:** 🔴 Not Started

### Tasks

#### 7.1 Recruit Pilot Participants
- [ ] Identify 3 software engineers (varied experience levels)
- [ ] Schedule 90-minute sessions

**Acceptance Criteria:**
- 3 participants confirmed
- Availability scheduled

**Notes:**

---

#### 7.2 Prepare Study Materials
- [ ] Task T1: Code completion (sanitize_sql_like)
- [ ] Task T2: Bug fix (reverse_string)
- [ ] Pre-survey (demographics, LLM familiarity)
- [ ] Post-task mini-survey (SCS, Trust, NASA-TLX)
- [ ] Interview questions

**Files:** `/docs/pilot-study-materials.md` (new)

**Acceptance Criteria:**
- Materials ready to distribute
- Survey forms created (Google Forms or similar)

**Notes:**

---

#### 7.3 Run Pilot Sessions
- [ ] Session 1: Participant P01
- [ ] Session 2: Participant P02
- [ ] Session 3: Participant P03

**Acceptance Criteria:**
- All 3 sessions completed
- Telemetry logged
- Surveys completed

**Notes:**

---

#### 7.4 Analyze Pilot Data & Tune Thresholds
- [ ] Compute latency statistics (mean, p95)
- [ ] Tune τ_H (entropy threshold) for ~90% specificity
- [ ] Tune τ_Δ (log-prob delta) for ablation sensitivity
- [ ] Tune τ_z (residual-norm outlier)

**Files:** `/docs/pilot-analysis.md` (new)

**Acceptance Criteria:**
- Thresholds tuned based on pilot data
- Latency < 250ms (if not, optimize)
- Survey completion rate ≥ 90%

**Notes:**
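The latency statistics and the specificity-based choice of τ_H can be sketched as follows. The p95 uses the nearest-rank definition, and the threshold search assumes a token is flagged when its entropy is ≥ τ and that `negative_scores` are entropies of tokens known not to be bug locations; that labeling depends on pilot ground truth and is assumed here.

```python
import math
import statistics


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def specificity(negative_scores, tau):
    """Fraction of negatives left unflagged when flagging score >= tau."""
    return sum(s < tau for s in negative_scores) / len(negative_scores)


def tune_threshold(negative_scores, target=0.90):
    """Smallest observed score usable as tau while keeping specificity >= target."""
    for tau in sorted(set(negative_scores)):
        if specificity(negative_scores, tau) >= target:
            return tau
    return math.inf  # no observed value is strict enough


latencies_ms = list(range(1, 101))  # placeholder measurements
summary = {"mean": statistics.fmean(latencies_ms), "p95": p95(latencies_ms)}
```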

---

#### 7.5 Iterate on UX
- [ ] Add tooltips/warnings based on pilot feedback
- [ ] Fix any UX issues (confusing interactions, unclear labels)
- [ ] Update documentation

**Acceptance Criteria:**
- At least 2 UX improvements implemented
- Pilot participants' feedback documented

**Notes:**

---

### Week 7 Acceptance Criteria (Overall)

- [ ] Pilot study completed successfully
- [ ] Thresholds tuned
- [ ] Latency validated (< 250ms)
- [ ] UX improvements identified and implemented

### Blockers

### Decisions Made

---

## Week 8: Main Study Preparation

**Goal:** Finalize study tooling, prepare OSF pre-registration, and set up participant recruitment.

**Status:** 🔴 Not Started

### Tasks

#### 8.1 Survey Integration
- [ ] Integrate SUS, NASA-TLX, SCS scales into dashboard
- [ ] Add pre-survey and post-task mini-surveys
- [ ] Export survey data to CSV

**Files:** `/components/study/SurveyModal.tsx` (new)

**Acceptance Criteria:**
- Surveys embedded in dashboard
- Data exported correctly

**Notes:**

---

#### 8.2 Latin Square Counterbalancing
- [ ] Implement Latin square assignment for task order
- [ ] Randomize condition order (Baseline vs Dashboard)

**Files:** `/lib/study-randomization.ts` (new)

**Acceptance Criteria:**
- Counterbalancing correct (verified manually)
- Participant assigned random ID (P01-P24)

**Notes:**
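A sketch of the assignment logic, written in Python for consistency with the backend examples even though the target file is TypeScript. It uses a cyclic Latin square for task order and a seeded shuffle for condition order; the sequential participant IDs and the `assign` signature are simplifying assumptions.

```python
import random


def latin_square(items):
    """Cyclic Latin square: row r is the item list rotated left by r positions."""
    n = len(items)
    return [[items[(r + c) % n] for c in range(n)] for r in range(n)]


def assign(index, tasks=("T1", "T2"), conditions=("Baseline", "Dashboard"), seed=0):
    """Assignment for the participant at a zero-based enrollment index."""
    square = latin_square(list(tasks))
    task_order = square[index % len(square)]
    # Seeded RNG so the condition order can be regenerated for auditing.
    rng = random.Random(seed * 1000 + index)
    condition_order = list(conditions)
    rng.shuffle(condition_order)
    return {
        "participant": f"P{index + 1:02d}",
        "tasks": task_order,
        "conditions": condition_order,
    }


schedule = [assign(i) for i in range(24)]
```

Because the condition order is shuffled independently per participant, it is not guaranteed to be balanced across the sample; a second Latin square over conditions would enforce balance if that is required.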

---

#### 8.3 OSF Pre-Registration
- [ ] Complete OSF template (Appendix D from spec)
- [ ] Upload task stimuli, exclusion criteria
- [ ] Submit pre-registration

**Files:** `/docs/osf-preregistration.md` (copy of Appendix D)

**Acceptance Criteria:**
- Pre-registration submitted before main study
- DOI obtained

**Notes:**

---

#### 8.4 Export Artifact Bundle
- [ ] Create script to package Run ID, tensors, telemetry
- [ ] Generate `run_pack_P01.zip` for each participant
- [ ] Test import into OSF

**Files:** `/scripts/export_artifact.py` (new)

**Acceptance Criteria:**
- Export script functional
- Bundle includes all necessary files
- Bundle < 100MB per participant

**Notes:**
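The packaging step could look like the sketch below, which zips everything under one `runs/{run_id}/` directory. The function name `export_run` and the choice of DEFLATE compression are assumptions; the output archive should live outside the run directory so it is not swept into itself.

```python
import zipfile
from pathlib import Path


def export_run(run_dir, out_zip):
    """Zip every file under run_dir; return the archive size in bytes."""
    run_dir = Path(run_dir)
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(run_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so the run ID is the top folder.
                zf.write(path, arcname=path.relative_to(run_dir.parent).as_posix())
    return Path(out_zip).stat().st_size


def within_budget(size_bytes, limit_mb=100):
    """Check the bundle against the per-participant size limit."""
    return size_bytes < limit_mb * 1000 * 1000
```

A call such as `export_run("runs/R2025-11-01-1430-a7f3", "run_pack_P01.zip")` would then produce one bundle per participant.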

---

#### 8.5 Participant Recruitment
- [ ] Prepare recruitment email
- [ ] Post to developer communities (Reddit, Hacker News, university mailing lists)
- [ ] Target n=18-24 participants

**Acceptance Criteria:**
- Recruitment materials ready
- At least 10 participants confirmed

**Notes:**

---

### Week 8 Acceptance Criteria (Overall)

- [ ] Study tooling finalized
- [ ] OSF pre-registration submitted
- [ ] Participant recruitment underway
- [ ] Ready to begin main study (Week 9-10)

### Blockers

### Decisions Made

---

## Progress Summary

| Week | Status | Completion Date | Notes |
|------|--------|----------------|-------|
| Week 1-2: Instrumentation | 🟡 In Progress | - | Started 2025-11-01 |
| Week 3: Attention Viz | 🔴 Not Started | - | - |
| Week 4: Token Confidence Viz | 🔴 Not Started | - | - |
| Week 5: Ablation Viz | 🔴 Not Started | - | - |
| Week 6: Pipeline Viz | 🔴 Not Started | - | - |
| Week 7: Pilot Study | 🔴 Not Started | - | - |
| Week 8: Main Study Prep | 🔴 Not Started | - | - |

**Legend:**
- 🟢 Completed
- 🟡 In Progress
- 🔴 Not Started
- 🔵 Blocked

---

## Global Blockers

*None currently*

---

## Key Metrics (Target vs Actual)

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Initial render latency (≤512 tokens) | < 250ms | - | - |
| Interactive update latency | < 150ms | - | - |
| Zarr file size (512 tokens, 32 layers) | < 500MB | - | - |
| Zarr load time (single layer/head) | < 50ms | - | - |
| Attention rollout computation | < 100ms | - | - |
| Ablation execution time | < 3s | - | - |

---

## Notes & Decisions Log

### 2025-11-01
- **Decision:** Using zarr instead of HDF5 for tensor storage due to better chunking and parallel access.
- **Decision:** Targeting top-k=20 heads for ablation UI (performance constraint).
- **Note:** Started Week 1-2 instrumentation tasks.

---

**End of Implementation Tracker**