# Glass‑Box Dashboard: Spec for 4 Visualisations (Attention • Token Size • Ablation • Pipeline)
*Alpha scope targeting Code Llama 7B; MoE routing optional. Designed to support ICML Paper 1 and RQ1.*
**Version:** 1.0
**Date:** 2025-11-01
**Author:** Gary Boon, Northumbria University
**Status:** Implementation-ready specification
---
## 0) Shared principles & constraints
* **Determinism for study:** fix `seed`, decoding params, checkpoint hash; log all knobs.
* **Latency budget:** initial render < 250 ms for ≤512 tokens; interactive updates < 150 ms. Use lazy tensors + downsampling.
* **Reproducibility:** every view binds to a **Run ID**; each action produces a **Replay Script** (YAML) to re‑execute generation/ablations.
* **Privacy:** no proprietary code unless whitelisted; redact file paths; opt‑out for audio/screen capture.
* **Colour semantics:** one consistent palette; uncertainty → desaturated; stronger evidence → higher opacity; avoid misleading rainbows.
### Core model instrumentation (PyTorch/transformers hooks)
* Capture per‑step: logits, logprobs, entropy; attention tensors `A[L,H,T,T]`; residual norms `||x_l||`; FFN activations (optional SAE features); KV‑cache hits; time per layer.
* Store as memmap/`zarr` with chunking `(layer, head)` to keep interaction snappy.
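A minimal capture sketch, assuming a Hugging Face `transformers` causal LM and the zarr v2 API from Appendix C; it uses `output_attentions`/`output_hidden_states` rather than explicit hooks for brevity, and the group layout (`step{N}`, `attn`, attrs) is illustrative:
```python
import torch
import zarr

@torch.no_grad()
def capture_step(model, input_ids, root: zarr.Group, step: int):
    """Capture one decode step into a zarr group (layout illustrative)."""
    out = model(input_ids, output_attentions=True, output_hidden_states=True)
    logprobs = torch.log_softmax(out.logits[0, -1], dim=-1)
    entropy = -(logprobs.exp() * logprobs).sum().item()      # H in nats

    # attentions: one [batch, heads, T, T] tensor per layer -> A[L, H, T, T]
    A = torch.stack([a[0] for a in out.attentions]).half().cpu().numpy()
    g = root.create_group(f"step{step}")
    # chunk by (layer, head) so reading a single head stays cheap
    g.array("attn", A, chunks=(1, 1) + A.shape[2:])
    g.attrs["entropy"] = entropy
    # residual norms ||x_l|| at the newest position, one per layer
    g.attrs["residual_norms"] = [h[0, -1].norm().item() for h in out.hidden_states[1:]]
```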
### Minimal data contract (per token `t_i`; illustrative values)
```json
{
  "id": 37,
  "text": "get_user",
  "bpe": ["get", "_", "user"],
  "byte_len": 8,
  "pos": 37,
  "logprob": -0.22,
  "entropy": 1.08,
  "topk": [{"tok": "(", "p": 0.21}, {"tok": "_", "p": 0.18}, {"tok": ".", "p": 0.12}],
  "attn_in": {"layer": 12, "head": 3, "top_sources": [[31, 0.42], [5, 0.18]]},
  "residual_norm": 3.7,
  "time_ms": 1.8
}
```
---
## 1) Attention Visualisation *(descriptive; hypotheses validated via ablation)*
**Purpose (RQ1):** Make cross‑token influence legible; expose head roles; support causal what‑ifs.
### Primary view
* **Token‑to‑token heatmap** (rows = generated tokens, cols = prompt+context), aggregated or per‑head. Hover a token → highlight top‑k sources; tooltips show exact weights and source spans.
* **Head grid** (Layer × Head matrix): mini‑sparklines per head showing mean attention to classes (delimiters, identifiers, comments). Click → overlays that head on main heatmap.
* **Rollout/flow toggle:** attention rollout (Abnar & Zuidema‑style) vs raw attention.
### Interactions
* **Brush source span** in context → show downstream tokens most impacted (opacity ∝ weight).
* **Compare decode steps:** scrub generation timeline; diff two steps to see shifting sources.
* **Evidence pinning:** pin a pair (source→target) to the **Ablation** pane.
* **Recency bias flag:** highlight cases where >70% of attention mass concentrates on the last 5 tokens (threshold per Appendix B).
### Algorithms & performance
* Precompute per‑token top‑k sources (k=8). Downsample long contexts with landmark tokens (newline, punctuation, identifiers). WebGL canvas for heat.
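A sketch of the top‑k precompute; averaging over layers and heads is one aggregation choice, not mandated above, and per‑head variants would index `A[l, h]` instead:
```python
import torch

def topk_sources(A: torch.Tensor, k: int = 8):
    """Per-target top-k source positions from attention A[L, H, T, T].

    Returns, for each target row, (source position, mean weight) pairs,
    matching the `top_sources` field of the token contract.
    """
    mean_attn = A.mean(dim=(0, 1))                          # [T, T], rows = targets
    w, idx = torch.topk(mean_attn, k=min(k, mean_attn.shape[-1]), dim=-1)
    return [list(zip(i.tolist(), v.tolist())) for i, v in zip(idx, w)]
```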
### Validity checks
* Warn if softmax temperature >1.2 or top‑k sampling active (attention interpretability caveat). Display effective context length.
**Note:** Attention visualisation is **descriptive**; causal claims require validation via ablation (Section 3).
---
## 2) Token Size & Confidence Visualisation
**Purpose:** Reveal how tokenisation granularity (BPE/SentencePiece) interacts with model uncertainty to signal risk during code generation.
### Primary view (Token Bar)
* Sequence rendered as **chips**; **width** = byte length (or BPE merge depth), **opacity** = confidence (e.g., 1 − normalised entropy, or `exp(logprob)`).
* **Top‑k alternatives** on click (with probs) and the **source attention snippet** that justified each alternative.
* **Risk hotspot flags:** identifiers split into **≥3 subwords** *and* local **entropy peaks**.
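A minimal sketch of the hotspot rule against the Section 0 token contract; the 1.5‑nat default mirrors τ_H in Appendix B:
```python
import math

def is_risk_hotspot(token: dict, tau_h: float = 1.5, min_subwords: int = 3) -> bool:
    """Identifier split into >=3 subwords AND a local entropy peak.

    `token` follows the per-token contract in Section 0; the 1.5-nat
    default mirrors tau_H in Appendix B.
    """
    return len(token["bpe"]) >= min_subwords and token["entropy"] >= tau_h

def chip_opacity(token: dict) -> float:
    """One of the two opacity mappings above: exp(logprob) as confidence."""
    return math.exp(token["logprob"])
```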
### Secondary widgets
* **Entropy sparkline** with peaks labelled; toggle to show **calibrated** thresholds for code tokens (keywords/identifiers/operators may differ).
* **Cost/latency estimator:** cumulative decoding time and estimated API‑cost (if remote).
### Interactions
* Click token → show tokenisation, entropy, top‑k; add as constraint to **Ablation** (force/ban token); jump to **Attention** sources.
* Range‑select tokens → aggregate uncertainty and show correlated attention dispersion.
### Metrics & study hooks
* **Bug‑risk AUC** for hotspot flags vs actual error locations.
* **Correlation**: token entropy vs unit‑test failure spans; pre‑reg threshold (e.g., entropy ≥ 1.5 nats).
---
## 3) Ablation Visualisation
**Purpose (causal):** Show what changes when we disable parts of the architecture or constrain outputs.
### Scope constraints (for interactivity)
* Expose only **top‑k heads** (e.g., k=20) ranked by rollout/gradient contribution.
* Allow **layer bypass** for ≤2 layers simultaneously.
* Optional **FFN gate clamp** for a single layer.
* Use a **surrogate regressor** to predict Δlog‑prob before running heavy re‑decodes; queue background executions.
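A sketch of the surrogate idea; the feature set and the sklearn model are illustrative assumptions, not a fixed design:
```python
from sklearn.ensemble import GradientBoostingRegressor

# Sketch: predict the Δlog-prob an ablation would cause from cheap features,
# so heavy re-decodes are queued only for promising masks. Features are
# illustrative: (layer index, head index, rollout contribution, mean weight).
surrogate = GradientBoostingRegressor()

def fit_surrogate(X, y_delta_logprob):
    """X: one feature row per previously executed mask; y: measured Δlog-prob."""
    surrogate.fit(X, y_delta_logprob)

def predicted_impact(features_row) -> float:
    """Cheap estimate shown in the UI before the real ablation runs."""
    return float(surrogate.predict([features_row])[0])
```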
### Controls
* **Head toggles**: Layer×Head matrix with checkboxes (mask to uniform/zero).
* **Layer bypass** and **token constraints** (ban/force).
* **Decoding locks**: temperature/top‑p pinned to baseline.
### Outputs
* **Unified diff** between baseline and ablated generation.
* **Code‑aware metrics:** unit tests passed, **AST parse success**, static‑analysis warnings (ruff/bandit), and **Δlog‑prob** over altered spans (see the sketch after this list).
* **Per‑token delta heat**: Δlogprob/Δentropy; small multiples for most‑impactful heads.
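A minimal sketch of the parse/lint checks; the ruff invocation assumes the CLI from Appendix C, and counting non‑empty stdout lines is a rough proxy for a warning count:
```python
import ast
import subprocess

def code_aware_metrics(src: str, path: str = "candidate.py") -> dict:
    """AST parse success plus a lint-warning count via the ruff CLI.

    bandit and the pytest harness would be wired up the same way.
    """
    try:
        ast.parse(src)
        parses = True
    except SyntaxError:
        parses = False
    with open(path, "w") as fh:
        fh.write(src)
    proc = subprocess.run(["ruff", "check", path], capture_output=True, text=True)
    warnings = len([ln for ln in proc.stdout.splitlines() if ln.strip()])
    return {"ast_parse": parses, "ruff_warnings": warnings}
```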
### Attribution ground truth (for study)
A source token is counted as influential for a generated token if (i) it lies in the top‑k rollout sources **and** (ii) masking the minimal set of heads that carry that source yields Δlog‑prob ≥ τ (e.g., 0.1) or flips a unit‑test outcome.
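A sketch of the Δlog‑prob measurement, assuming a model whose `forward` accepts `head_mask` (GPT‑2‑style HF models do); LLaMA‑family models such as Code Llama would instead need a custom hook that zeroes head outputs:
```python
import torch

def delta_logprob(model, input_ids, target_pos: int, target_id: int, head_mask):
    """Δ = log p_baseline(tok) − log p_ablated(tok) for one generated token.

    `head_mask` is [layers, heads] with 1 = keep, 0 = zero the head; masking
    support varies by architecture (assumption noted in the lead-in).
    """
    def token_logprob(mask):
        with torch.no_grad():
            out = model(input_ids, head_mask=mask)
        # logits at position t-1 predict the token at position t
        lp = torch.log_softmax(out.logits[0, target_pos - 1], dim=-1)
        return lp[target_id].item()

    delta = token_logprob(None) - token_logprob(head_mask)
    return delta   # influential if delta >= tau_delta (0.1 initially, Appendix B)
```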
---
## 4) Pipeline Visualisation
**Purpose:** Expose model pipeline and attribution of latency/uncertainty across stages using **interpretable layer‑level signals**, not raw neuron heatmaps.
### Primary view (Swimlane/Timeline)
* Lanes: **Tokeniser → Embeddings → Layers (block‑stack) → Logits → Sampler → Post‑proc/Tests**.
* For each generated token: rectangles whose **length** reflects time per stage; colour intensity = uncertainty (entropy). Hover → per‑stage stats.
### Layer‑level signals (per token or averaged)
* **Residual‑norm z‑scores** across layers (outlier spikes flagged; see the sketch after this list).
* **Entropy shift** from pre‑ to post‑layer logits.
* **Attention‑flow saturation** (% of attention mass concentrated on top‑m positions).
* **Router load** if MoE: expert IDs + gate weights and imbalance.
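Sketches of the first two signals; μ_l and σ_l come from the pilot corpus as defined in Section 10, and τ_z = 2.0 from Appendix B:
```python
import numpy as np

def residual_z(norms, mu, sigma):
    """z_l = (||x_l|| − μ_l) / σ_l per layer; spikes beyond τ_z get flagged."""
    return (np.asarray(norms) - np.asarray(mu)) / np.asarray(sigma)

def attention_saturation(row, m: int = 5) -> float:
    """Share of one token's attention mass on its top-m source positions."""
    row = np.asarray(row, dtype=float)
    return float(np.sort(row)[-m:].sum() / row.sum())
```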
### Interactions
* Click a token → cross‑highlight in **Attention** and **Token Size & Confidence**.
* **Layer bypass** (≤2 at a time) to test where decisions crystallise; show predicted impact first, then execute queued ablation.
### Operational definitions
* **Bottleneck** = top‑q percentile of per‑layer latency or residual‑norm spikes; correlate with entropy jumps at the sampler.
---
## 5) Study mapping (tasks ↔ visualisations ↔ hypotheses)
* **T1 Code completion (5–15 LOC):** Attention helps source‑of‑truth tracing; Token Size flags risky fragments; Ablation confirms causal role; Pipeline shows latency/entropy spikes.
* **T2 Bug fix from failing tests:** Use Attention to localise misleading context; Ablation to test head responsibility; improved pass‑rate/time.
* **T3 API usage w/ docs:** Token Size shows odd fragmentations of identifiers; Attention confirms copying from docs; Pipeline surfaces sampler uncertainty.
### Measures
* Primary: tests passed, time‑to‑pass, number of ablations invoked, SCS (System Causability Scale) score, trust calibration (Brier score).
* Secondary: SUS for dashboard, NASA‑TLX, qualitative themes.
---
## 6) Telemetry & schema
### Event types
* `run.start|end`, `token.emit`, `viz.attention.hover`, `viz.token_size.click`, `ablation.run`, `pipeline.hover`, `test.run`.
### Minimal log rows
```json
{"event":"token.emit","run":"R2025-10-30-1342","i":37,"tok":"get_user","lp":-0.22,"H":1.08,"time_ms":1.8}
{"event":"ablation.run","mask":[[12,3],[18,7]],"delta":{"tests":-2,"edit_dist":17}}
```
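A minimal writer for such rows; the `ts` wall‑clock field is an addition beyond the minimal rows above, not part of the schema:
```python
import json
import time

def log_event(fh, event: str, run_id: str, **fields):
    """Append one telemetry row to an open JSONL file handle."""
    fh.write(json.dumps({"event": event, "run": run_id, "ts": time.time(), **fields}) + "\n")

# Usage, matching the token.emit row above:
# log_event(f, "token.emit", "R2025-10-30-1342",
#           i=37, tok="get_user", lp=-0.22, H=1.08, time_ms=1.8)
```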
### Storage
* Session JSONL + tensor store (zarr). Export bundle (Run ID, code, tensors, ablation scripts) for reproducibility.
---
## 7) Implementation plan (8‑week alpha)
* **Week 1–2 – Instrumentation**: hooks for attention/residuals; tokenizer stats; timing per stage; zarr writer; minimal API. Add rollout and head ranking.
* **Week 3 – Attention view**: heatmap (WebGL), head grid, rollout; cross‑links; disclaimer that attention is descriptive.
* **Week 4 – Token Size & Confidence view**: chip bar, entropy sparkline, hotspot flags, top‑k.
* **Week 5 – Ablation view**: mask top‑k heads/layers; surrogate predictor; diff viewer; code‑aware metrics.
* **Week 6 – Pipeline view**: swimlane with residual‑z, entropy shift, saturation, latency; layer bypass (≤2).
* **Week 7 – Pilot study (n=3)**: tune thresholds (entropy τ, Δlog‑prob τ); validate latency; add warnings/tooltips.
* **Week 8 – Main study tooling**: surveys, Latin‑square, OSF pre‑reg package, export artefact bundle.
---
## 8) Validity, pre‑registration & reproducibility
* **Validity note:** Attention visualisation is **descriptive**; causal claims are only made when confirmed via **ablation deltas**.
* **Pre‑registration (OSF):** include task pool, counterbalancing, metrics (AUC/Δlog‑prob/tests), exclusion criteria, mixed‑effects analysis, MDES.
* **Reproducibility:** pin seed/checkpoint; publish tensors + telemetry (JSONL + zarr) and replay scripts; anonymise.
---
## 9) Study hypotheses (pre‑reg friendly)
* **H1‑Attn:** Attention+rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8).
* **H2‑Tok:** Entropy×token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis.
* **H3‑Abl:** Ablation tool reduces iterations to a passing solution by ≥20%.
* **H4‑Pipe:** Pipeline summaries improve next‑token prediction and error localisation accuracy.
---
## 10) Measurement appendix (formulas)
* **Entropy**: H = −∑_i p_i log p_i (nats). Threshold τ_H pre‑reg.
* **Residual‑norm z**: z_l = (||x_l|| − μ_l)/σ_l, with μ_l, σ_l estimated on the pilot corpus.
* **Attention rollout**: Ā_l = 0.5·(A_l + I), row‑normalised; A_roll = Ā_L ⋯ Ā_1, composed across layers (Abnar & Zuidema‑style).
* **Attribution Δ**: Δ = log p_baseline(tok) − log p_ablated(tok); influential if Δ ≥ τ_Δ.
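The same formulas in code form; averaging heads inside the rollout is an assumption, and `clamp_min` guards the log:
```python
import torch

def entropy(p: torch.Tensor) -> torch.Tensor:
    """H = −Σ p_i log p_i (nats)."""
    p = p.clamp_min(1e-12)
    return -(p * p.log()).sum(-1)

def attention_rollout(A: torch.Tensor) -> torch.Tensor:
    """A_roll from per-layer attention A[L, H, T, T]: average heads (an
    assumption), fold in residuals with 0.5*(A_l + I), row-normalise, compose."""
    heads_avg = A.mean(dim=1)                     # [L, T, T]
    eye = torch.eye(A.shape[-1])
    roll = eye.clone()
    for a_l in heads_avg:
        a = 0.5 * (a_l + eye)
        a = a / a.sum(dim=-1, keepdim=True)
        roll = a @ roll
    return roll

def attribution_delta(lp_baseline: float, lp_ablated: float) -> float:
    """Δ = log p_baseline(tok) − log p_ablated(tok); influential if Δ ≥ τ_Δ."""
    return lp_baseline - lp_ablated
```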
---
## 11) Power & design guardrails
* Within‑subjects, Latin square; difficulty buckets; record order, LLM familiarity, years' experience.
* Plan for **medium effect** (d≈0.5): target n=18–24; if n≤12, emphasise large effects + rich qualitative analysis.
---
## Appendix A – Summary Table
| Visualization | Opaque Mechanism | Interpretable Representation | Decision Signal (dev-relevant) | Causal Check |
|--------------|------------------|----------------------------|--------------------------------|--------------|
| **Attention** | Multi-head self-attention | Token→token rollout heatmaps + head-role grid | Which context spans steer each generated token; recency vs long-range use | Verify via head mask ablations |
| **Token Size & Confidence** | Softmax over vocab + BPE splits | Token chips: width=bytes, opacity=confidence, entropy sparkline, top-k | Low-confidence identifiers/API calls; multi-split identifiers as risk | Check error rate vs entropy peaks; ablate to flip token |
| **Ablation** | Component causality (heads/layers/FFN) | Toggle masks + unified diff + Δtests/Δlog-prob | Identify critical vs redundant components; localise bug sources | Intrinsic causal by design |
| **Pipeline** | Layerwise transformation | Layer timeline: residual-norm z, entropy shift, latency, (router load) | Where decisions "crystallise"; where errors emerge | Cross-check with layer bypass deltas |
---
## Appendix B – Operational Thresholds
| Parameter | Symbol | Value (Initial) | Tuning Method |
|-----------|--------|----------------|---------------|
| Entropy threshold | τ_H | 1.5 nats | Pilot study (n=3); calibrate to ~90% specificity |
| Log-prob delta | τ_Δ | 0.1 | Ablation sensitivity; adjust for model scale |
| Residual-norm outlier | τ_z | 2.0 σ | Corpus statistics from 100 samples |
| Recency bias threshold | – | 70% of mass on last 5 tokens | Heuristic default; revisit after pilot |
| Top-k heads | k | 20 | Performance constraint; expand if latency permits |
---
## Appendix C – Technical Dependencies
### Backend (Python)
- PyTorch ≥ 2.0
- transformers ≥ 4.30
- zarr ≥ 2.14
- numpy, scipy
- fastapi, uvicorn
### Frontend (Next.js)
- React ≥ 18
- D3.js or Plotly for visualizations
- WebGL for attention heatmaps
- TailwindCSS for styling
### Storage
- Zarr arrays for tensors (chunked by layer, head)
- JSONL for telemetry
- YAML for replay scripts
---
## Appendix D – OSF Pre‑Registration Template (Ready to Copy)
**Title:** Making Transformer Architecture Transparent for Code Generation: A Developer‑Centric Study of Attention, Token Size & Confidence, Ablation, and Pipeline Visualisations
**Principal Investigator:** Gary Boon (Northumbria University)
**Planned Registration Type:** Pre‑Registration (Confirmatory)
### 1. Research Questions and Hypotheses
**RQ1:** How can we transform opaque architectural mechanisms into interpretable visual representations that reveal how LLMs make code‑generation decisions?
**Sub‑Hypotheses:**
- **H1‑Attn:** Attention+rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8).
- **H2‑Tok:** Entropy×token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis.
- **H3‑Abl:** Ablation tool reduces iterations to a passing solution by ≥20%.
- **H4‑Pipe:** Pipeline summaries improve next‑token prediction and error localisation accuracy.
### 2. Design
* **Design Type:** Within‑subjects, Latin square counterbalanced.
* **Conditions:** Baseline (code inspection only) vs Glass‑Box Dashboard (with 4 visualizations).
* **Participants:** n = 18–24 software engineers (2–10 years experience).
* **Tasks:** T1 Code completion (5-15 LOC), T2 Bug fixing from failing tests, T3 API usage with documentation.
* **Covariates:** LLM familiarity (1-7 scale), order (A→B vs B→A), programming language proficiency, years of experience.
### 3. Materials and Stimuli
* **Model:** Code Llama 7B FP16 (specific checkpoint hash recorded).
* **Visualisations:** Attention (heatmap + head grid), Token Size & Confidence (chip bar + entropy sparkline), Ablation (toggle masks + diff), Pipeline (swimlane timeline).
* **Unit‑test harness:** pytest with pre-written test suites.
* **AST/lint tools:** Python `ast` module, ruff, bandit for static analysis.
### 4. Procedure
1. **Consent + pre‑survey** (10 min): demographics, LLM use frequency, programming experience.
2. **Tutorial on dashboard** (15 min): guided walkthrough of each visualization with example.
3. **Task blocks** (40 min): counterbalanced order (Latin square); 2-3 tasks per condition.
4. **Post‑task mini‑survey** (5 min): SCS (System Causability Scale), Trust scale, NASA‑TLX.
5. **Semi-structured interview** (15 min): qualitative feedback on visualizations, workflow integration.
6. **Final SUS** (5 min): System Usability Scale for dashboard.
**Total time:** ~90 minutes per participant.
### 5. Planned Analyses
**Quantitative:**
- **Mixed‑effects models:** condition × task fixed effects with random intercepts for participant and task.
- **Metrics:** Δlog‑prob (ablation impact), tests passed, time‑to‑fix, AUC of the entropy × token‑size hotspot predictor, OR for H1 source‑identification accuracy.
- **Software:** R (lme4) or Python (statsmodels); a model sketch follows below.
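A sketch of the planned model in statsmodels; the summary filename and column names are illustrative, and statsmodels fits a single grouping factor (random intercept per participant), so fully crossed participant/task random effects would be fit in lme4:
```python
import pandas as pd
import statsmodels.formula.api as smf

# Sketch: condition × task fixed effects, random intercept per participant.
df = pd.read_csv("telemetry_summary.csv")
model = smf.mixedlm("time_to_pass ~ condition * task", df, groups=df["participant"])
print(model.fit().summary())
```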
**Qualitative:**
- **Thematic analysis:** Braun & Clarke (2021) 6-phase approach.
- **Coding:** Two researchers independently code transcripts; resolve disagreements via discussion.
- **Themes:** Mental model formation, trust calibration, workflow integration, visualization utility.
### 6. Power Analysis
* **Effect size target:** d = 0.5 (medium effect, Cohen's conventions).
* **α = 0.05, power = 0.8** → n ≈ 21 paired observations (within-subjects).
* **Planned n = 18-24** to account for dropouts and provide adequate power.
### 7. Data Management
* **Telemetry:** JSONL event logs + zarr tensor storage.
* **Audio/screen captures:** stored on separate encrypted volume; opt-out available.
* **Anonymization:** Participant IDs (P01-P24); redact file paths, proprietary code.
* **Publication:** Anonymised artifacts (Run ID bundles, telemetry, survey data) published on OSF upon paper acceptance.
### 8. Ethics and Risk
* **Approval:** Northumbria University Ethics Protocol v1.3 (Interpretability Studies).
* **Risk level:** Minimal. Participants can opt-out anytime; no deception involved.
* **Compensation:** £25 Amazon voucher per participant.
### 9. Exclusion Criteria
* **Pre-registered:**
- < 2 years professional programming experience
- No Python proficiency (self-reported < 4/7)
- Previous participation in pilot study (n=3)
- Incomplete task completion (<50% of tasks)
### 10. Timeline
* **Pilot study (n=3):** Week 7 of implementation (threshold tuning).
* **Pre-registration submission:** End of Week 7 (before main study).
* **Main study (n=18-24):** Week 8-10.
* **Analysis & write-up:** Week 11-16.
---
## Appendix E – Pilot Pack
### E1. Task T1 – Code Completion
**Prompt:** "Write a Python function `sanitize_sql_like(pattern: str)` that escapes SQL LIKE wildcards (%, _) and backslashes."
**Ground Truth Outline:**
```python
def sanitize_sql_like(pattern: str) -> str:
    # Escape backslashes first so the wildcard escapes below aren't double-escaped
    pattern = pattern.replace("\\", "\\\\")
    pattern = pattern.replace("%", "\\%")
    pattern = pattern.replace("_", "\\_")
    return pattern
```
**Unit Tests (`tests/test_sanitize.py`):**
```python
from main import sanitize_sql_like


def test_escape_percent():
    assert sanitize_sql_like("100%") == "100\\%"


def test_escape_underscore():
    assert sanitize_sql_like("user_name") == "user\\_name"


def test_double_escape():
    assert sanitize_sql_like("C:\\path%") == "C:\\\\path\\%"
```
### E2. Task T2 – Bug Fix (Localisation)
**Prompt:** "This function should reverse a string recursively. Find and fix the bug."
```python
def reverse_string(s: str) -> str:
    if len(s) == 1:
        return s
    return s[0] + reverse_string(s[1:])
```
**Expected fix:** `return reverse_string(s[1:]) + s[0]`, with the base case relaxed to `len(s) <= 1` so the empty‑string test passes.
**Unit Tests (`tests/test_reverse.py`):**
```python
from main import reverse_string


def test_simple():
    assert reverse_string("abc") == "cba"


def test_empty():
    assert reverse_string("") == ""
```
### E3. Mini‑Survey Items (Per Task)
**7-point Likert scale (1=Strongly Disagree, 7=Strongly Agree):**
1. I could explain why the model produced this output.
2. I trusted the model's output appropriately.
3. My workload was high for this task.
4. The visualisations were useful for this task.
5. My confidence was well‑calibrated to the code's correctness.
### E4. Pilot Checklist
- [ ] Latency < 300 ms mean for ≤512 tokens.
- [ ] Entropy threshold τ_H tuned (~1.5 nats).
- [ ] Δlog‑prob threshold τ_Δ tuned (~0.1).
- [ ] Verify unit tests pass/fail recorded correctly.
- [ ] Survey completion rate ≥ 90%.
- [ ] Qualitative feedback indicates visualizations are understandable.
### E5. Output Artefacts
**Per participant:**
- `run_pack_P01.zip` → Run ID, tensors (zarr), logs (JSONL), test results, survey responses.
- Import into OSF for data availability statement.
**Aggregate:**
- `pilot_summary.csv` → Metrics, thresholds, latency stats.
- `pilot_feedback.md` → Qualitative themes, suggested improvements.
---
## References
- **Abnar, S., & Zuidema, W. (2020).** Quantifying Attention Flow in Transformers. *ACL*.
- **Braun, V., & Clarke, V. (2021).** Thematic Analysis: A Practical Guide. *SAGE Publications*.
- **Jain, S., & Wallace, B. C. (2019).** Attention is not Explanation. *NAACL*.
- **Kou, Z., et al. (2024).** Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code? *FSE*.
- **Paltenghi, M., et al. (2022).** Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration. *arXiv*.
- **Wang, K., et al. (2022).** Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small. *arXiv*.
- **Zhao, H., et al. (2024).** Explainability for Large Language Models: A Survey. *ACM Transactions on Intelligent Systems and Technology*.
- **Zheng, H., et al. (2025).** Attention Heads of Large Language Models: A Survey. *arXiv*.
---
## Document History
| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | 2025-11-01 | Initial specification document | Gary Boon |
---
**End of Specification Document**