# Glass‑Box Dashboard: Spec for 4 Visualisations (Attention • Token Size • Ablation • Pipeline)
*Alpha scope targeting Code Llama 7B; MoE routing optional. Designed to support ICML Paper 1 and RQ1.*
**Version:** 1.0
**Date:** 2025-11-01
**Author:** Gary Boon, Northumbria University
**Status:** Implementation-ready specification
---
## 0) Shared principles & constraints
* **Determinism for study:** fix `seed`, decoding params, checkpoint hash; log all knobs.
* **Latency budget:** initial render < 250 ms for ≤512 tokens; interactive updates < 150 ms. Use lazy tensors + downsampling.
* **Reproducibility:** every view binds to a **Run ID**; each action produces a **Replay Script** (YAML) to re‑execute generation/ablations.
* **Privacy:** no proprietary code unless whitelisted; redact file paths; opt‑out for audio/screen capture.
* **Colour semantics:** one consistent palette; uncertainty → desaturated; stronger evidence → higher opacity; avoid misleading rainbows.
### Core model instrumentation (PyTorch/transformers hooks)
* Capture per‑step: logits, logprobs, entropy; attention tensors `A[L,H,T,T]`; residual norms `||x_l||`; FFN activations (optional SAE features); KV‑cache hits; time per layer.
* Store as memmap/`zarr` with chunking `(layer, head)` to keep interaction snappy.
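A minimal capture sketch, assuming a Hugging Face `transformers` causal LM and the zarr v2 API from Appendix C; it uses `output_attentions`/`output_hidden_states` rather than explicit hooks for brevity, and the group layout (`step{N}`, `attn`, attrs) is illustrative:
```python
import torch
import zarr

@torch.no_grad()
def capture_step(model, input_ids, root: zarr.Group, step: int):
    """Capture one decode step into a zarr group (layout illustrative)."""
    out = model(input_ids, output_attentions=True, output_hidden_states=True)
    logprobs = torch.log_softmax(out.logits[0, -1], dim=-1)
    entropy = -(logprobs.exp() * logprobs).sum().item()      # H in nats

    # attentions: one [batch, heads, T, T] tensor per layer -> A[L, H, T, T]
    A = torch.stack([a[0] for a in out.attentions]).half().cpu().numpy()
    g = root.create_group(f"step{step}")
    # chunk by (layer, head) so reading a single head stays cheap
    g.array("attn", A, chunks=(1, 1) + A.shape[2:])
    g.attrs["entropy"] = entropy
    # residual norms ||x_l|| at the newest position, one per layer
    g.attrs["residual_norms"] = [h[0, -1].norm().item() for h in out.hidden_states[1:]]
```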
### Minimal data contract (per token `t_i`; illustrative values)
```json
{
  "id": 37,
  "text": "get_user",
  "bpe": ["get", "_", "user"],
  "byte_len": 8,
  "pos": 37,
  "logprob": -0.22,
  "entropy": 1.08,
  "topk": [{"tok": "(", "p": 0.21}, {"tok": "_", "p": 0.18}, {"tok": ".", "p": 0.12}],
  "attn_in": {"layer": 12, "head": 3, "top_sources": [[31, 0.42], [5, 0.18]]},
  "residual_norm": 3.7,
  "time_ms": 1.8
}
```
---
## 1) Attention Visualisation *(descriptive; hypotheses validated via ablation)*
**Purpose (RQ1):** Make cross‑token influence legible; expose head roles; support causal what‑ifs.
### Primary view
* **Token‑to‑token heatmap** (rows = generated tokens, cols = prompt+context), aggregated or per‑head. Hover a token → highlight top‑k sources; tooltips show exact weights and source spans.
* **Head grid** (Layer × Head matrix): mini‑sparklines per head showing mean attention to classes (delimiters, identifiers, comments). Click → overlays that head on main heatmap.
* **Rollout/flow toggle:** attention rollout (Abnar & Zuidema‑style) vs raw attention.
### Interactions
* **Brush source span** in context → show downstream tokens most impacted (opacity ∝ weight).
* **Compare decode steps:** scrub generation timeline; diff two steps to see shifting sources.
* **Evidence pinning:** pin a pair (source→target) to the **Ablation** pane.
* **Recency bias flag:** highlight cases where >70% of attention mass concentrates on the last 5 tokens (threshold per Appendix B).
### Algorithms & performance
* Precompute per‑token top‑k sources (k=8). Downsample long contexts with landmark tokens (newline, punctuation, identifiers). WebGL canvas for heat.
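A sketch of the top‑k precompute; averaging over layers and heads is one aggregation choice, not mandated above, and per‑head variants would index `A[l, h]` instead:
```python
import torch

def topk_sources(A: torch.Tensor, k: int = 8):
    """Per-target top-k source positions from attention A[L, H, T, T].

    Returns, for each target row, (source position, mean weight) pairs,
    matching the `top_sources` field of the token contract.
    """
    mean_attn = A.mean(dim=(0, 1))                          # [T, T], rows = targets
    w, idx = torch.topk(mean_attn, k=min(k, mean_attn.shape[-1]), dim=-1)
    return [list(zip(i.tolist(), v.tolist())) for i, v in zip(idx, w)]
```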
### Validity checks
* Warn if softmax temperature >1.2 or top‑k sampling active (attention interpretability caveat). Display effective context length.
**Note:** Attention visualisation is **descriptive**; causal claims require validation via ablation (Section 3).
---
## 2) Token Size & Confidence Visualisation
**Purpose:** Reveal how tokenisation granularity (BPE/SentencePiece) interacts with model uncertainty to signal risk during code generation.
### Primary view (Token Bar)
* Sequence rendered as **chips**; **width** = byte length (or BPE merge depth), **opacity** = confidence (e.g., 1 − normalised entropy, or `exp(logprob)`).
* **Top‑k alternatives** on click (with probs) and the **source attention snippet** that justified each alternative.
* **Risk hotspot flags:** identifiers split into **≥3 subwords** *and* local **entropy peaks**.
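A minimal sketch of the hotspot rule against the Section 0 token contract; the 1.5‑nat default mirrors τ_H in Appendix B:
```python
import math

def is_risk_hotspot(token: dict, tau_h: float = 1.5, min_subwords: int = 3) -> bool:
    """Identifier split into >=3 subwords AND a local entropy peak.

    `token` follows the per-token contract in Section 0; the 1.5-nat
    default mirrors tau_H in Appendix B.
    """
    return len(token["bpe"]) >= min_subwords and token["entropy"] >= tau_h

def chip_opacity(token: dict) -> float:
    """One of the two opacity mappings above: exp(logprob) as confidence."""
    return math.exp(token["logprob"])
```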
### Secondary widgets
* **Entropy sparkline** with peaks labelled; toggle to show **calibrated** thresholds for code tokens (keywords/identifiers/operators may differ).
* **Cost/latency estimator:** cumulative decoding time and estimated API‑cost (if remote).
### Interactions
* Click token → show tokenisation, entropy, top‑k; add as constraint to **Ablation** (force/ban token); jump to **Attention** sources.
* Range‑select tokens → aggregate uncertainty and show correlated attention dispersion.
### Metrics & study hooks
* **Bug‑risk AUC** for hotspot flags vs actual error locations.
* **Correlation**: token entropy vs unit‑test failure spans; pre‑reg threshold (e.g., entropy ≥ 1.5 nats).
---
## 3) Ablation Visualisation
**Purpose (causal):** Show what changes when we disable parts of the architecture or constrain outputs.
### Scope constraints (for interactivity)
* Expose only **top‑k heads** (e.g., k=20) ranked by rollout/gradient contribution.
* Allow **layer bypass** for ≤2 layers simultaneously.
* Optional **FFN gate clamp** for a single layer.
* Use a **surrogate regressor** to predict Δlog‑prob before running heavy re‑decodes; queue background executions.
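A sketch of the surrogate idea; the feature set and the sklearn model are illustrative assumptions, not a fixed design:
```python
from sklearn.ensemble import GradientBoostingRegressor

# Sketch: predict the Δlog-prob an ablation would cause from cheap features,
# so heavy re-decodes are queued only for promising masks. Features are
# illustrative: (layer index, head index, rollout contribution, mean weight).
surrogate = GradientBoostingRegressor()

def fit_surrogate(X, y_delta_logprob):
    """X: one feature row per previously executed mask; y: measured Δlog-prob."""
    surrogate.fit(X, y_delta_logprob)

def predicted_impact(features_row) -> float:
    """Cheap estimate shown in the UI before the real ablation runs."""
    return float(surrogate.predict([features_row])[0])
```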
### Controls
* **Head toggles**: Layer×Head matrix with checkboxes (mask to uniform/zero).
* **Layer bypass** and **token constraints** (ban/force).
* **Decoding locks**: temperature/top‑p pinned to baseline.
### Outputs
* **Unified diff** between baseline and ablated generation.
* **Code‑aware metrics:** unit tests passed, **AST parse success**, static‑analysis warnings (ruff/bandit), and **Δlog‑prob** over altered spans (see the sketch after this list).
* **Per‑token delta heat**: Δlogprob/Δentropy; small multiples for most‑impactful heads.
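A minimal sketch of the parse/lint checks; the ruff invocation assumes the CLI from Appendix C, and counting non‑empty stdout lines is a rough proxy for a warning count:
```python
import ast
import subprocess

def code_aware_metrics(src: str, path: str = "candidate.py") -> dict:
    """AST parse success plus a lint-warning count via the ruff CLI.

    bandit and the pytest harness would be wired up the same way.
    """
    try:
        ast.parse(src)
        parses = True
    except SyntaxError:
        parses = False
    with open(path, "w") as fh:
        fh.write(src)
    proc = subprocess.run(["ruff", "check", path], capture_output=True, text=True)
    warnings = len([ln for ln in proc.stdout.splitlines() if ln.strip()])
    return {"ast_parse": parses, "ruff_warnings": warnings}
```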
### Attribution ground truth (for study)
A source token is counted as influential for a generated token if (i) it lies in the top‑k rollout sources **and** (ii) masking the minimal set of heads that carry that source yields Δlog‑prob ≥ τ (e.g., 0.1) or flips a unit‑test outcome.
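A sketch of the Δlog‑prob measurement, assuming a model whose `forward` accepts `head_mask` (GPT‑2‑style HF models do); LLaMA‑family models such as Code Llama would instead need a custom hook that zeroes head outputs:
```python
import torch

def delta_logprob(model, input_ids, target_pos: int, target_id: int, head_mask):
    """Δ = log p_baseline(tok) − log p_ablated(tok) for one generated token.

    `head_mask` is [layers, heads] with 1 = keep, 0 = zero the head; masking
    support varies by architecture (assumption noted in the lead-in).
    """
    def token_logprob(mask):
        with torch.no_grad():
            out = model(input_ids, head_mask=mask)
        # logits at position t-1 predict the token at position t
        lp = torch.log_softmax(out.logits[0, target_pos - 1], dim=-1)
        return lp[target_id].item()

    delta = token_logprob(None) - token_logprob(head_mask)
    return delta   # influential if delta >= tau_delta (0.1 initially, Appendix B)
```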
---
## 4) Pipeline Visualisation
**Purpose:** Expose model pipeline and attribution of latency/uncertainty across stages using **interpretable layer‑level signals**, not raw neuron heatmaps.
### Primary view (Swimlane/Timeline)
* Lanes: **Tokeniser → Embeddings → Layers (block‑stack) → Logits → Sampler → Post‑proc/Tests**.
* For each generated token: rectangles whose **length** reflects time per stage; colour intensity = uncertainty (entropy). Hover → per‑stage stats.
### Layer‑level signals (per token or averaged)
* **Residual‑norm z‑scores** across layers (outlier spikes flagged; see the sketch after this list).
* **Entropy shift** from pre‑ to post‑layer logits.
* **Attention‑flow saturation** (% of attention mass concentrated on top‑m positions).
* **Router load** if MoE: expert IDs + gate weights and imbalance.
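Sketches of the first two signals; μ_l and σ_l come from the pilot corpus as defined in Section 10, and τ_z = 2.0 from Appendix B:
```python
import numpy as np

def residual_z(norms, mu, sigma):
    """z_l = (||x_l|| − μ_l) / σ_l per layer; spikes beyond τ_z get flagged."""
    return (np.asarray(norms) - np.asarray(mu)) / np.asarray(sigma)

def attention_saturation(row, m: int = 5) -> float:
    """Share of one token's attention mass on its top-m source positions."""
    row = np.asarray(row, dtype=float)
    return float(np.sort(row)[-m:].sum() / row.sum())
```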
### Interactions
* Click a token → cross‑highlight in **Attention** and **Token Size & Confidence**.
* **Layer bypass** (≤2 at a time) to test where decisions crystallise; show predicted impact first, then execute queued ablation.
### Operational definitions
* **Bottleneck** = top‑q percentile of per‑layer latency or residual‑norm spikes; correlate with entropy jumps at the sampler.
---
## 5) Study mapping (tasks ↔ visualisations ↔ hypotheses)
* **T1 Code completion (5–15 LOC):** Attention helps source‑of‑truth tracing; Token Size flags risky fragments; Ablation confirms causal role; Pipeline shows latency/entropy spikes.
* **T2 Bug fix from failing tests:** Use Attention to localise misleading context; Ablation to test head responsibility; improved pass‑rate/time.
* **T3 API usage w/ docs:** Token Size shows odd fragmentations of identifiers; Attention confirms copying from docs; Pipeline surfaces sampler uncertainty.
### Measures
* Primary: tests passed, time‑to‑pass, number of ablations invoked, SCS (System Causability Scale) score, trust calibration (Brier score).
* Secondary: SUS for dashboard, NASA‑TLX, qualitative themes.
---
## 6) Telemetry & schema
### Event types
* `run.start|end`, `token.emit`, `viz.attention.hover`, `viz.token_size.click`, `ablation.run`, `pipeline.hover`, `test.run`.
### Minimal log rows
```json
{"event":"token.emit","run":"R2025-10-30-1342","i":37,"tok":"get_user","lp":-0.22,"H":1.08,"time_ms":1.8}
{"event":"ablation.run","mask":[[12,3],[18,7]],"delta":{"tests":-2,"edit_dist":17}}
```
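A minimal writer for such rows; the `ts` wall‑clock field is an addition beyond the minimal rows above, not part of the schema:
```python
import json
import time

def log_event(fh, event: str, run_id: str, **fields):
    """Append one telemetry row to an open JSONL file handle."""
    fh.write(json.dumps({"event": event, "run": run_id, "ts": time.time(), **fields}) + "\n")

# Usage, matching the token.emit row above:
# log_event(f, "token.emit", "R2025-10-30-1342",
#           i=37, tok="get_user", lp=-0.22, H=1.08, time_ms=1.8)
```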
### Storage
* Session JSONL + tensor store (zarr). Export bundle (Run ID, code, tensors, ablation scripts) for reproducibility.
---
## 7) Implementation plan (8‑week alpha)
* **Week 1–2 – Instrumentation**: hooks for attention/residuals; tokenizer stats; timing per stage; zarr writer; minimal API. Add rollout and head ranking.
* **Week 3 – Attention view**: heatmap (WebGL), head grid, rollout; cross‑links; disclaimer that attention is descriptive.
* **Week 4 – Token Size & Confidence view**: chip bar, entropy sparkline, hotspot flags, top‑k.
* **Week 5 – Ablation view**: mask top‑k heads/layers; surrogate predictor; diff viewer; code‑aware metrics.
* **Week 6 – Pipeline view**: swimlane with residual‑z, entropy shift, saturation, latency; layer bypass (≤2).
* **Week 7 – Pilot study (n=3)**: tune thresholds (entropy τ, Δlog‑prob τ); validate latency; add warnings/tooltips.
* **Week 8 – Main study tooling**: surveys, Latin‑square, OSF pre‑reg package, export artefact bundle.
---
## 8) Validity, pre‑registration & reproducibility
* **Validity note:** Attention visualisation is **descriptive**; causal claims are only made when confirmed via **ablation deltas**.
* **Pre‑registration (OSF):** include task pool, counterbalancing, metrics (AUC/Δlog‑prob/tests), exclusion criteria, mixed‑effects analysis, MDES.
* **Reproducibility:** pin seed/checkpoint; publish tensors + telemetry (JSONL + zarr) and replay scripts; anonymise.
---
## 9) Study hypotheses (pre‑reg friendly)
* **H1‑Attn:** Attention+rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8).
* **H2‑Tok:** Entropy×token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis.
* **H3‑Abl:** Ablation tool reduces iterations to a passing solution by ≥20%.
* **H4‑Pipe:** Pipeline summaries improve next‑token prediction and error localisation accuracy.
---
## 10) Measurement appendix (formulas)
* **Entropy**: H = −∑_i p_i log p_i (nats). Threshold τ_H pre‑reg.
* **Residual‑norm z**: z_l = (||x_l|| − μ_l)/σ_l, with μ_l, σ_l estimated on the pilot corpus.
* **Attention rollout**: Ā_l = 0.5·(A_l + I), row‑normalised; A_roll = Ā_L ⋯ Ā_1, composed across layers (Abnar & Zuidema‑style).
* **Attribution Δ**: Δ = log p_baseline(tok) − log p_ablated(tok); influential if Δ ≥ τ_Δ.
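The same formulas in code form; averaging heads inside the rollout is an assumption, and `clamp_min` guards the log:
```python
import torch

def entropy(p: torch.Tensor) -> torch.Tensor:
    """H = −Σ p_i log p_i (nats)."""
    p = p.clamp_min(1e-12)
    return -(p * p.log()).sum(-1)

def attention_rollout(A: torch.Tensor) -> torch.Tensor:
    """A_roll from per-layer attention A[L, H, T, T]: average heads (an
    assumption), fold in residuals with 0.5*(A_l + I), row-normalise, compose."""
    heads_avg = A.mean(dim=1)                     # [L, T, T]
    eye = torch.eye(A.shape[-1])
    roll = eye.clone()
    for a_l in heads_avg:
        a = 0.5 * (a_l + eye)
        a = a / a.sum(dim=-1, keepdim=True)
        roll = a @ roll
    return roll

def attribution_delta(lp_baseline: float, lp_ablated: float) -> float:
    """Δ = log p_baseline(tok) − log p_ablated(tok); influential if Δ ≥ τ_Δ."""
    return lp_baseline - lp_ablated
```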
---
## 11) Power & design guardrails
* Within‑subjects, Latin square; difficulty buckets; record order, LLM familiarity, years' experience.
* Plan for **medium effect** (d≈0.5): target n=18–24; if n≤12, emphasise large effects + rich qualitative analysis.
---
## Appendix A – Summary Table
| Visualization | Opaque Mechanism | Interpretable Representation | Decision Signal (dev-relevant) | Causal Check |
|--------------|------------------|----------------------------|--------------------------------|--------------|
| **Attention** | Multi-head self-attention | Token→token rollout heatmaps + head-role grid | Which context spans steer each generated token; recency vs long-range use | Verify via head mask ablations |
| **Token Size & Confidence** | Softmax over vocab + BPE splits | Token chips: width=bytes, opacity=confidence, entropy sparkline, top-k | Low-confidence identifiers/API calls; multi-split identifiers as risk | Check error rate vs entropy peaks; ablate to flip token |
| **Ablation** | Component causality (heads/layers/FFN) | Toggle masks + unified diff + Δtests/Δlog-prob | Identify critical vs redundant components; localise bug sources | Intrinsic causal by design |
| **Pipeline** | Layerwise transformation | Layer timeline: residual-norm z, entropy shift, latency, (router load) | Where decisions "crystallise"; where errors emerge | Cross-check with layer bypass deltas |
---
## Appendix B – Operational Thresholds
| Parameter | Symbol | Value (Initial) | Tuning Method |
|-----------|--------|----------------|---------------|
| Entropy threshold | τ_H | 1.5 nats | Pilot study (n=3); calibrate to ~90% specificity |
| Log-prob delta | τ_Δ | 0.1 | Ablation sensitivity; adjust for model scale |
| Residual-norm outlier | τ_z | 2.0 σ | Corpus statistics from 100 samples |
| Recency bias threshold | – | 70% of mass on last 5 tokens | Heuristic default; revisit after pilot |
| Top-k heads | k | 20 | Performance constraint; expand if latency permits |
---
## Appendix C – Technical Dependencies
### Backend (Python)
- PyTorch ≥ 2.0
- transformers ≥ 4.30
- zarr ≥ 2.14
- numpy, scipy
- fastapi, uvicorn
### Frontend (Next.js)
- React ≥ 18
- D3.js or Plotly for visualizations
- WebGL for attention heatmaps
- TailwindCSS for styling
### Storage
- Zarr arrays for tensors (chunked by layer, head)
- JSONL for telemetry
- YAML for replay scripts
---
## Appendix D – OSF Pre‑Registration Template (Ready to Copy)
**Title:** Making Transformer Architecture Transparent for Code Generation: A Developer‑Centric Study of Attention, Token Size & Confidence, Ablation, and Pipeline Visualisations
**Principal Investigator:** Gary Boon (Northumbria University)
**Planned Registration Type:** Pre‑Registration (Confirmatory)
### 1. Research Questions and Hypotheses
**RQ1:** How can we transform opaque architectural mechanisms into interpretable visual representations that reveal how LLMs make code‑generation decisions?
**Sub‑Hypotheses:**
- **H1‑Attn:** Attention+rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8).
- **H2‑Tok:** Entropy×token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis.
- **H3‑Abl:** Ablation tool reduces iterations to a passing solution by ≥20%.
- **H4‑Pipe:** Pipeline summaries improve next‑token prediction and error localisation accuracy.
### 2. Design
* **Design Type:** Within‑subjects, Latin square counterbalanced.
* **Conditions:** Baseline (code inspection only) vs Glass‑Box Dashboard (with 4 visualizations).
* **Participants:** n = 18–24 software engineers (2–10 years experience).
* **Tasks:** T1 Code completion (5-15 LOC), T2 Bug fixing from failing tests, T3 API usage with documentation.
* **Covariates:** LLM familiarity (1-7 scale), order (A→B vs B→A), programming language proficiency, years of experience.
### 3. Materials and Stimuli
* **Model:** Code Llama 7B FP16 (specific checkpoint hash recorded).
* **Visualisations:** Attention (heatmap + head grid), Token Size & Confidence (chip bar + entropy sparkline), Ablation (toggle masks + diff), Pipeline (swimlane timeline).
* **Unit‑test harness:** pytest with pre-written test suites.
* **AST/lint tools:** Python `ast` module, ruff, bandit for static analysis.
### 4. Procedure
1. **Consent + pre‑survey** (10 min): demographics, LLM use frequency, programming experience.
2. **Tutorial on dashboard** (15 min): guided walkthrough of each visualization with example.
3. **Task blocks** (40 min): counterbalanced order (Latin square); 2-3 tasks per condition.
4. **Post‑task mini‑survey** (5 min): SCS (System Causability Scale), Trust scale, NASA‑TLX.
5. **Semi-structured interview** (15 min): qualitative feedback on visualizations, workflow integration.
6. **Final SUS** (5 min): System Usability Scale for dashboard.
**Total time:** ~90 minutes per participant.
### 5. Planned Analyses
**Quantitative:**
- **Mixed‑effects models:** condition × task fixed effects with random intercepts for participant and task.
- **Metrics:** Δlog‑prob (ablation impact), tests passed, time‑to‑fix, AUC of the entropy × token‑size hotspot predictor, OR for H1 source‑identification accuracy.
- **Software:** R (lme4) or Python (statsmodels); a model sketch follows below.
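A sketch of the planned model in statsmodels; the summary filename and column names are illustrative, and statsmodels fits a single grouping factor (random intercept per participant), so fully crossed participant/task random effects would be fit in lme4:
```python
import pandas as pd
import statsmodels.formula.api as smf

# Sketch: condition × task fixed effects, random intercept per participant.
df = pd.read_csv("telemetry_summary.csv")
model = smf.mixedlm("time_to_pass ~ condition * task", df, groups=df["participant"])
print(model.fit().summary())
```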
**Qualitative:**
- **Thematic analysis:** Braun & Clarke (2021) 6-phase approach.
- **Coding:** Two researchers independently code transcripts; resolve disagreements via discussion.
- **Themes:** Mental model formation, trust calibration, workflow integration, visualization utility.
### 6. Power Analysis
* **Effect size target:** d = 0.5 (medium effect, Cohen's conventions).
* **α = 0.05, power = 0.8** → n ≈ 21 paired observations (within-subjects).
* **Planned n = 18-24** to account for dropouts and provide adequate power.
### 7. Data Management
* **Telemetry:** JSONL event logs + zarr tensor storage.
* **Audio/screen captures:** stored on separate encrypted volume; opt-out available.
* **Anonymization:** Participant IDs (P01-P24); redact file paths, proprietary code.
* **Publication:** Anonymised artifacts (Run ID bundles, telemetry, survey data) published on OSF upon paper acceptance.
### 8. Ethics and Risk
* **Approval:** Northumbria University Ethics Protocol v1.3 (Interpretability Studies).
* **Risk level:** Minimal. Participants can opt-out anytime; no deception involved.
* **Compensation:** £25 Amazon voucher per participant.
### 9. Exclusion Criteria
* **Pre-registered:**
- < 2 years professional programming experience
- No Python proficiency (self-reported < 4/7)
- Previous participation in pilot study (n=3)
- Incomplete task completion (<50% of tasks)
### 10. Timeline
* **Pilot study (n=3):** Week 7 of implementation (threshold tuning).
* **Pre-registration submission:** End of Week 7 (before main study).
* **Main study (n=18-24):** Week 8-10.
* **Analysis & write-up:** Week 11-16.
---
## Appendix E – Pilot Pack
### E1. Task T1 – Code Completion
**Prompt:** "Write a Python function `sanitize_sql_like(pattern: str)` that escapes SQL LIKE wildcards (%, _) and backslashes."
**Ground Truth Outline:**
```python
def sanitize_sql_like(pattern: str) -> str:
    # Escape backslashes first so the wildcard escapes below aren't double-escaped
    pattern = pattern.replace("\\", "\\\\")
    pattern = pattern.replace("%", "\\%")
    pattern = pattern.replace("_", "\\_")
    return pattern
```
**Unit Tests (`tests/test_sanitize.py`):**
```python
from main import sanitize_sql_like


def test_escape_percent():
    assert sanitize_sql_like("100%") == "100\\%"


def test_escape_underscore():
    assert sanitize_sql_like("user_name") == "user\\_name"


def test_double_escape():
    assert sanitize_sql_like("C:\\path%") == "C:\\\\path\\%"
```
### E2. Task T2 – Bug Fix (Localisation)
**Prompt:** "This function should reverse a string recursively. Find and fix the bug."
```python
def reverse_string(s: str) -> str:
    if len(s) == 1:
        return s
    return s[0] + reverse_string(s[1:])
```
**Expected fix:** `return reverse_string(s[1:]) + s[0]`, with the base case relaxed to `len(s) <= 1` so the empty‑string test passes.
**Unit Tests (`tests/test_reverse.py`):**
```python
from main import reverse_string


def test_simple():
    assert reverse_string("abc") == "cba"


def test_empty():
    assert reverse_string("") == ""
```
### E3. Mini‑Survey Items (Per Task)
**7-point Likert scale (1=Strongly Disagree, 7=Strongly Agree):**
1. I could explain why the model produced this output.
2. I trusted the model's output appropriately.
3. My workload was high for this task.
4. The visualisations were useful for this task.
5. My confidence was well‑calibrated to the code's correctness.
### E4. Pilot Checklist
- [ ] Latency < 300 ms mean for ≤512 tokens.
- [ ] Entropy threshold τ_H tuned (~1.5 nats).
- [ ] Δlog‑prob threshold τ_Δ tuned (~0.1).
- [ ] Verify unit tests pass/fail recorded correctly.
- [ ] Survey completion rate ≥ 90%.
- [ ] Qualitative feedback indicates visualizations are understandable.
### E5. Output Artefacts
**Per participant:**
- `run_pack_P01.zip` → Run ID, tensors (zarr), logs (JSONL), test results, survey responses.
- Import into OSF for data availability statement.
**Aggregate:**
- `pilot_summary.csv` → Metrics, thresholds, latency stats.
- `pilot_feedback.md` → Qualitative themes, suggested improvements.
---
## References
- **Abnar, S., & Zuidema, W. (2020).** Quantifying Attention Flow in Transformers. *ACL*.
- **Braun, V., & Clarke, V. (2021).** Thematic Analysis: A Practical Guide. *SAGE Publications*.
- **Jain, S., & Wallace, B. C. (2019).** Attention is not Explanation. *NAACL*.
- **Kou, Z., et al. (2024).** Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code? *FSE*.
- **Paltenghi, M., et al. (2022).** Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration. *arXiv*.
- **Wang, K., et al. (2022).** Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small. *arXiv*.
- **Zhao, H., et al. (2024).** Explainability for Large Language Models: A Survey. *ACM Transactions on Intelligent Systems and Technology*.
- **Zheng, H., et al. (2025).** Attention Heads of Large Language Models: A Survey. *arXiv*.
---
## Document History
| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | 2025-11-01 | Initial specification document | Gary Boon |
---
**End of Specification Document**