Glass‑Box Dashboard: Spec for 4 Visualisations (Attention • Token Size • Ablation • Pipeline)
Alpha scope targeting Code Llama 7B; MoE routing optional. Designed to support ICML Paper 1 and RQ1.
Version: 1.0 | Date: 2025-11-01 | Author: Gary Boon, Northumbria University | Status: Implementation-ready specification
0) Shared principles & constraints
- Determinism for study: fix seed, decoding params, checkpoint hash; log all knobs.
- Latency budget: initial render < 250 ms for ≤512 tokens; interactive updates < 150 ms. Use lazy tensors + downsampling.
- Reproducibility: every view binds to a Run ID; each action produces a Replay Script (YAML) to re‑execute generation/ablations.
- Privacy: no proprietary code unless whitelisted; redact file paths; opt‑out for audio/screen capture.
- Colour semantics: one consistent palette; uncertainty → desaturated; stronger evidence → higher opacity; avoid misleading rainbows.
Core model instrumentation (PyTorch/transformers hooks)
- Capture per‑step: logits, logprobs, entropy; attention tensors A[L,H,T,T]; residual norms ||x_l||; FFN activations (optional SAE features); KV‑cache hits; time per layer.
- Store as memmap/zarr with chunking (layer, head) to keep interaction snappy.
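Illustrative capture sketch (not the final instrumentation): assumes a Hugging Face Code Llama checkpoint with output_attentions/output_hidden_states enabled; array names, chunking, and the run path are placeholder choices.

import torch
import zarr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # checkpoint hash logged separately
tok = AutoTokenizer.from_pretrained(model_id)
# Newer transformers versions may need attn_implementation="eager" to return attentions.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16,
    output_attentions=True, output_hidden_states=True)
model.eval()

store = zarr.open("run_R2025-10-30-1342.zarr", mode="w")

@torch.no_grad()
def capture_step(prompt: str) -> float:
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids)
    # attentions: tuple of [batch, heads, T, T] per layer -> stack into A[L, H, T, T]
    A = torch.stack([a[0] for a in out.attentions]).float().cpu().numpy()
    # hidden_states: tuple of [batch, T, d] per layer -> residual norms ||x_l|| per token
    resid = torch.stack([h[0].norm(dim=-1) for h in out.hidden_states]).float().cpu().numpy()
    store.create_dataset("attn", data=A, chunks=(1, 1, A.shape[2], A.shape[3]), overwrite=True)
    store.create_dataset("residual_norm", data=resid, overwrite=True)
    # Next-token entropy from the final-position logits
    logprobs = torch.log_softmax(out.logits[0, -1], dim=-1)
    return float(-(logprobs.exp() * logprobs).sum())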
Minimal data contract (per token t_i)
{
"id": 37,
"text": "get_user",
"bpe": ["get", "_", "user"],
"byte_len": 8,
"pos": 37,
"logprob": -0.22,
"entropy": 1.08,
"topk": [{"tok":"(","p":0.21}, {"tok":"_","p":0.18}, {"tok":".","p":0.12}],
"attn_in": {"layer": L, "head": H, "top_sources": [[pos, weight], ...]},
"residual_norm": 3.7,
"time_ms": 1.8
}
1) Attention Visualisation (descriptive; hypotheses validated via ablation)
Purpose (RQ1): Make cross‑token influence legible; expose head roles; support causal what‑ifs.
Primary view
- Token‑to‑token heatmap (rows = generated tokens, cols = prompt+context), aggregated or per‑head. Hover a token → highlight top‑k sources; tooltips show exact weights and source spans.
- Head grid (Layer × Head matrix): mini‑sparklines per head showing mean attention to classes (delimiters, identifiers, comments). Click → overlays that head on main heatmap.
- Rollout/flow toggle: attention rollout (Abnar & Zuidema‑style) vs raw attention.
Interactions
- Brush source span in context → show downstream tokens most impacted (opacity ∝ weight).
- Compare decode steps: scrub generation timeline; diff two steps to see shifting sources.
- Evidence pinning: pin a pair (source→target) to the Ablation pane.
- Recency bias flag: highlight cases where >70% of attention mass concentrates on the last 5 tokens.
Algorithms & performance
- Precompute per‑token top‑k sources (k=8). Downsample long contexts with landmark tokens (newline, punctuation, identifiers). WebGL canvas for heat.
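Illustrative top‑k precompute sketch, assuming the stored A[L,H,T,T] array from the instrumentation step; mean‑over‑heads aggregation and k=8 are placeholder choices (swap in a single head or rollout weights as needed).

import numpy as np

def topk_sources(A: np.ndarray, layer: int, k: int = 8):
    # Aggregate heads by mean; rows = target (query) tokens, cols = source (key) tokens.
    W = A[layer].mean(axis=0)                       # [T, T]
    idx = np.argsort(-W, axis=1)[:, :k]             # top-k source positions per target token
    val = np.take_along_axis(W, idx, axis=1)
    return [list(zip(idx[t].tolist(), val[t].round(4).tolist())) for t in range(W.shape[0])]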
Validity checks
- Warn if softmax temperature >1.2 or top‑k sampling active (attention interpretability caveat). Display effective context length.
Note: Attention visualisation is descriptive; causal claims require validation via ablation (Section 3).
2) Token Size & Confidence Visualisation
Purpose: Reveal how tokenisation granularity (BPE/SentencePiece) interacts with model uncertainty to signal risk during code generation.
Primary view (Token Bar)
- Sequence rendered as chips; width = byte length (or BPE merge depth); opacity = confidence, e.g. exp(logprob) or 1 − normalised entropy (raw entropy in nats can exceed 1).
- Top‑k alternatives on click (with probs) and the source attention snippet that justified each alternative.
- Risk hotspot flags: identifiers split into ≥3 subwords and local entropy peaks.
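Illustrative hotspot‑flag sketch operating on the per‑token data contract above; τ_H = 1.5 nats and the ≥3‑subword rule mirror Appendix B but remain pilot‑tunable.

def flag_hotspots(tokens: list[dict], tau_H: float = 1.5, min_splits: int = 3):
    flags = []
    for t in tokens:
        split_risk = len(t.get("bpe", [])) >= min_splits   # identifier split into >=3 subwords
        entropy_peak = t.get("entropy", 0.0) >= tau_H       # local uncertainty peak
        if split_risk or entropy_peak:
            flags.append({"pos": t["pos"], "split_risk": split_risk, "entropy_peak": entropy_peak})
    return flags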
Secondary widgets
- Entropy sparkline with peaks labelled; toggle to show calibrated thresholds for code tokens (keywords/identifiers/operators may differ).
- Cost/latency estimator: cumulative decoding time and estimated API‑cost (if remote).
Interactions
- Click token → show tokenisation, entropy, top‑k; add as constraint to Ablation (force/ban token); jump to Attention sources.
- Range‑select tokens → aggregate uncertainty and show correlated attention dispersion.
Metrics & study hooks
- Bug‑risk AUC for hotspot flags vs actual error locations.
- Correlation: token entropy vs unit‑test failure spans; pre‑reg threshold (e.g., entropy ≥ 1.5 nats).
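Illustrative AUC sketch; assumes scikit‑learn (not in the Appendix C dependency list) plus per‑token hotspot scores and binary error‑span labels from the unit‑test harness.

from sklearn.metrics import roc_auc_score

def bug_risk_auc(hotspot_scores, in_error_span):
    # hotspot_scores: per-token risk score (e.g., entropy x split depth); in_error_span: 0/1 labels
    return roc_auc_score(in_error_span, hotspot_scores)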
3) Ablation Visualisation
Purpose (causal): Show what changes when we disable parts of the architecture or constrain outputs.
Scope constraints (for interactivity)
- Expose only top‑k heads (e.g., k=20) ranked by rollout/gradient contribution.
- Allow layer bypass for ≤2 layers simultaneously.
- Optional FFN gate clamp for a single layer.
- Use a surrogate regressor to predict Δlog‑prob before running heavy re‑decodes; queue background executions.
Controls
- Head toggles: Layer×Head matrix with checkboxes (mask to uniform/zero).
- Layer bypass and token constraints (ban/force).
- Decoding locks: temperature/top‑p pinned to baseline.
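Illustrative head‑masking sketch for the head toggles above: zeroes a head's slice of the o_proj input via a forward pre‑hook; assumes a LLaMA‑style module tree (model.model.layers[i].self_attn.o_proj) and no grouped‑query attention, which holds for Code Llama 7B.

import torch

def mask_heads(model, head_mask: list[tuple[int, int]]):
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    by_layer: dict[int, list[int]] = {}
    for layer, head in head_mask:
        by_layer.setdefault(layer, []).append(head)
    handles = []
    for layer, heads in by_layer.items():
        o_proj = model.model.layers[layer].self_attn.o_proj

        def pre_hook(module, args, heads=heads):
            (x,) = args                                  # [batch, seq, heads * head_dim]
            x = x.clone()
            for h in heads:
                x[..., h * head_dim:(h + 1) * head_dim] = 0.0
            return (x,)

        handles.append(o_proj.register_forward_pre_hook(pre_hook))
    return handles   # call handle.remove() on each to restore the baseline model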
Outputs
- Unified diff between baseline and ablated generation.
- Code‑aware metrics: unit tests passed, AST parse success, static‑analysis warnings (ruff/bandit), and Δlog‑prob over altered spans.
- Per‑token delta heat: Δlogprob/Δentropy; small multiples for most‑impactful heads.
Attribution ground truth (for study)
A source token is influential for a generated token if (i) it lies in the top‑k rollout sources and (ii) masking the minimal set of heads that carry that source yields Δlog‑prob ≥ τ (e.g., 0.1) or flips a unit‑test outcome.
4) Pipeline Visualisation
Purpose: Expose model pipeline and attribution of latency/uncertainty across stages using interpretable layer‑level signals, not raw neuron heatmaps.
Primary view (Swimlane/Timeline)
- Lanes: Tokeniser → Embeddings → Layers (block‑stack) → Logits → Sampler → Post‑proc/Tests.
- For each generated token: rectangles whose length reflects time per stage; colour intensity = uncertainty (entropy). Hover → per‑stage stats.
Layer‑level signals (per token or averaged)
- Residual‑norm z‑scores across layers (outlier spikes flagged).
- Entropy shift from pre‑ to post‑layer logits.
- Attention‑flow saturation (% of attention mass concentrated on top‑m positions).
- Router load if MoE: expert IDs + gate weights and imbalance.
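Illustrative sketch of two of the signals above; assumes per‑layer residual norms plus logit‑lens style pre/post‑layer distributions are already extracted, and per‑layer (μ_l, σ_l) come from the pilot corpus.

import numpy as np

def residual_z(resid_norms: np.ndarray, mu: np.ndarray, sigma: np.ndarray, tau_z: float = 2.0):
    # resid_norms: [L, T]; mu/sigma: per-layer pilot-corpus statistics
    z = (resid_norms - mu[:, None]) / sigma[:, None]
    return z, np.argwhere(np.abs(z) > tau_z)             # outlier (layer, token) pairs

def entropy_shift(probs_pre: np.ndarray, probs_post: np.ndarray):
    # probs_*: [T, V] distributions projected from the residual stream before/after a layer
    H = lambda p: -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)
    return H(probs_post) - H(probs_pre)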
Interactions
- Click a token → cross‑highlight in Attention and Token Size & Confidence.
- Layer bypass (≤2 at a time) to test where decisions crystallise; show predicted impact first, then execute queued ablation.
Operational definitions
- Bottleneck = top‑q percentile of per‑layer latency or residual‑norm spikes; correlate with entropy jumps at the sampler.
5) Study mapping (tasks ↔ visualisations ↔ hypotheses)
- T1 Code completion (5–15 LOC): Attention helps source‑of‑truth tracing; Token Size flags risky fragments; Ablation confirms causal role; Pipeline shows latency/entropy spikes.
- T2 Bug fix from failing tests: Use Attention to localise misleading context; Ablation to test head responsibility; improved pass‑rate/time.
- T3 API usage w/ docs: Token Size shows odd fragmentations of identifiers; Attention confirms copying from docs; Pipeline surfaces sampler uncertainty.
Measures
- Primary: tests passed, time‑to‑pass, number of ablations invoked, SCS causability score, trust calibration (Brier).
- Secondary: SUS for dashboard, NASA‑TLX, qualitative themes.
6) Telemetry & schema
Event types
run.start | run.end, token.emit, viz.attention.hover, viz.token_size.click, ablation.run, pipeline.hover, test.run.
Minimal log rows
{"event":"token.emit","run":"R2025-10-30-1342","i":37,"tok":"get_user","lp":-0.22,"H":1.08,"time_ms":1.8}
{"event":"ablation.run","mask":[[12,3],[18,7]],"delta":{"tests":-2,"edit_dist":17}}
Storage
- Session JSONL + tensor store (zarr). Export bundle (Run ID, code, tensors, ablation scripts) for reproducibility.
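Illustrative JSONL writer matching the log rows above; class and field names are placeholders, not a fixed API.

import json
import time

class TelemetryLog:
    def __init__(self, path: str, run_id: str):
        self.f = open(path, "a", encoding="utf-8")
        self.run_id = run_id

    def emit(self, event: str, **fields):
        row = {"event": event, "run": self.run_id, "ts": time.time(), **fields}
        self.f.write(json.dumps(row) + "\n")
        self.f.flush()                                   # keep rows durable during a session

log = TelemetryLog("session.jsonl", "R2025-10-30-1342")
log.emit("token.emit", i=37, tok="get_user", lp=-0.22, H=1.08, time_ms=1.8)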
7) Implementation plan (8‑week alpha)
- Week 1–2 – Instrumentation: hooks for attention/residuals; tokenizer stats; timing per stage; zarr writer; minimal API. Add rollout and head ranking.
- Week 3 – Attention view: heatmap (WebGL), head grid, rollout; cross‑links; disclaimer that attention is descriptive.
- Week 4 – Token Size & Confidence view: chip bar, entropy sparkline, hotspot flags, top‑k.
- Week 5 – Ablation view: mask top‑k heads/layers; surrogate predictor; diff viewer; code‑aware metrics.
- Week 6 – Pipeline view: swimlane with residual‑z, entropy shift, saturation, latency; layer bypass (≤2).
- Week 7 – Pilot study (n=3): tune thresholds (entropy τ, Δlog‑prob τ); validate latency; add warnings/tooltips.
- Week 8 – Main study tooling: surveys, Latin‑square, OSF pre‑reg package, export artefact bundle.
8) Validity, pre‑registration & reproducibility
- Validity note: Attention visualisation is descriptive; causal claims are only made when confirmed via ablation deltas.
- Pre‑registration (OSF): include task pool, counterbalancing, metrics (AUC/Δlog‑prob/tests), exclusion criteria, mixed‑effects analysis, MDES.
- Reproducibility: pin seed/checkpoint; publish tensors + telemetry (JSONL + zarr) and replay scripts; anonymise.
9) Study hypotheses (pre‑reg friendly)
- H1‑Attn: Attention+rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8).
- H2‑Tok: Entropy×token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis.
- H3‑Abl: Ablation tool reduces iterations to a passing solution by ≥20%.
- H4‑Pipe: Pipeline summaries improve next‑token prediction and error localisation accuracy.
10) Measurement appendix (formulas)
- Entropy: H = −∑_i p_i log p_i (nats). Threshold τ_H pre‑reg.
- Residual‑norm z: z_l = (||x_l|| − μ_l)/σ_l over corpus pilot.
- Attention rollout: Ā_l = 0.5·(A_l + I), row‑normalised, with A_l the head‑averaged attention at layer l; A_roll = Ā_L · … · Ā_1 (matrix product across layers, Abnar & Zuidema‑style).
- Attribution Δ: Δ = log p_baseline(tok) − log p_ablated(tok); influential if Δ ≥ τ_Δ.
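Illustrative reference implementations of the formulas above (rollout uses the residual‑corrected product of head‑averaged layer matrices):

import numpy as np

def entropy(p: np.ndarray) -> float:                     # H = -sum_i p_i log p_i (nats)
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def attention_rollout(A: np.ndarray) -> np.ndarray:
    # A: [L, H, T, T]; average heads, add identity for the residual path, renormalise, compose.
    T = A.shape[-1]
    roll = np.eye(T)
    for layer in A.mean(axis=1):                         # [T, T] per layer
        layer = 0.5 * layer + 0.5 * np.eye(T)
        layer = layer / layer.sum(axis=-1, keepdims=True)
        roll = layer @ roll
    return roll                                          # roll[t, s]: influence of source s on token t

def attribution_delta(logp_baseline: float, logp_ablated: float, tau_delta: float = 0.1):
    delta = logp_baseline - logp_ablated
    return delta, delta >= tau_delta                     # influential if delta >= tau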
11) Power & design guardrails
- Within‑subjects, Latin square; difficulty buckets; record order, LLM familiarity, years' experience.
- Plan for medium effect (d≈0.5): target n=18–24; if n≤12, emphasise large effects + rich qualitative analysis.
Appendix A – Summary Table
| Visualization | Opaque Mechanism | Interpretable Representation | Decision Signal (dev-relevant) | Causal Check |
|---|---|---|---|---|
| Attention | Multi-head self-attention | Token→token rollout heatmaps + head-role grid | Which context spans steer each generated token; recency vs long-range use | Verify via head mask ablations |
| Token Size & Confidence | Softmax over vocab + BPE splits | Token chips: width=bytes, opacity=confidence, entropy sparkline, top-k | Low-confidence identifiers/API calls; multi-split identifiers as risk | Check error rate vs entropy peaks; ablate to flip token |
| Ablation | Component causality (heads/layers/FFN) | Toggle masks + unified diff + Δtests/Δlog-prob | Identify critical vs redundant components; localise bug sources | Intrinsic causal by design |
| Pipeline | Layerwise transformation | Layer timeline: residual-norm z, entropy shift, latency, (router load) | Where decisions "crystallise"; where errors emerge | Cross-check with layer bypass deltas |
Appendix B – Operational Thresholds
| Parameter | Symbol | Value (Initial) | Tuning Method |
|---|---|---|---|
| Entropy threshold | τ_H | 1.5 nats | Pilot study (n=3); calibrate to ~90% specificity |
| Log-prob delta | τ_Δ | 0.1 | Ablation sensitivity; adjust for model scale |
| Residual-norm outlier | τ_z | 2.0 σ | Corpus statistics from 100 samples |
| Recency bias threshold | - | 70% | Arbitrary; flag if >70% attention on last 5 tokens |
| Top-k heads | k | 20 | Performance constraint; expand if latency permits |
Appendix C – Technical Dependencies
Backend (Python)
- PyTorch ≥ 2.0
- transformers ≥ 4.30
- zarr ≥ 2.14
- numpy, scipy
- fastapi, uvicorn
Frontend (Next.js)
- React ≥18
- D3.js or Plotly for visualizations
- WebGL for attention heatmaps
- TailwindCSS for styling
Storage
- Zarr arrays for tensors (chunked by layer, head)
- JSONL for telemetry
- YAML for replay scripts
Appendix D – OSF Pre‑Registration Template (Ready to Copy)
Title: Making Transformer Architecture Transparent for Code Generation: A Developer‑Centric Study of Attention, Token Size & Confidence, Ablation, and Pipeline Visualisations
Principal Investigator: Gary Boon (Northumbria University)
Planned Registration Type: Pre‑Registration (Confirmatory)
1. Research Questions and Hypotheses
RQ1: How can we transform opaque architectural mechanisms into interpretable visual representations that reveal how LLMs make code‑generation decisions?
Sub‑Hypotheses:
- H1‑Attn: Attention+rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8).
- H2‑Tok: Entropy×token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis.
- H3‑Abl: Ablation tool reduces iterations to a passing solution by ≥20%.
- H4‑Pipe: Pipeline summaries improve next‑token prediction and error localisation accuracy.
2. Design
- Design Type: Within‑subjects, Latin square counterbalanced.
- Conditions: Baseline (code inspection only) vs Glass‑Box Dashboard (with 4 visualizations).
- Participants: n = 18–24 software engineers (2–10 years experience).
- Tasks: T1 Code completion (5-15 LOC), T2 Bug fixing from failing tests, T3 API usage with documentation.
- Covariates: LLM familiarity (1-7 scale), order (A→B vs B→A), programming language proficiency, years of experience.
3. Materials and Stimuli
- Model: Code Llama 7B FP16 (specific checkpoint hash recorded).
- Visualisations: Attention (heatmap + head grid), Token Size & Confidence (chip bar + entropy sparkline), Ablation (toggle masks + diff), Pipeline (swimlane timeline).
- Unit‑test harness: pytest with pre-written test suites.
- AST/lint tools: Python ast module, ruff, bandit for static analysis.
4. Procedure
- Consent + pre‑survey (10 min): demographics, LLM use frequency, programming experience.
- Tutorial on dashboard (15 min): guided walkthrough of each visualization with example.
- Task blocks (40 min): counterbalanced order (Latin square); 2-3 tasks per condition.
- Post‑task mini‑survey (5 min): SCS (System Causability Scale), Trust scale, NASA‑TLX.
- Semi-structured interview (15 min): qualitative feedback on visualizations, workflow integration.
- Final SUS (5 min): System Usability Scale for dashboard.
Total time: ~90 minutes per participant.
5. Planned Analyses
Quantitative:
- Mixed‑effects models: condition × task + random intercepts for participant/task.
- Metrics: Δlog‑prob (ablation impact), tests passed, time‑to‑fix, AUC for the Entropy × Token Size hotspot predictor, and OR for source identification accuracy (H1).
- Software: R (lme4) or Python (statsmodels).
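Illustrative statsmodels sketch of the planned mixed‑effects analysis; column names are placeholders, and only a participant random intercept is shown (crossed participant/task intercepts would need lme4 or variance components).

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("task_outcomes.csv")   # assumed columns: participant, task, condition, time_to_fix
model = smf.mixedlm("time_to_fix ~ condition * task", df, groups=df["participant"])
result = model.fit()
print(result.summary())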
Qualitative:
- Thematic analysis: Braun & Clarke (2021) 6-phase approach.
- Coding: Two researchers independently code transcripts; resolve disagreements via discussion.
- Themes: Mental model formation, trust calibration, workflow integration, visualization utility.
6. Power Analysis
- Effect size target: d = 0.5 (medium effect, Cohen's conventions).
- α = 0.05, power = 0.8 → n ≈ 21 paired observations (within-subjects).
- Planned n = 18-24 to account for dropouts and provide adequate power.
7. Data Management
- Telemetry: JSONL event logs + zarr tensor storage.
- Audio/screen captures: stored on separate encrypted volume; opt-out available.
- Anonymization: Participant IDs (P01-P24); redact file paths, proprietary code.
- Publication: Anonymised artifacts (Run ID bundles, telemetry, survey data) published on OSF upon paper acceptance.
8. Ethics and Risk
- Approval: Northumbria University Ethics Protocol v1.3 (Interpretability Studies).
- Risk level: Minimal. Participants can opt-out anytime; no deception involved.
- Compensation: £25 Amazon voucher per participant.
9. Exclusion Criteria
- Pre-registered:
- < 2 years professional programming experience
- No Python proficiency (self-reported < 4/7)
- Previous participation in pilot study (n=3)
- Incomplete task completion (<50% of tasks)
10. Timeline
- Pilot study (n=3): Week 7 of implementation (threshold tuning).
- Pre-registration submission: End of Week 7 (before main study).
- Main study (n=18-24): Week 8-10.
- Analysis & write-up: Week 11-16.
Appendix E – Pilot Pack
E1. Task T1 – Code Completion
Prompt: "Write a Python function sanitize_sql_like(pattern: str) that escapes SQL LIKE wildcards (%, _) and backslashes."
Ground Truth Outline:
def sanitize_sql_like(pattern: str) -> str:
    pattern = pattern.replace("\\", "\\\\")
    pattern = pattern.replace("%", "\\%")
    pattern = pattern.replace("_", "\\_")
    return pattern
Unit Tests (tests/test_sanitize.py):
from main import sanitize_sql_like

def test_escape_percent():
    assert sanitize_sql_like("100%") == "100\\%"

def test_escape_underscore():
    assert sanitize_sql_like("user_name") == "user\\_name"

def test_double_escape():
    assert sanitize_sql_like("C:\\path%") == "C:\\\\path\\%"
E2. Task T2 – Bug Fix (Localisation)
Prompt: "This function should reverse a string recursively. Find and fix the bug."
def reverse_string(s: str) -> str:
    if len(s) == 1:
        return s
    return s[0] + reverse_string(s[1:])
Expected fix: return reverse_string(s[1:]) + s[0]; the base case must also become if len(s) <= 1: return s so the empty-string test passes.
Unit Tests (tests/test_reverse.py):
from main import reverse_string

def test_simple():
    assert reverse_string("abc") == "cba"

def test_empty():
    assert reverse_string("") == ""
E3. Mini‑Survey Items (Per Task)
7-point Likert scale (1=Strongly Disagree, 7=Strongly Agree):
- I could explain why the model produced this output.
- I trusted the model's output appropriately.
- My workload was high for this task.
- The visualisations were useful for this task.
- My confidence was well‑calibrated to the code's correctness.
E4. Pilot Checklist
- Latency < 300 ms mean for ≤512 tokens.
- Entropy threshold τ_H tuned (~1.5 nats).
- Δlog‑prob threshold τ_Δ tuned (~0.1).
- Verify unit tests pass/fail recorded correctly.
- Survey completion rate ≥ 90%.
- Qualitative feedback indicates visualizations are understandable.
E5. Output Artefacts
Per participant:
- run_pack_P01.zip → Run ID, tensors (zarr), logs (JSONL), test results, survey responses.
- Import into OSF for data availability statement.
Aggregate:
- pilot_summary.csv → Metrics, thresholds, latency stats.
- pilot_feedback.md → Qualitative themes, suggested improvements.
References
- Abnar, S., & Zuidema, W. (2020). Quantifying Attention Flow in Transformers. ACL.
- Jain, S., & Wallace, B. C. (2019). Attention is not Explanation. NAACL.
- Kou, Z., et al. (2024). Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code? FSE.
- Paltenghi, M., et al. (2022). Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration. arXiv.
- Zheng, H., et al. (2025). Attention Heads of Large Language Models: A Survey. arXiv.
- Zhao, H., et al. (2024). Explainability for Large Language Models: A Survey. ACM Digital Library.
- Braun, V., & Clarke, V. (2021). Thematic Analysis: A Practical Guide. SAGE Publications.
- Wang, K., et al. (2022). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small. arXiv.
Document History
| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2025-11-01 | Initial specification document | Gary Boon |
End of Specification Document