# Glass‑Box Dashboard: Spec for 4 Visualisations (Attention • Token Size • Ablation • Pipeline)

*Alpha scope targeting Code Llama 7B; MoE routing optional. Designed to support ICML Paper 1 and RQ1.*

**Version:** 1.0
**Date:** 2025-11-01
**Author:** Gary Boon, Northumbria University
**Status:** Implementation-ready specification

---

## 0) Shared principles & constraints

* **Determinism for study:** fix `seed`, decoding params, and the checkpoint hash; log all knobs.
* **Latency budget:** initial render < 250 ms for ≤512 tokens; interactive updates < 150 ms. Use lazy tensors + downsampling.
* **Reproducibility:** every view binds to a **Run ID**; each action produces a **Replay Script** (YAML) to re‑execute generation/ablations.
* **Privacy:** no proprietary code unless whitelisted; redact file paths; opt‑out for audio/screen capture.
* **Colour semantics:** one consistent palette; uncertainty → desaturated; stronger evidence → higher opacity; avoid misleading rainbow palettes.

### Core model instrumentation (PyTorch/transformers hooks)

* Capture per‑step: logits, logprobs, entropy; attention tensors `A[L,H,T,T]`; residual norms `||x_l||`; FFN activations (optional SAE features); KV‑cache hits; time per layer.
* Store as memmap/`zarr` with chunking `(layer, head)` to keep interaction snappy.

### Minimal data contract (per token `t_i`)

```json
{
  "id": 37,
  "text": "get_user",
  "bpe": ["get", "_", "user"],
  "byte_len": 8,
  "pos": 37,
  "logprob": -0.22,
  "entropy": 1.08,
  "topk": [{"tok": "(", "p": 0.21}, {"tok": "_", "p": 0.18}, {"tok": ".", "p": 0.12}],
  "attn_in": {"layer": L, "head": H, "top_sources": [[pos, weight], ...]},
  "residual_norm": 3.7,
  "time_ms": 1.8
}
```

---

## 1) Attention Visualisation *(descriptive; hypotheses validated via ablation)*

**Purpose (RQ1):** Make cross‑token influence legible; expose head roles; support causal what‑ifs.

### Primary view

* **Token‑to‑token heatmap** (rows = generated tokens, cols = prompt + context), aggregated or per‑head. Hover a token → highlight its top‑k sources; tooltips show exact weights and source spans.
* **Head grid** (Layer × Head matrix): mini‑sparklines per head showing mean attention to token classes (delimiters, identifiers, comments). Click → overlay that head on the main heatmap.
* **Rollout/flow toggle:** attention rollout (Abnar & Zuidema, 2020) vs raw attention.

### Interactions

* **Brush a source span** in the context → show the downstream tokens most impacted (opacity ∝ weight).
* **Compare decode steps:** scrub the generation timeline; diff two steps to see how sources shift.
* **Evidence pinning:** pin a (source → target) pair to the **Ablation** pane.
* **Recency‑bias flag:** highlight cases where >70% of the attention mass concentrates on the last 5 tokens.

### Algorithms & performance

* Precompute per‑token top‑k sources (k=8). Downsample long contexts with landmark tokens (newlines, punctuation, identifiers). Render the heatmap on a WebGL canvas.

### Validity checks

* Warn if the softmax temperature is >1.2 or top‑k sampling is active (attention‑interpretability caveat). Display the effective context length.

**Note:** Attention visualisation is **descriptive**; causal claims require validation via ablation (Section 3).
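The rollout toggled above can be computed directly from the captured `A[L,H,T,T]` tensor (§0). The following is a non‑normative sketch: the function name is a placeholder, and the ½(A + I) residual correction follows Abnar & Zuidema (2020).

```python
# Minimal rollout sketch, assuming the captured tensor layout A[L, H, T, T]
# (per-layer, per-head post-softmax attention). Names are illustrative.
import torch

def attention_rollout(attn: torch.Tensor) -> torch.Tensor:
    """Compose head-averaged attention across layers (Abnar & Zuidema, 2020).

    attn: [L, H, T, T] post-softmax weights. Returns a [T, T] matrix whose
    entry [t, s] estimates the attention mass flowing from source s to token t.
    """
    a = attn.mean(dim=1)                          # average heads -> [L, T, T]
    eye = torch.eye(a.size(-1), device=a.device)
    a = 0.5 * a + 0.5 * eye                       # account for residual streams
    a = a / a.sum(dim=-1, keepdim=True)           # re-normalise rows
    rollout = a[0]
    for layer in a[1:]:                           # compose later layers on top
        rollout = layer @ rollout
    return rollout

# Per-token top-k sources (k=8), as precomputed for the heatmap:
# values, positions = attention_rollout(attn)[t].topk(8)
```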
---

## 2) Token Size & Confidence Visualisation

**Purpose:** Reveal how tokenisation granularity (BPE/SentencePiece) interacts with model uncertainty to signal risk during code generation.

### Primary view (Token Bar)

* Sequence rendered as **chips**; **width** = byte length (or BPE merge depth), **opacity** = confidence (`exp(logprob)`, or a normalised 1 − H/H_max so the value stays in [0, 1]).
* **Top‑k alternatives** on click (with probabilities) and the **source attention snippet** that justified each alternative.
* **Risk hotspot flags:** identifiers split into **≥3 subwords** *and* coinciding with local **entropy peaks**.

### Secondary widgets

* **Entropy sparkline** with peaks labelled; toggle to show **calibrated** thresholds per token class (keywords/identifiers/operators may differ).
* **Cost/latency estimator:** cumulative decoding time and estimated API cost (if remote).

### Interactions

* Click a token → show its tokenisation, entropy, and top‑k; add it as a constraint to **Ablation** (force/ban token); jump to its **Attention** sources.
* Range‑select tokens → aggregate uncertainty and show correlated attention dispersion.

### Metrics & study hooks

* **Bug‑risk AUC** for hotspot flags vs actual error locations.
* **Correlation:** token entropy vs unit‑test failure spans; pre‑registered threshold (e.g., entropy ≥ 1.5 nats).

---

## 3) Ablation Visualisation

**Purpose (causal):** Show what changes when we disable parts of the architecture or constrain outputs.

### Scope constraints (for interactivity)

* Expose only the **top‑k heads** (e.g., k=20) ranked by rollout/gradient contribution.
* Allow **layer bypass** for ≤2 layers simultaneously.
* Optional **FFN gate clamp** for a single layer.
* Use a **surrogate regressor** to predict Δlog‑prob before running heavy re‑decodes; queue background executions.

### Controls

* **Head toggles:** Layer × Head matrix with checkboxes (mask to uniform or zero).
* **Layer bypass** and **token constraints** (ban/force).
* **Decoding locks:** temperature/top‑p pinned to the baseline.

### Outputs

* **Unified diff** between the baseline and ablated generations.
* **Code‑aware metrics:** unit tests passed, **AST parse success**, static‑analysis warnings (ruff/bandit), and **Δlog‑prob** over altered spans.
* **Per‑token delta heat:** Δlogprob/Δentropy; small multiples for the most impactful heads.

### Attribution ground truth (for study)

A source token is influential for a generated token if (i) it lies in the top‑k rollout sources **and** (ii) masking the minimal set of heads that carry that source yields Δlog‑prob ≥ τ (e.g., 0.1) or flips a unit‑test outcome.
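One concrete route to the head masking above is to zero a head's slice of the concatenated attention output before the output projection. The sketch below assumes the Hugging Face `LlamaForCausalLM` module layout (`model.model.layers[l].self_attn.o_proj`); the helper name and hook point are illustrative, not part of this spec.

```python
# Minimal head-ablation sketch for a Llama-style checkpoint via forward
# pre-hooks on the attention output projection. Adjust module paths for
# other architectures.
import torch

def mask_heads(model, heads, head_dim):
    """Zero the contribution of (layer, head) pairs before o_proj.

    heads: iterable of (layer_idx, head_idx) pairs. Returns hook handles;
    call .remove() on each handle to restore the baseline model.
    """
    by_layer = {}
    for l, h in heads:
        by_layer.setdefault(l, []).append(h)
    handles = []
    for l, hs in by_layer.items():
        def pre_hook(module, args, hs=hs):
            x = args[0].clone()                      # [B, T, H * head_dim]
            for h in hs:                             # zero this head's slice
                x[..., h * head_dim:(h + 1) * head_dim] = 0.0
            return (x,)
        proj = model.model.layers[l].self_attn.o_proj
        handles.append(proj.register_forward_pre_hook(pre_hook))
    return handles

# Δlog-prob (see §10): decode the same prompt with and without the mask and
# compare log p(tok) at the target position.
```

Masking at the `o_proj` input leaves the softmax untouched, so the surviving heads' weights are not renormalised; the "mask to uniform" option would instead require patching the attention weights themselves.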
---

## 4) Pipeline Visualisation

**Purpose:** Expose the model pipeline and attribute latency/uncertainty across stages using **interpretable layer‑level signals**, not raw neuron heatmaps.

### Primary view (Swimlane/Timeline)

* Lanes: **Tokeniser → Embeddings → Layers (block‑stack) → Logits → Sampler → Post‑proc/Tests**.
* For each generated token: rectangles whose **length** reflects time per stage; colour intensity = uncertainty (entropy). Hover → per‑stage stats.

### Layer‑level signals (per token or averaged)

* **Residual‑norm z‑scores** across layers (outlier spikes flagged).
* **Entropy shift** from pre‑ to post‑layer logits.
* **Attention‑flow saturation** (% of attention mass concentrated on the top‑m positions).
* **Router load** if MoE: expert IDs + gate weights and load imbalance.

### Interactions

* Click a token → cross‑highlight it in **Attention** and **Token Size & Confidence**.
* **Layer bypass** (≤2 at a time) to test where decisions crystallise; show the predicted impact first, then execute the queued ablation.

### Operational definitions

* **Bottleneck** = top‑q percentile of per‑layer latency or residual‑norm spikes; correlate with entropy jumps at the sampler.

---

## 5) Study mapping (tasks ↔ visualisations ↔ hypotheses)

* **T1 Code completion (5–15 LOC):** Attention helps source‑of‑truth tracing; Token Size flags risky fragments; Ablation confirms causal roles; Pipeline shows latency/entropy spikes.
* **T2 Bug fix from failing tests:** use Attention to localise misleading context and Ablation to test head responsibility; outcomes are pass rate and time‑to‑fix.
* **T3 API usage w/ docs:** Token Size shows odd fragmentations of identifiers; Attention confirms copying from docs; Pipeline surfaces sampler uncertainty.

### Measures

* Primary: tests passed, time‑to‑pass, number of ablations invoked, SCS causability score, trust calibration (Brier score).
* Secondary: SUS for the dashboard, NASA‑TLX, qualitative themes.

---

## 6) Telemetry & schema

### Event types

* `run.start|end`, `token.emit`, `viz.attention.hover`, `viz.token_size.click`, `ablation.run`, `pipeline.hover`, `test.run`.

### Minimal log rows

```json
{"event":"token.emit","run":"R2025-10-30-1342","i":37,"tok":"get_user","lp":-0.22,"H":1.08,"time_ms":1.8}
{"event":"ablation.run","mask":[[12,3],[18,7]],"delta":{"tests":-2,"edit_dist":17}}
```

### Storage

* Session JSONL + tensor store (zarr). Export a bundle (Run ID, code, tensors, ablation scripts) for reproducibility.

---

## 7) Implementation plan (8‑week alpha)

* **Weeks 1–2 – Instrumentation:** hooks for attention/residuals; tokenizer stats; per‑stage timing; zarr writer; minimal API. Add rollout and head ranking.
* **Week 3 – Attention view:** heatmap (WebGL), head grid, rollout; cross‑links; disclaimer that attention is descriptive.
* **Week 4 – Token Size & Confidence view:** chip bar, entropy sparkline, hotspot flags, top‑k.
* **Week 5 – Ablation view:** mask top‑k heads/layers; surrogate predictor; diff viewer; code‑aware metrics.
* **Week 6 – Pipeline view:** swimlane with residual‑z, entropy shift, saturation, latency; layer bypass (≤2).
* **Week 7 – Pilot study (n=3):** tune thresholds (entropy τ_H, Δlog‑prob τ_Δ); validate latency; add warnings/tooltips.
* **Week 8 – Main study tooling:** surveys, Latin square, OSF pre‑reg package, exportable artefact bundle.

---

## 8) Validity, pre‑registration & reproducibility

* **Validity note:** Attention visualisation is **descriptive**; causal claims are only made when confirmed via **ablation deltas**.
* **Pre‑registration (OSF):** include the task pool, counterbalancing, metrics (AUC/Δlog‑prob/tests), exclusion criteria, mixed‑effects analysis, MDES.
* **Reproducibility:** pin seed/checkpoint; publish tensors + telemetry (JSONL + zarr) and replay scripts; anonymise.

---

## 9) Study hypotheses (pre‑reg friendly)

* **H1‑Attn:** Attention + rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8).
* **H2‑Tok:** Entropy × token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis.
* **H3‑Abl:** The ablation tool reduces iterations to a passing solution by ≥20%.
* **H4‑Pipe:** Pipeline summaries improve next‑token prediction and error‑localisation accuracy.

---

## 10) Measurement appendix (formulas)

* **Entropy:** H = −∑_i p_i log p_i (nats). Threshold τ_H pre‑registered.
* **Residual‑norm z:** z_l = (||x_l|| − μ_l)/σ_l, with μ_l, σ_l estimated on a pilot corpus.
* **Attention rollout:** Ā_l = ½(A_l + I), rows renormalised; A_roll = Ā_L · Ā_{L−1} ⋯ Ā_1 (Abnar & Zuidema, 2020).
* **Attribution Δ:** Δ = log p_baseline(tok) − log p_ablated(tok); influential if Δ ≥ τ_Δ.
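A minimal numpy transcription of these quantities (names illustrative; the rollout itself is sketched in Section 1):

```python
# Sketch of the §10 measurement formulas; array names are illustrative.
import numpy as np

def entropy(p: np.ndarray) -> float:
    """H = -sum_i p_i log p_i in nats, for a next-token distribution p."""
    p = p[p > 0]                      # avoid log(0)
    return float(-(p * np.log(p)).sum())

def residual_z(norms: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """z_l = (||x_l|| - mu_l) / sigma_l, with mu/sigma from the pilot corpus."""
    return (norms - mu) / sigma

def attribution_delta(logp_baseline: float, logp_ablated: float) -> float:
    """Delta = log p_baseline(tok) - log p_ablated(tok); influential if >= tau."""
    return logp_baseline - logp_ablated

TAU_H, TAU_DELTA = 1.5, 0.1           # initial thresholds from Appendix B
```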
---

## 11) Power & design guardrails

* Within‑subjects, Latin square; difficulty buckets; record order, LLM familiarity, and years of experience.
* Plan for a **medium effect** (d ≈ 0.5): target n = 18–24; if n ≤ 12, emphasise large effects + rich qualitative analysis.

---

## Appendix A – Summary Table

| Visualisation | Opaque Mechanism | Interpretable Representation | Decision Signal (dev‑relevant) | Causal Check |
|---------------|------------------|------------------------------|--------------------------------|--------------|
| **Attention** | Multi‑head self‑attention | Token→token rollout heatmaps + head‑role grid | Which context spans steer each generated token; recency vs long‑range use | Verify via head‑mask ablations |
| **Token Size & Confidence** | Softmax over vocab + BPE splits | Token chips: width = bytes, opacity = confidence, entropy sparkline, top‑k | Low‑confidence identifiers/API calls; multi‑split identifiers as risk | Check error rate vs entropy peaks; ablate to flip token |
| **Ablation** | Component causality (heads/layers/FFN) | Toggle masks + unified diff + Δtests/Δlog‑prob | Identify critical vs redundant components; localise bug sources | Intrinsically causal by design |
| **Pipeline** | Layerwise transformation | Layer timeline: residual‑norm z, entropy shift, latency, (router load) | Where decisions "crystallise"; where errors emerge | Cross‑check with layer‑bypass deltas |

---

## Appendix B – Operational Thresholds

| Parameter | Symbol | Value (initial) | Tuning Method |
|-----------|--------|-----------------|---------------|
| Entropy threshold | τ_H | 1.5 nats | Pilot study (n=3); calibrate to ~90% specificity |
| Log‑prob delta | τ_Δ | 0.1 | Ablation sensitivity; adjust for model scale |
| Residual‑norm outlier | τ_z | 2.0 σ | Corpus statistics from 100 samples |
| Recency‑bias threshold | – | 70% | Heuristic starting point; flag if >70% attention on last 5 tokens |
| Top‑k heads | k | 20 | Performance constraint; expand if latency permits |

---

## Appendix C – Technical Dependencies

### Backend (Python)

- PyTorch ≥ 2.0
- transformers ≥ 4.30
- zarr ≥ 2.14
- numpy, scipy
- fastapi, uvicorn

### Frontend (Next.js)

- React ≥ 18
- D3.js or Plotly for visualisations
- WebGL for attention heatmaps
- TailwindCSS for styling

### Storage

- Zarr arrays for tensors (chunked by layer, head)
- JSONL for telemetry
- YAML for replay scripts
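To make §0's "minimal API" concrete against these dependencies, the sketch below serves one attention row from the chunked zarr store. The route, store path, and array layout are assumptions for illustration only.

```python
# Illustrative read endpoint for the minimal API (§0), assuming a per-run
# zarr array attn[L, H, T, T] chunked by (layer, head). Paths and route
# names are placeholders, not part of this spec.
import numpy as np
import zarr
from fastapi import FastAPI

app = FastAPI()

@app.get("/runs/{run_id}/attention")
def attention_slice(run_id: str, layer: int, head: int, token: int, k: int = 8):
    """Return the top-k source positions for one (layer, head, target token).

    Reads a single (layer, head) chunk, keeping latency inside the
    interaction budget.
    """
    store = zarr.open(f"runs/{run_id}/attn.zarr", mode="r")
    row = np.asarray(store[layer, head, token, :])   # one attention row
    top = np.argsort(row)[::-1][:k]                  # top-k sources by weight
    return {"run": run_id,
            "top_sources": [[int(i), float(row[i])] for i in top]}
```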
---

## Appendix D – OSF Pre‑Registration Template (Ready to Copy)

**Title:** Making Transformer Architecture Transparent for Code Generation: A Developer‑Centric Study of Attention, Token Size & Confidence, Ablation, and Pipeline Visualisations

**Principal Investigator:** Gary Boon (Northumbria University)
**Planned Registration Type:** Pre‑Registration (Confirmatory)

### 1. Research Questions and Hypotheses

**RQ1:** How can we transform opaque architectural mechanisms into interpretable visual representations that reveal how LLMs make code‑generation decisions?

**Sub‑Hypotheses:**

- **H1‑Attn:** Attention + rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8).
- **H2‑Tok:** Entropy × token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis.
- **H3‑Abl:** The ablation tool reduces iterations to a passing solution by ≥20%.
- **H4‑Pipe:** Pipeline summaries improve next‑token prediction and error‑localisation accuracy.

### 2. Design

* **Design Type:** Within‑subjects, Latin‑square counterbalanced.
* **Conditions:** Baseline (code inspection only) vs Glass‑Box Dashboard (all four visualisations).
* **Participants:** n = 18–24 software engineers (2–10 years of experience).
* **Tasks:** T1 code completion (5–15 LOC), T2 bug fixing from failing tests, T3 API usage with documentation.
* **Covariates:** LLM familiarity (1–7 scale), order (A→B vs B→A), programming‑language proficiency, years of experience.

### 3. Materials and Stimuli

* **Model:** Code Llama 7B FP16 (specific checkpoint hash recorded).
* **Visualisations:** Attention (heatmap + head grid), Token Size & Confidence (chip bar + entropy sparkline), Ablation (toggle masks + diff), Pipeline (swimlane timeline).
* **Unit‑test harness:** pytest with pre‑written test suites.
* **AST/lint tools:** Python `ast` module, ruff, and bandit for static analysis.

### 4. Procedure

1. **Consent + pre‑survey** (10 min): demographics, LLM use frequency, programming experience.
2. **Dashboard tutorial** (15 min): guided walkthrough of each visualisation with an example.
3. **Task blocks** (40 min): counterbalanced order (Latin square); 2–3 tasks per condition.
4. **Post‑task mini‑survey** (5 min): SCS (System Causability Scale), trust scale, NASA‑TLX.
5. **Semi‑structured interview** (15 min): qualitative feedback on visualisations and workflow integration.
6. **Final SUS** (5 min): System Usability Scale for the dashboard.

**Total time:** ~90 minutes per participant.

### 5. Planned Analyses

**Quantitative:**

- **Mixed‑effects models:** condition × task fixed effects + random intercepts for participant/task.
- **Metrics:** Δlog‑prob (ablation impact), tests passed, time‑to‑fix, AUC of the Entropy × Token‑Size hotspot predictor, OR for H1 (source‑identification accuracy).
- **Software:** R (lme4) or Python (statsmodels).

**Qualitative:**

- **Thematic analysis:** Braun & Clarke (2021) six‑phase approach.
- **Coding:** two researchers independently code transcripts; disagreements resolved by discussion.
- **Anticipated themes:** mental‑model formation, trust calibration, workflow integration, visualisation utility.

### 6. Power Analysis

* **Effect‑size target:** d = 0.5 (medium effect, Cohen's conventions).
* **α = 0.05, power = 0.8** → n ≈ 21 paired observations (within‑subjects).
* **Planned n = 18–24** to account for dropouts and provide adequate power.

### 7. Data Management

* **Telemetry:** JSONL event logs + zarr tensor storage.
* **Audio/screen captures:** stored on a separate encrypted volume; opt‑out available.
* **Anonymisation:** participant IDs (P01–P24); redact file paths and proprietary code.
* **Publication:** anonymised artefacts (Run ID bundles, telemetry, survey data) published on OSF upon paper acceptance.

### 8. Ethics and Risk

* **Approval:** Northumbria University Ethics Protocol v1.3 (Interpretability Studies).
* **Risk level:** minimal; participants can opt out at any time; no deception involved.
* **Compensation:** £25 Amazon voucher per participant.

### 9. Exclusion Criteria

**Pre‑registered:**

- < 2 years of professional programming experience
- No Python proficiency (self‑reported < 4/7)
- Previous participation in the pilot study (n=3)
- Incomplete task completion (<50% of tasks)

### 10. Timeline

* **Pilot study (n=3):** Week 7 of implementation (threshold tuning).
* **Pre‑registration submission:** end of Week 7 (before the main study).
* **Main study (n=18–24):** Weeks 8–10.
* **Analysis & write‑up:** Weeks 11–16.
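A sketch of the §5 quantitative analysis in Python (statsmodels). Column names (`tests_passed`, `condition`, `task`, `participant`) are illustrative, since the spec defines telemetry fields but not the analysis frame; note that `MixedLM` supports a single grouping factor, so a fully crossed random intercept for task would be fitted in lme4 instead.

```python
# Mixed-effects sketch: condition x task fixed effects, random intercept
# per participant. Illustrative column names; not a prescribed pipeline.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study/outcomes.csv")   # one row per participant x task

model = smf.mixedlm(
    "tests_passed ~ condition * task",   # fixed effects
    data=df,
    groups=df["participant"],            # random intercept per participant
)
result = model.fit()
print(result.summary())

# lme4 equivalent with crossed intercepts:
#   tests_passed ~ condition * task + (1 | participant) + (1 | task)
```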
---

## Appendix E – Pilot Pack

### E1. Task T1 – Code Completion

**Prompt:** "Write a Python function `sanitize_sql_like(pattern: str)` that escapes SQL LIKE wildcards (%, _) and backslashes."

**Ground Truth Outline:**

```python
def sanitize_sql_like(pattern: str) -> str:
    # Escape the escape character first so later replacements are not doubled.
    pattern = pattern.replace("\\", "\\\\")
    pattern = pattern.replace("%", "\\%")
    pattern = pattern.replace("_", "\\_")
    return pattern
```

**Unit Tests (`tests/test_sanitize.py`):**

```python
from main import sanitize_sql_like

def test_escape_percent():
    assert sanitize_sql_like("100%") == "100\\%"

def test_escape_underscore():
    assert sanitize_sql_like("user_name") == "user\\_name"

def test_double_escape():
    assert sanitize_sql_like("C:\\path%") == "C:\\\\path\\%"
```

### E2. Task T2 – Bug Fix (Localisation)

**Prompt:** "This function should reverse a string recursively. Find and fix the bug."

```python
def reverse_string(s: str) -> str:
    if len(s) == 1:
        return s
    return s[0] + reverse_string(s[1:])
```

**Expected fix:** `return reverse_string(s[1:]) + s[0]`, with the base case broadened to `if len(s) <= 1` so the empty‑string test terminates.

**Unit Tests (`tests/test_reverse.py`):**

```python
from main import reverse_string

def test_simple():
    assert reverse_string("abc") == "cba"

def test_empty():
    assert reverse_string("") == ""
```

### E3. Mini‑Survey Items (Per Task)

**7‑point Likert scale (1 = Strongly Disagree, 7 = Strongly Agree):**

1. I could explain why the model produced this output.
2. I trusted the model's output appropriately.
3. My workload was high for this task.
4. The visualisations were useful for this task.
5. My confidence was well‑calibrated to the code's correctness.

### E4. Pilot Checklist

- [ ] Latency < 300 ms mean for ≤512 tokens.
- [ ] Entropy threshold τ_H tuned (~1.5 nats).
- [ ] Δlog‑prob threshold τ_Δ tuned (~0.1).
- [ ] Unit‑test pass/fail outcomes recorded correctly.
- [ ] Survey completion rate ≥ 90%.
- [ ] Qualitative feedback indicates the visualisations are understandable.

### E5. Output Artefacts

**Per participant:**

- `run_pack_P01.zip` → Run ID, tensors (zarr), logs (JSONL), test results, survey responses.
- Imported into OSF for the data‑availability statement.

**Aggregate:**

- `pilot_summary.csv` → metrics, thresholds, latency stats.
- `pilot_feedback.md` → qualitative themes, suggested improvements.

---

## References

- **Abnar, S., & Zuidema, W. (2020).** Quantifying Attention Flow in Transformers. *ACL*.
- **Jain, S., & Wallace, B. C. (2019).** Attention is not Explanation. *NAACL*.
- **Kou, Z., et al. (2024).** Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code? *FSE*.
- **Paltenghi, M., et al. (2022).** Follow‑up Attention: An Empirical Study of Developer and Neural Model Code Exploration. *arXiv*.
- **Zheng, H., et al. (2025).** Attention Heads of Large Language Models: A Survey. *arXiv*.
- **Zhao, H., et al. (2024).** Explainability for Large Language Models: A Survey. *ACM Transactions on Intelligent Systems and Technology*.
- **Braun, V., & Clarke, V. (2021).** Thematic Analysis: A Practical Guide. *SAGE Publications*.
- **Wang, K., et al. (2022).** Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT‑2 small. *arXiv*.

---

## Document History

| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | 2025-11-01 | Initial specification document | Gary Boon |

---

**End of Specification Document**