# Glass‑Box Dashboard: Spec for 4 Visualisations (Attention • Token Size • Ablation • Pipeline)

*Alpha scope targeting Code Llama 7B; MoE routing optional. Designed to support ICML Paper 1 and RQ1.*

**Version:** 1.0
**Date:** 2025-11-01
**Author:** Gary Boon, Northumbria University
**Status:** Implementation-ready specification

---

## 0) Shared principles & constraints

* **Determinism for study:** fix `seed`, decoding params, and the checkpoint hash; log all knobs.
* **Latency budget:** initial render < 250 ms for ≤512 tokens; interactive updates < 150 ms. Use lazy tensors + downsampling.
* **Reproducibility:** every view binds to a **Run ID**; each action produces a **Replay Script** (YAML) to re‑execute generation/ablations.
* **Privacy:** no proprietary code unless whitelisted; redact file paths; opt‑out for audio/screen capture.
* **Colour semantics:** one consistent palette; uncertainty → desaturated; stronger evidence → higher opacity; avoid misleading rainbow palettes.

### Core model instrumentation (PyTorch/transformers hooks)

* Capture per‑step: logits, logprobs, entropy; attention tensors `A[L,H,T,T]`; residual norms `||x_l||`; FFN activations (optional SAE features); KV‑cache hits; time per layer.
* Store as memmap/`zarr` with chunking `(layer, head)` to keep interaction snappy.

### Minimal data contract (per token `t_i`)

```json
{
  "id": 37,
  "text": "get_user",
  "bpe": ["get", "_", "user"],
  "byte_len": 8,
  "pos": 37,
  "logprob": -0.22,
  "entropy": 1.08,
  "topk": [{"tok": "(", "p": 0.21}, {"tok": "_", "p": 0.18}, {"tok": ".", "p": 0.12}],
  "attn_in": {"layer": L, "head": H, "top_sources": [[pos, weight], ...]},
  "residual_norm": 3.7,
  "time_ms": 1.8
}
```

---

## 1) Attention Visualisation *(descriptive; hypotheses validated via ablation)*

**Purpose (RQ1):** Make cross‑token influence legible; expose head roles; support causal what‑ifs.

### Primary view

* **Token‑to‑token heatmap** (rows = generated tokens, cols = prompt + context), aggregated or per‑head. Hover a token → highlight its top‑k sources; tooltips show exact weights and source spans.
* **Head grid** (Layer × Head matrix): mini‑sparklines per head showing mean attention to token classes (delimiters, identifiers, comments). Click → overlay that head on the main heatmap.
* **Rollout/flow toggle:** attention rollout (Abnar & Zuidema, 2020) vs raw attention.

### Interactions

* **Brush a source span** in the context → show the downstream tokens most impacted (opacity ∝ weight).
* **Compare decode steps:** scrub the generation timeline; diff two steps to see how sources shift.
* **Evidence pinning:** pin a (source → target) pair to the **Ablation** pane.
* **Recency‑bias flag:** highlight cases where >70% of the attention mass concentrates on the last 5 tokens.

### Algorithms & performance

* Precompute per‑token top‑k sources (k=8). Downsample long contexts with landmark tokens (newlines, punctuation, identifiers). Render the heatmap on a WebGL canvas.

### Validity checks

* Warn if the softmax temperature is >1.2 or top‑k sampling is active (attention‑interpretability caveat). Display the effective context length.

**Note:** Attention visualisation is **descriptive**; causal claims require validation via ablation (Section 3).
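The rollout toggled above can be computed directly from the captured `A[L,H,T,T]` tensor (§0). The following is a non‑normative sketch: the function name is a placeholder, and the ½(A + I) residual correction follows Abnar & Zuidema (2020).

```python
# Minimal rollout sketch, assuming the captured tensor layout A[L, H, T, T]
# (per-layer, per-head post-softmax attention). Names are illustrative.
import torch

def attention_rollout(attn: torch.Tensor) -> torch.Tensor:
    """Compose head-averaged attention across layers (Abnar & Zuidema, 2020).

    attn: [L, H, T, T] post-softmax weights. Returns a [T, T] matrix whose
    entry [t, s] estimates the attention mass flowing from source s to token t.
    """
    a = attn.mean(dim=1)                          # average heads -> [L, T, T]
    eye = torch.eye(a.size(-1), device=a.device)
    a = 0.5 * a + 0.5 * eye                       # account for residual streams
    a = a / a.sum(dim=-1, keepdim=True)           # re-normalise rows
    rollout = a[0]
    for layer in a[1:]:                           # compose later layers on top
        rollout = layer @ rollout
    return rollout

# Per-token top-k sources (k=8), as precomputed for the heatmap:
# values, positions = attention_rollout(attn)[t].topk(8)
```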
---

## 2) Token Size & Confidence Visualisation

**Purpose:** Reveal how tokenisation granularity (BPE/SentencePiece) interacts with model uncertainty to signal risk during code generation.

### Primary view (Token Bar)

* Sequence rendered as **chips**; **width** = byte length (or BPE merge depth), **opacity** = confidence (`exp(logprob)`, or a normalised 1 − H/H_max so the value stays in [0, 1]).
* **Top‑k alternatives** on click (with probabilities) and the **source attention snippet** that justified each alternative.
* **Risk hotspot flags:** identifiers split into **≥3 subwords** *and* coinciding with local **entropy peaks**.

### Secondary widgets

* **Entropy sparkline** with peaks labelled; toggle to show **calibrated** thresholds per token class (keywords/identifiers/operators may differ).
* **Cost/latency estimator:** cumulative decoding time and estimated API cost (if remote).

### Interactions

* Click a token → show its tokenisation, entropy, and top‑k; add it as a constraint to **Ablation** (force/ban token); jump to its **Attention** sources.
* Range‑select tokens → aggregate uncertainty and show correlated attention dispersion.

### Metrics & study hooks

* **Bug‑risk AUC** for hotspot flags vs actual error locations.
* **Correlation:** token entropy vs unit‑test failure spans; pre‑registered threshold (e.g., entropy ≥ 1.5 nats).

---

## 3) Ablation Visualisation

**Purpose (causal):** Show what changes when we disable parts of the architecture or constrain outputs.

### Scope constraints (for interactivity)

* Expose only the **top‑k heads** (e.g., k=20) ranked by rollout/gradient contribution.
* Allow **layer bypass** for ≤2 layers simultaneously.
* Optional **FFN gate clamp** for a single layer.
* Use a **surrogate regressor** to predict Δlog‑prob before running heavy re‑decodes; queue background executions.

### Controls

* **Head toggles:** Layer × Head matrix with checkboxes (mask to uniform or zero).
* **Layer bypass** and **token constraints** (ban/force).
* **Decoding locks:** temperature/top‑p pinned to the baseline.

### Outputs

* **Unified diff** between the baseline and ablated generations.
* **Code‑aware metrics:** unit tests passed, **AST parse success**, static‑analysis warnings (ruff/bandit), and **Δlog‑prob** over altered spans.
* **Per‑token delta heat:** Δlogprob/Δentropy; small multiples for the most impactful heads.

### Attribution ground truth (for study)

A source token is influential for a generated token if (i) it lies in the top‑k rollout sources **and** (ii) masking the minimal set of heads that carry that source yields Δlog‑prob ≥ τ (e.g., 0.1) or flips a unit‑test outcome.
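One concrete route to the head masking above is to zero a head's slice of the concatenated attention output before the output projection. The sketch below assumes the Hugging Face `LlamaForCausalLM` module layout (`model.model.layers[l].self_attn.o_proj`); the helper name and hook point are illustrative, not part of this spec.

```python
# Minimal head-ablation sketch for a Llama-style checkpoint via forward
# pre-hooks on the attention output projection. Adjust module paths for
# other architectures.
import torch

def mask_heads(model, heads, head_dim):
    """Zero the contribution of (layer, head) pairs before o_proj.

    heads: iterable of (layer_idx, head_idx) pairs. Returns hook handles;
    call .remove() on each handle to restore the baseline model.
    """
    by_layer = {}
    for l, h in heads:
        by_layer.setdefault(l, []).append(h)
    handles = []
    for l, hs in by_layer.items():
        def pre_hook(module, args, hs=hs):
            x = args[0].clone()                      # [B, T, H * head_dim]
            for h in hs:                             # zero this head's slice
                x[..., h * head_dim:(h + 1) * head_dim] = 0.0
            return (x,)
        proj = model.model.layers[l].self_attn.o_proj
        handles.append(proj.register_forward_pre_hook(pre_hook))
    return handles

# Δlog-prob (see §10): decode the same prompt with and without the mask and
# compare log p(tok) at the target position.
```

Masking at the `o_proj` input leaves the softmax untouched, so the surviving heads' weights are not renormalised; the "mask to uniform" option would instead require patching the attention weights themselves.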
---

## 4) Pipeline Visualisation

**Purpose:** Expose the model pipeline and attribute latency/uncertainty across stages using **interpretable layer‑level signals**, not raw neuron heatmaps.

### Primary view (Swimlane/Timeline)

* Lanes: **Tokeniser → Embeddings → Layers (block‑stack) → Logits → Sampler → Post‑proc/Tests**.
* For each generated token: rectangles whose **length** reflects time per stage; colour intensity = uncertainty (entropy). Hover → per‑stage stats.

### Layer‑level signals (per token or averaged)

* **Residual‑norm z‑scores** across layers (outlier spikes flagged).
* **Entropy shift** from pre‑ to post‑layer logits.
* **Attention‑flow saturation** (% of attention mass concentrated on the top‑m positions).
* **Router load** if MoE: expert IDs + gate weights and load imbalance.

### Interactions

* Click a token → cross‑highlight it in **Attention** and **Token Size & Confidence**.
* **Layer bypass** (≤2 at a time) to test where decisions crystallise; show the predicted impact first, then execute the queued ablation.

### Operational definitions

* **Bottleneck** = top‑q percentile of per‑layer latency or residual‑norm spikes; correlate with entropy jumps at the sampler.

---

## 5) Study mapping (tasks ↔ visualisations ↔ hypotheses)

* **T1 Code completion (5–15 LOC):** Attention helps source‑of‑truth tracing; Token Size flags risky fragments; Ablation confirms causal roles; Pipeline shows latency/entropy spikes.
* **T2 Bug fix from failing tests:** use Attention to localise misleading context and Ablation to test head responsibility; outcomes are pass rate and time‑to‑fix.
* **T3 API usage w/ docs:** Token Size shows odd fragmentations of identifiers; Attention confirms copying from docs; Pipeline surfaces sampler uncertainty.

### Measures

* Primary: tests passed, time‑to‑pass, number of ablations invoked, SCS causability score, trust calibration (Brier score).
* Secondary: SUS for the dashboard, NASA‑TLX, qualitative themes.

---

## 6) Telemetry & schema

### Event types

* `run.start|end`, `token.emit`, `viz.attention.hover`, `viz.token_size.click`, `ablation.run`, `pipeline.hover`, `test.run`.

### Minimal log rows

```json
{"event":"token.emit","run":"R2025-10-30-1342","i":37,"tok":"get_user","lp":-0.22,"H":1.08,"time_ms":1.8}
{"event":"ablation.run","mask":[[12,3],[18,7]],"delta":{"tests":-2,"edit_dist":17}}
```

### Storage

* Session JSONL + tensor store (zarr). Export a bundle (Run ID, code, tensors, ablation scripts) for reproducibility.

---

## 7) Implementation plan (8‑week alpha)

* **Weeks 1–2 – Instrumentation:** hooks for attention/residuals; tokenizer stats; per‑stage timing; zarr writer; minimal API. Add rollout and head ranking.
* **Week 3 – Attention view:** heatmap (WebGL), head grid, rollout; cross‑links; disclaimer that attention is descriptive.
* **Week 4 – Token Size & Confidence view:** chip bar, entropy sparkline, hotspot flags, top‑k.
* **Week 5 – Ablation view:** mask top‑k heads/layers; surrogate predictor; diff viewer; code‑aware metrics.
* **Week 6 – Pipeline view:** swimlane with residual‑z, entropy shift, saturation, latency; layer bypass (≤2).
* **Week 7 – Pilot study (n=3):** tune thresholds (entropy τ_H, Δlog‑prob τ_Δ); validate latency; add warnings/tooltips.
* **Week 8 – Main study tooling:** surveys, Latin square, OSF pre‑reg package, exportable artefact bundle.

---

## 8) Validity, pre‑registration & reproducibility

* **Validity note:** Attention visualisation is **descriptive**; causal claims are only made when confirmed via **ablation deltas**.
* **Pre‑registration (OSF):** include the task pool, counterbalancing, metrics (AUC/Δlog‑prob/tests), exclusion criteria, mixed‑effects analysis, MDES.
* **Reproducibility:** pin seed/checkpoint; publish tensors + telemetry (JSONL + zarr) and replay scripts; anonymise.

---

## 9) Study hypotheses (pre‑reg friendly)

* **H1‑Attn:** Attention + rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8).
* **H2‑Tok:** Entropy × token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis.
* **H3‑Abl:** The ablation tool reduces iterations to a passing solution by ≥20%.
* **H4‑Pipe:** Pipeline summaries improve next‑token prediction and error‑localisation accuracy.

---

## 10) Measurement appendix (formulas)

* **Entropy:** H = −∑_i p_i log p_i (nats). Threshold τ_H pre‑registered.
* **Residual‑norm z:** z_l = (||x_l|| − μ_l)/σ_l, with μ_l, σ_l estimated on a pilot corpus.
* **Attention rollout:** Ā_l = ½(A_l + I), rows renormalised; A_roll = Ā_L · Ā_{L−1} ⋯ Ā_1 (Abnar & Zuidema, 2020).
* **Attribution Δ:** Δ = log p_baseline(tok) − log p_ablated(tok); influential if Δ ≥ τ_Δ.
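A minimal numpy transcription of these quantities (names illustrative; the rollout itself is sketched in Section 1):

```python
# Sketch of the §10 measurement formulas; array names are illustrative.
import numpy as np

def entropy(p: np.ndarray) -> float:
    """H = -sum_i p_i log p_i in nats, for a next-token distribution p."""
    p = p[p > 0]                      # avoid log(0)
    return float(-(p * np.log(p)).sum())

def residual_z(norms: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """z_l = (||x_l|| - mu_l) / sigma_l, with mu/sigma from the pilot corpus."""
    return (norms - mu) / sigma

def attribution_delta(logp_baseline: float, logp_ablated: float) -> float:
    """Delta = log p_baseline(tok) - log p_ablated(tok); influential if >= tau."""
    return logp_baseline - logp_ablated

TAU_H, TAU_DELTA = 1.5, 0.1           # initial thresholds from Appendix B
```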
---

## 11) Power & design guardrails

* Within‑subjects, Latin square; difficulty buckets; record order, LLM familiarity, and years of experience.
* Plan for a **medium effect** (d ≈ 0.5): target n = 18–24; if n ≤ 12, emphasise large effects + rich qualitative analysis.

---

## Appendix A – Summary Table

| Visualisation | Opaque Mechanism | Interpretable Representation | Decision Signal (dev‑relevant) | Causal Check |
|---------------|------------------|------------------------------|--------------------------------|--------------|
| **Attention** | Multi‑head self‑attention | Token→token rollout heatmaps + head‑role grid | Which context spans steer each generated token; recency vs long‑range use | Verify via head‑mask ablations |
| **Token Size & Confidence** | Softmax over vocab + BPE splits | Token chips: width = bytes, opacity = confidence, entropy sparkline, top‑k | Low‑confidence identifiers/API calls; multi‑split identifiers as risk | Check error rate vs entropy peaks; ablate to flip token |
| **Ablation** | Component causality (heads/layers/FFN) | Toggle masks + unified diff + Δtests/Δlog‑prob | Identify critical vs redundant components; localise bug sources | Intrinsically causal by design |
| **Pipeline** | Layerwise transformation | Layer timeline: residual‑norm z, entropy shift, latency, (router load) | Where decisions "crystallise"; where errors emerge | Cross‑check with layer‑bypass deltas |

---

## Appendix B – Operational Thresholds

| Parameter | Symbol | Value (initial) | Tuning Method |
|-----------|--------|-----------------|---------------|
| Entropy threshold | τ_H | 1.5 nats | Pilot study (n=3); calibrate to ~90% specificity |
| Log‑prob delta | τ_Δ | 0.1 | Ablation sensitivity; adjust for model scale |
| Residual‑norm outlier | τ_z | 2.0 σ | Corpus statistics from 100 samples |
| Recency‑bias threshold | – | 70% | Heuristic starting point; flag if >70% attention on last 5 tokens |
| Top‑k heads | k | 20 | Performance constraint; expand if latency permits |

---

## Appendix C – Technical Dependencies

### Backend (Python)

- PyTorch ≥ 2.0
- transformers ≥ 4.30
- zarr ≥ 2.14
- numpy, scipy
- fastapi, uvicorn

### Frontend (Next.js)

- React ≥ 18
- D3.js or Plotly for visualisations
- WebGL for attention heatmaps
- TailwindCSS for styling

### Storage

- Zarr arrays for tensors (chunked by layer, head)
- JSONL for telemetry
- YAML for replay scripts
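To make §0's "minimal API" concrete against these dependencies, the sketch below serves one attention row from the chunked zarr store. The route, store path, and array layout are assumptions for illustration only.

```python
# Illustrative read endpoint for the minimal API (§0), assuming a per-run
# zarr array attn[L, H, T, T] chunked by (layer, head). Paths and route
# names are placeholders, not part of this spec.
import numpy as np
import zarr
from fastapi import FastAPI

app = FastAPI()

@app.get("/runs/{run_id}/attention")
def attention_slice(run_id: str, layer: int, head: int, token: int, k: int = 8):
    """Return the top-k source positions for one (layer, head, target token).

    Reads a single (layer, head) chunk, keeping latency inside the
    interaction budget.
    """
    store = zarr.open(f"runs/{run_id}/attn.zarr", mode="r")
    row = np.asarray(store[layer, head, token, :])   # one attention row
    top = np.argsort(row)[::-1][:k]                  # top-k sources by weight
    return {"run": run_id,
            "top_sources": [[int(i), float(row[i])] for i in top]}
```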
---

## Appendix D – OSF Pre‑Registration Template (Ready to Copy)

**Title:** Making Transformer Architecture Transparent for Code Generation: A Developer‑Centric Study of Attention, Token Size & Confidence, Ablation, and Pipeline Visualisations

**Principal Investigator:** Gary Boon (Northumbria University)
**Planned Registration Type:** Pre‑Registration (Confirmatory)

### 1. Research Questions and Hypotheses

**RQ1:** How can we transform opaque architectural mechanisms into interpretable visual representations that reveal how LLMs make code‑generation decisions?

**Sub‑Hypotheses:**

- **H1‑Attn:** Attention + rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8).
- **H2‑Tok:** Entropy × token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis.
- **H3‑Abl:** The ablation tool reduces iterations to a passing solution by ≥20%.
- **H4‑Pipe:** Pipeline summaries improve next‑token prediction and error‑localisation accuracy.

### 2. Design

* **Design Type:** Within‑subjects, Latin‑square counterbalanced.
* **Conditions:** Baseline (code inspection only) vs Glass‑Box Dashboard (all four visualisations).
* **Participants:** n = 18–24 software engineers (2–10 years of experience).
* **Tasks:** T1 code completion (5–15 LOC), T2 bug fixing from failing tests, T3 API usage with documentation.
* **Covariates:** LLM familiarity (1–7 scale), order (A→B vs B→A), programming‑language proficiency, years of experience.

### 3. Materials and Stimuli

* **Model:** Code Llama 7B FP16 (specific checkpoint hash recorded).
* **Visualisations:** Attention (heatmap + head grid), Token Size & Confidence (chip bar + entropy sparkline), Ablation (toggle masks + diff), Pipeline (swimlane timeline).
* **Unit‑test harness:** pytest with pre‑written test suites.
* **AST/lint tools:** Python `ast` module, ruff, and bandit for static analysis.

### 4. Procedure

1. **Consent + pre‑survey** (10 min): demographics, LLM use frequency, programming experience.
2. **Dashboard tutorial** (15 min): guided walkthrough of each visualisation with an example.
3. **Task blocks** (40 min): counterbalanced order (Latin square); 2–3 tasks per condition.
4. **Post‑task mini‑survey** (5 min): SCS (System Causability Scale), trust scale, NASA‑TLX.
5. **Semi‑structured interview** (15 min): qualitative feedback on visualisations and workflow integration.
6. **Final SUS** (5 min): System Usability Scale for the dashboard.

**Total time:** ~90 minutes per participant.

### 5. Planned Analyses

**Quantitative:**

- **Mixed‑effects models:** condition × task fixed effects + random intercepts for participant/task.
- **Metrics:** Δlog‑prob (ablation impact), tests passed, time‑to‑fix, AUC of the Entropy × Token‑Size hotspot predictor, OR for H1 (source‑identification accuracy).
- **Software:** R (lme4) or Python (statsmodels).

**Qualitative:**

- **Thematic analysis:** Braun & Clarke (2021) six‑phase approach.
- **Coding:** two researchers independently code transcripts; disagreements resolved by discussion.
- **Anticipated themes:** mental‑model formation, trust calibration, workflow integration, visualisation utility.

### 6. Power Analysis

* **Effect‑size target:** d = 0.5 (medium effect, Cohen's conventions).
* **α = 0.05, power = 0.8** → n ≈ 21 paired observations (within‑subjects).
* **Planned n = 18–24** to account for dropouts and provide adequate power.

### 7. Data Management

* **Telemetry:** JSONL event logs + zarr tensor storage.
* **Audio/screen captures:** stored on a separate encrypted volume; opt‑out available.
* **Anonymisation:** participant IDs (P01–P24); redact file paths and proprietary code.
* **Publication:** anonymised artefacts (Run ID bundles, telemetry, survey data) published on OSF upon paper acceptance.

### 8. Ethics and Risk

* **Approval:** Northumbria University Ethics Protocol v1.3 (Interpretability Studies).
* **Risk level:** minimal; participants can opt out at any time; no deception involved.
* **Compensation:** £25 Amazon voucher per participant.

### 9. Exclusion Criteria

**Pre‑registered:**

- < 2 years of professional programming experience
- No Python proficiency (self‑reported < 4/7)
- Previous participation in the pilot study (n=3)
- Incomplete task completion (<50% of tasks)

### 10. Timeline

* **Pilot study (n=3):** Week 7 of implementation (threshold tuning).
* **Pre‑registration submission:** end of Week 7 (before the main study).
* **Main study (n=18–24):** Weeks 8–10.
* **Analysis & write‑up:** Weeks 11–16.
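A sketch of the §5 quantitative analysis in Python (statsmodels). Column names (`tests_passed`, `condition`, `task`, `participant`) are illustrative, since the spec defines telemetry fields but not the analysis frame; note that `MixedLM` supports a single grouping factor, so a fully crossed random intercept for task would be fitted in lme4 instead.

```python
# Mixed-effects sketch: condition x task fixed effects, random intercept
# per participant. Illustrative column names; not a prescribed pipeline.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study/outcomes.csv")   # one row per participant x task

model = smf.mixedlm(
    "tests_passed ~ condition * task",   # fixed effects
    data=df,
    groups=df["participant"],            # random intercept per participant
)
result = model.fit()
print(result.summary())

# lme4 equivalent with crossed intercepts:
#   tests_passed ~ condition * task + (1 | participant) + (1 | task)
```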
---

## Appendix E – Pilot Pack

### E1. Task T1 – Code Completion

**Prompt:** "Write a Python function `sanitize_sql_like(pattern: str)` that escapes SQL LIKE wildcards (%, _) and backslashes."

**Ground Truth Outline:**

```python
def sanitize_sql_like(pattern: str) -> str:
    # Escape the escape character first so later replacements are not doubled.
    pattern = pattern.replace("\\", "\\\\")
    pattern = pattern.replace("%", "\\%")
    pattern = pattern.replace("_", "\\_")
    return pattern
```

**Unit Tests (`tests/test_sanitize.py`):**

```python
from main import sanitize_sql_like

def test_escape_percent():
    assert sanitize_sql_like("100%") == "100\\%"

def test_escape_underscore():
    assert sanitize_sql_like("user_name") == "user\\_name"

def test_double_escape():
    assert sanitize_sql_like("C:\\path%") == "C:\\\\path\\%"
```

### E2. Task T2 – Bug Fix (Localisation)

**Prompt:** "This function should reverse a string recursively. Find and fix the bug."

```python
def reverse_string(s: str) -> str:
    if len(s) == 1:
        return s
    return s[0] + reverse_string(s[1:])
```

**Expected fix:** `return reverse_string(s[1:]) + s[0]`, with the base case broadened to `if len(s) <= 1` so the empty‑string test terminates.

**Unit Tests (`tests/test_reverse.py`):**

```python
from main import reverse_string

def test_simple():
    assert reverse_string("abc") == "cba"

def test_empty():
    assert reverse_string("") == ""
```

### E3. Mini‑Survey Items (Per Task)

**7‑point Likert scale (1 = Strongly Disagree, 7 = Strongly Agree):**

1. I could explain why the model produced this output.
2. I trusted the model's output appropriately.
3. My workload was high for this task.
4. The visualisations were useful for this task.
5. My confidence was well‑calibrated to the code's correctness.

### E4. Pilot Checklist

- [ ] Latency < 300 ms mean for ≤512 tokens.
- [ ] Entropy threshold τ_H tuned (~1.5 nats).
- [ ] Δlog‑prob threshold τ_Δ tuned (~0.1).
- [ ] Unit‑test pass/fail outcomes recorded correctly.
- [ ] Survey completion rate ≥ 90%.
- [ ] Qualitative feedback indicates the visualisations are understandable.

### E5. Output Artefacts

**Per participant:**

- `run_pack_P01.zip` → Run ID, tensors (zarr), logs (JSONL), test results, survey responses.
- Imported into OSF for the data‑availability statement.

**Aggregate:**

- `pilot_summary.csv` → metrics, thresholds, latency stats.
- `pilot_feedback.md` → qualitative themes, suggested improvements.

---

## References

- **Abnar, S., & Zuidema, W. (2020).** Quantifying Attention Flow in Transformers. *ACL*.
- **Jain, S., & Wallace, B. C. (2019).** Attention is not Explanation. *NAACL*.
- **Kou, Z., et al. (2024).** Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code? *FSE*.
- **Paltenghi, M., et al. (2022).** Follow‑up Attention: An Empirical Study of Developer and Neural Model Code Exploration. *arXiv*.
- **Zheng, H., et al. (2025).** Attention Heads of Large Language Models: A Survey. *arXiv*.
- **Zhao, H., et al. (2024).** Explainability for Large Language Models: A Survey. *ACM Transactions on Intelligent Systems and Technology*.
- **Braun, V., & Clarke, V. (2021).** Thematic Analysis: A Practical Guide. *SAGE Publications*.
- **Wang, K., et al. (2022).** Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT‑2 small. *arXiv*.

---

## Document History

| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | 2025-11-01 | Initial specification document | Gary Boon |

---

**End of Specification Document**