| # Glass‑Box Dashboard: Spec for 4 Visualisations (Attention • Token Size • Ablation • Pipeline) | |
| *Alpha scope targeting Code Llama 7B; MoE routing optional. Designed to support ICML Paper 1 and RQ1.* | |
| **Version:** 1.0 | |
| **Date:** 2025-11-01 | |
| **Author:** Gary Boon, Northumbria University | |
| **Status:** Implementation-ready specification | |
| --- | |
| ## 0) Shared principles & constraints | |
| * **Determinism for study:** fix `seed`, decoding params, checkpoint hash; log all knobs. | |
| * **Latency budget:** initial render < 250 ms for ≤512 tokens; interactive updates < 150 ms. Use lazy tensors + downsampling. | |
| * **Reproducibility:** every view binds to a **Run ID**; each action produces a **Replay Script** (YAML) to re‑execute generation/ablations. | |
| * **Privacy:** no proprietary code unless whitelisted; redact file paths; opt‑out for audio/screen capture. | |
| * **Colour semantics:** one consistent palette; uncertainty → desaturated; stronger evidence → higher opacity; avoid misleading rainbows. | |
| ### Core model instrumentation (PyTorch/transformers hooks) | |
| * Capture per‑step: logits, logprobs, entropy; attention tensors `A[L,H,T,T]`; residual norms `||x_l||`; FFN activations (optional SAE features); KV‑cache hits; time per layer. | |
| * Store as memmap/`zarr` with chunking `(layer, head)` to keep interaction snappy. | |
| ### Minimal data contract (per token `t_i`) | |
```json
{
  "id": 37,
  "text": "get_user",
  "bpe": ["get", "_", "user"],
  "byte_len": 8,
  "pos": 37,
  "logprob": -0.22,
  "entropy": 1.08,
  "topk": [{"tok":"(","p":0.21}, {"tok":"_","p":0.18}, {"tok":".","p":0.12}],
  "attn_in": {"layer": L, "head": H, "top_sources": [[pos, weight], ...]},
  "residual_norm": 3.7,
  "time_ms": 1.8
}
```
| --- | |
| ## 1) Attention Visualisation *(descriptive; hypotheses validated via ablation)* | |
| **Purpose (RQ1):** Make cross‑token influence legible; expose head roles; support causal what‑ifs. | |
| ### Primary view | |
| * **Token‑to‑token heatmap** (rows = generated tokens, cols = prompt+context), aggregated or per‑head. Hover a token → highlight top‑k sources; tooltips show exact weights and source spans. | |
| * **Head grid** (Layer × Head matrix): mini‑sparklines per head showing mean attention to classes (delimiters, identifiers, comments). Click → overlays that head on main heatmap. | |
| * **Rollout/flow toggle:** attention rollout (Abnar & Zuidema, 2020) vs raw attention. | |
| ### Interactions | |
| * **Brush source span** in context → show downstream tokens most impacted (opacity ∝ weight). | |
| * **Compare decode steps:** scrub generation timeline; diff two steps to see shifting sources. | |
| * **Evidence pinning:** pin a pair (source→target) to the **Ablation** pane. | |
| * **Recency bias flag:** highlight generated tokens where >70% of the attention mass falls on the last 5 context tokens. | |
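The recency flag can be sketched as a simple mass ratio over one attention row. This is a minimal illustration, assuming the row is already head-aggregated and ordered by source position; the function names `recency_mass` and `has_recency_bias` are hypothetical, and the 5-token window and 70% threshold are the initial values quoted in this spec.

```python
def recency_mass(attn_row, window=5):
    """Fraction of one token's attention mass on the last `window` source positions."""
    total = sum(attn_row)
    if total <= 0:
        return 0.0
    return sum(attn_row[-window:]) / total


def has_recency_bias(attn_row, window=5, threshold=0.70):
    """True when more than `threshold` of the mass sits in the recency window."""
    return recency_mass(attn_row, window) > threshold


# Illustrative row: attention weights over 8 source positions, oldest first.
row = [0.02, 0.03, 0.05, 0.10, 0.10, 0.20, 0.20, 0.30]
flagged = has_recency_bias(row)
```

Dividing by the row total keeps the ratio meaningful even if the stored row was truncated or downsampled and no longer sums to 1.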
| ### Algorithms & performance | |
| * Precompute per‑token top‑k sources (k=8). Downsample long contexts with landmark tokens (newline, punctuation, identifiers). WebGL canvas for heat. | |
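As a rough sketch of the top‑k precompute, the helper below keeps the k heaviest sources for one generated token. It assumes a plain list of weights per row; `top_k_sources` is a hypothetical name, and a real backend would more likely do this on tensors in batch.

```python
import heapq


def top_k_sources(attn_row, k=8):
    """Return [(pos, weight), ...] for the k largest weights, heaviest first."""
    return heapq.nlargest(k, enumerate(attn_row), key=lambda pw: pw[1])


# Illustrative row over four source positions.
sources = top_k_sources([0.10, 0.50, 0.05, 0.35], k=2)
```

The `(pos, weight)` pairs match the shape of `top_sources` in the per-token data contract, so the result could be stored without further conversion.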
| ### Validity checks | |
| * Warn if softmax temperature >1.2 or top‑k sampling active (attention interpretability caveat). Display effective context length. | |
| **Note:** Attention visualisation is **descriptive**; causal claims require validation via ablation (Section 3). | |
| --- | |
| ## 2) Token Size & Confidence Visualisation | |
| **Purpose:** Reveal how tokenisation granularity (BPE/SentencePiece) interacts with model uncertainty to signal risk during code generation. | |
| ### Primary view (Token Bar) | |
| * Sequence rendered as **chips**; **width** = byte length (or BPE merge depth), **opacity** = confidence, taken as normalised `1 − H/ln V` (V = vocabulary size; raw entropy in nats can exceed 1) or `exp(logprob)`. | |
| * **Top‑k alternatives** on click (with probs) and the **source attention snippet** that justified each alternative. | |
| * **Risk hotspot flags:** identifiers split into **≥3 subwords** *and* local **entropy peaks**. | |
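One way to read the hotspot rule is sketched below. It assumes token records shaped like the data contract (`bpe`, `entropy`), treats a "local peak" as an entropy at least as high as both neighbours, and applies the initial 1.5‑nat threshold; the peak definition and the name `risk_hotspots` are assumptions for illustration, not part of the spec.

```python
def risk_hotspots(tokens, min_subwords=3, tau_h=1.5):
    """Indices of tokens split into >= min_subwords pieces that sit on an entropy peak."""
    hits = []
    for i, tok in enumerate(tokens):
        h = tok["entropy"]
        left = tokens[i - 1]["entropy"] if i > 0 else float("-inf")
        right = tokens[i + 1]["entropy"] if i + 1 < len(tokens) else float("-inf")
        # Assumed peak rule: above the threshold and not lower than either neighbour.
        is_peak = h >= tau_h and h >= left and h >= right
        if len(tok["bpe"]) >= min_subwords and is_peak:
            hits.append(i)
    return hits


# Illustrative records; values are made up for the example.
tokens = [
    {"text": "def", "bpe": ["def"], "entropy": 0.2},
    {"text": "get_user_by_id", "bpe": ["get", "_user", "_by", "_id"], "entropy": 1.9},
    {"text": "(", "bpe": ["("], "entropy": 0.4},
    {"text": "user_name", "bpe": ["user", "_", "name"], "entropy": 0.6},
]
hotspots = risk_hotspots(tokens)
```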
| ### Secondary widgets | |
| * **Entropy sparkline** with peaks labelled; toggle to show **calibrated** thresholds for code tokens (keywords/identifiers/operators may differ). | |
| * **Cost/latency estimator:** cumulative decoding time and estimated API‑cost (if remote). | |
| ### Interactions | |
| * Click token → show tokenisation, entropy, top‑k; add as constraint to **Ablation** (force/ban token); jump to **Attention** sources. | |
| * Range‑select tokens → aggregate uncertainty and show correlated attention dispersion. | |
| ### Metrics & study hooks | |
| * **Bug‑risk AUC** for hotspot flags vs actual error locations. | |
| * **Correlation**: token entropy vs unit‑test failure spans; pre‑reg threshold (e.g., entropy ≥ 1.5 nats). | |
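The bug‑risk AUC can be computed without extra dependencies as the probability that a randomly chosen error token scores higher than a randomly chosen clean token. The sketch below assumes one risk score and one binary error label per token; the name `auc` and the example values are illustrative.

```python
def auc(scores, labels):
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative hotspot scores and whether each token sat inside an actual error span.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
bug_risk_auc = auc(scores, labels)
```

The pairwise form is quadratic in the number of tokens, which is fine for study-sized samples; a sort-based version would be preferable for large corpora.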
| --- | |
| ## 3) Ablation Visualisation | |
| **Purpose (causal):** Show what changes when we disable parts of the architecture or constrain outputs. | |
| ### Scope constraints (for interactivity) | |
| * Expose only **top‑k heads** (e.g., k=20) ranked by rollout/gradient contribution. | |
| * Allow **layer bypass** for ≤2 layers simultaneously. | |
| * Optional **FFN gate clamp** for a single layer. | |
| * Use a **surrogate regressor** to predict Δlog‑prob before running heavy re‑decodes; queue background executions. | |
| ### Controls | |
| * **Head toggles**: Layer×Head matrix with checkboxes (mask to uniform/zero). | |
| * **Layer bypass** and **token constraints** (ban/force). | |
| * **Decoding locks**: temperature/top‑p pinned to baseline. | |
| ### Outputs | |
| * **Unified diff** between baseline and ablated generation. | |
| * **Code‑aware metrics:** unit tests passed, **AST parse success**, static‑analysis warnings (ruff/bandit), and **Δlog‑prob** over altered spans. | |
| * **Per‑token delta heat**: Δlogprob/Δentropy; small multiples for most‑impactful heads. | |
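Two of these outputs, the unified diff and the AST parse check, can be sketched with the standard library alone. The function names and the report layout below are hypothetical; unit tests, ruff and bandit would be invoked separately.

```python
import ast
import difflib


def parses(code):
    """True if the text is syntactically valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def compare_generations(baseline, ablated):
    """Unified diff plus parse status for a baseline/ablated pair."""
    diff = list(difflib.unified_diff(
        baseline.splitlines(), ablated.splitlines(),
        fromfile="baseline", tofile="ablated", lineterm=""))
    return {
        "diff": diff,
        "baseline_parses": parses(baseline),
        "ablated_parses": parses(ablated),
    }


# Illustrative pair: the ablated run truncates the return expression.
baseline = "def add(a, b):\n    return a + b\n"
ablated = "def add(a, b):\n    return a +\n"
report = compare_generations(baseline, ablated)
```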
| ### Attribution ground truth (for study) | |
| A source token is influential for a generated token if (i) it lies in the top‑k rollout sources **and** (ii) masking the minimal set of heads that carry that source lowers the token's log‑prob by Δ ≥ τ (e.g., 0.1) or flips a unit test outcome. | |
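The two-part criterion can be written as a small predicate. This is a sketch under the assumption that Δ is the baseline log‑prob minus the ablated log‑prob of the generated token; `is_influential` and its parameters are hypothetical names.

```python
def is_influential(source_pos, rollout_top_k, logp_baseline, logp_ablated,
                   tests_flipped=False, tau_delta=0.1):
    """Apply both conditions: top-k rollout membership and a causal effect."""
    in_top_k = source_pos in rollout_top_k
    delta = logp_baseline - logp_ablated  # positive when ablation hurts the token
    return in_top_k and (delta >= tau_delta or tests_flipped)


# Illustrative case: masking drops the token's log-prob from -0.22 to -0.95.
verdict = is_influential(12, {3, 12, 40}, -0.22, -0.95)
```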
| --- | |
| ## 4) Pipeline Visualisation | |
| **Purpose:** Expose model pipeline and attribution of latency/uncertainty across stages using **interpretable layer‑level signals**, not raw neuron heatmaps. | |
| ### Primary view (Swimlane/Timeline) | |
| * Lanes: **Tokeniser → Embeddings → Layers (block‑stack) → Logits → Sampler → Post‑proc/Tests**. | |
| * For each generated token: rectangles whose **length** reflects time per stage; colour intensity = uncertainty (entropy). Hover → per‑stage stats. | |
| ### Layer‑level signals (per token or averaged) | |
| * **Residual‑norm z‑scores** across layers (outlier spikes flagged). | |
| * **Entropy shift** from pre‑ to post‑layer logits. | |
| * **Attention‑flow saturation** (% of attention mass concentrated on top‑m positions). | |
| * **Router load** if MoE: expert IDs + gate weights and imbalance. | |
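Two of the layer-level signals reduce to a few lines each. The sketch below assumes per-layer means and standard deviations have already been estimated on a pilot corpus; all function names are hypothetical, and the 2.0σ cut-off is the initial value used in this spec.

```python
def attention_saturation(attn_row, m=3):
    """Share of attention mass held by the m heaviest source positions."""
    total = sum(attn_row)
    if total <= 0:
        return 0.0
    return sum(sorted(attn_row, reverse=True)[:m]) / total


def residual_z_scores(norms, mu, sigma):
    """Per-layer z-score of the residual norm against corpus statistics."""
    return [(x - m_l) / s_l for x, m_l, s_l in zip(norms, mu, sigma)]


def flag_outliers(z_scores, tau_z=2.0):
    """Layer indices whose |z| reaches the outlier threshold."""
    return [layer for layer, z in enumerate(z_scores) if abs(z) >= tau_z]


# Illustrative values for a two-layer slice.
saturation = attention_saturation([0.5, 0.2, 0.1, 0.1, 0.1], m=2)
z = residual_z_scores([3.7, 9.0], mu=[3.5, 4.0], sigma=[0.4, 2.0])
spikes = flag_outliers(z)
```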
| ### Interactions | |
| * Click a token → cross‑highlight in **Attention** and **Token Size & Confidence**. | |
| * **Layer bypass** (≤2 at a time) to test where decisions crystallise; show predicted impact first, then execute queued ablation. | |
| ### Operational definitions | |
| * **Bottleneck** = top‑q percentile of per‑layer latency or residual‑norm spikes; correlate with entropy jumps at the sampler. | |
| --- | |
| ## 5) Study mapping (tasks ↔ visualisations ↔ hypotheses) | |
| * **T1 Code completion (5–15 LOC):** Attention helps source‑of‑truth tracing; Token Size flags risky fragments; Ablation confirms causal role; Pipeline shows latency/entropy spikes. | |
| * **T2 Bug fix from failing tests:** Use Attention to localise misleading context; Ablation to test head responsibility; improved pass‑rate/time. | |
| * **T3 API usage w/ docs:** Token Size shows odd fragmentations of identifiers; Attention confirms copying from docs; Pipeline surfaces sampler uncertainty. | |
| ### Measures | |
| * Primary: tests passed, time‑to‑pass, number of ablations invoked, System Causability Scale (SCS) score, trust calibration (Brier score). | |
| * Secondary: SUS for dashboard, NASA‑TLX, qualitative themes. | |
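For the trust-calibration measure, a Brier score could be computed as below. The sketch assumes each participant states a probability that the generated code is correct and that the outcome is whether the unit tests passed; that mapping, and the name `brier_score`, are assumptions rather than something the spec fixes.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence (0..1) and outcome (0 or 1)."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


# Illustrative ratings for three tasks; lower scores mean better calibration.
score = brier_score([0.9, 0.2, 0.6], [1, 0, 0])
```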
| --- | |
| ## 6) Telemetry & schema | |
| ### Event types | |
| * `run.start|end`, `token.emit`, `viz.attention.hover`, `viz.token_size.click`, `ablation.run`, `pipeline.hover`, `test.run`. | |
| ### Minimal log rows | |
```json
{"event":"token.emit","run":"R2025-10-30-1342","i":37,"tok":"get_user","lp":-0.22,"H":1.08,"time_ms":1.8}
{"event":"ablation.run","mask":[[12,3],[18,7]],"delta":{"tests":-2,"edit_dist":17}}
```
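Rows like these can be appended and read back with the standard `json` module. The helpers below are a minimal sketch; `write_event` and `read_events` are hypothetical names, and an in-memory buffer stands in for the session file.

```python
import io
import json


def write_event(stream, event, **fields):
    """Append one telemetry row as a compact JSON line."""
    row = {"event": event, **fields}
    stream.write(json.dumps(row, ensure_ascii=False, separators=(",", ":")) + "\n")


def read_events(stream):
    """Parse a JSONL stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


buf = io.StringIO()  # stand-in for an open session log file
write_event(buf, "token.emit", run="R2025-10-30-1342", i=37, tok="get_user",
            lp=-0.22, H=1.08, time_ms=1.8)
write_event(buf, "ablation.run", mask=[[12, 3], [18, 7]],
            delta={"tests": -2, "edit_dist": 17})
buf.seek(0)
events = read_events(buf)
```

One row per line keeps the log appendable during a session and lets it be streamed without loading the whole file.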
| ### Storage | |
| * Session JSONL + tensor store (zarr). Export bundle (Run ID, code, tensors, ablation scripts) for reproducibility. | |
| --- | |
| ## 7) Implementation plan (8‑week alpha) | |
| * **Week 1–2 – Instrumentation**: hooks for attention/residuals; tokenizer stats; timing per stage; zarr writer; minimal API. Add rollout and head ranking. | |
| * **Week 3 – Attention view**: heatmap (WebGL), head grid, rollout; cross‑links; disclaimer that attention is descriptive. | |
| * **Week 4 – Token Size & Confidence view**: chip bar, entropy sparkline, hotspot flags, top‑k. | |
| * **Week 5 – Ablation view**: mask top‑k heads/layers; surrogate predictor; diff viewer; code‑aware metrics. | |
| * **Week 6 – Pipeline view**: swimlane with residual‑z, entropy shift, saturation, latency; layer bypass (≤2). | |
| * **Week 7 – Pilot study (n=3)**: tune thresholds (entropy τ, Δlog‑prob τ); validate latency; add warnings/tooltips. | |
| * **Week 8 – Main study tooling**: surveys, Latin‑square, OSF pre‑reg package, export artefact bundle. | |
| --- | |
| ## 8) Validity, pre‑registration & reproducibility | |
| * **Validity note:** Attention visualisation is **descriptive**; causal claims are only made when confirmed via **ablation deltas**. | |
| * **Pre‑registration (OSF):** include task pool, counterbalancing, metrics (AUC/Δlog‑prob/tests), exclusion criteria, mixed‑effects analysis, MDES. | |
| * **Reproducibility:** pin seed/checkpoint; publish tensors + telemetry (JSONL + zarr) and replay scripts; anonymise. | |
| --- | |
| ## 9) Study hypotheses (pre‑reg friendly) | |
| * **H1‑Attn:** Attention+rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8). | |
| * **H2‑Tok:** Entropy×token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis. | |
| * **H3‑Abl:** Ablation tool reduces iterations to a passing solution by ≥20%. | |
| * **H4‑Pipe:** Pipeline summaries improve participants' accuracy at predicting the model's next token and at localising errors. | |
| --- | |
| ## 10) Measurement appendix (formulas) | |
| * **Entropy**: H = −∑_i p_i log p_i (nats). Threshold τ_H pre‑reg. | |
| * **Residual‑norm z**: z_l = (||x_l|| − μ_l)/σ_l over corpus pilot. | |
| * **Attention rollout**: Ã_l = ½(Ā_l + I), with Ā_l the head‑averaged attention at layer l; A_roll = Ã_L · Ã_{L−1} ⋯ Ã_1 (Abnar & Zuidema, 2020). | |
| * **Attribution Δ**: Δ = log p_baseline(tok) − log p_ablated(tok); influential if Δ ≥ τ_Δ. | |
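The formulas above can be checked on toy inputs with plain Python. The rollout here follows the residual-aware formulation of Abnar & Zuidema (2020), averaging each layer's head-averaged attention with the identity before multiplying across layers; treating that as the intended variant is an assumption, and all function names are illustrative.

```python
import math


def entropy_nats(probs):
    """Shannon entropy H = -sum p_i ln p_i, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def matmul(a, b):
    """Plain-Python product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layers):
    """Compose head-averaged T x T attention matrices, first layer first."""
    t = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(t)] for i in range(t)]
    for attn in layers:
        # Add the residual path, then renormalise: 0.5 * (A + I).
        with_residual = [[0.5 * (attn[i][j] + (1.0 if i == j else 0.0))
                          for j in range(t)] for i in range(t)]
        rollout = matmul(with_residual, rollout)
    return rollout


def attribution_delta(logp_baseline, logp_ablated):
    """Delta = log p_baseline(tok) - log p_ablated(tok)."""
    return logp_baseline - logp_ablated


# Toy causal attention over two positions, repeated for two layers.
layer = [[1.0, 0.0], [0.5, 0.5]]
rolled = attention_rollout([layer, layer])
```

Because each row of 0.5 * (A + I) still sums to 1, every row of the rollout matrix is a distribution over source positions and can be drawn on the same colour scale as raw attention.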
| --- | |
| ## 11) Power & design guardrails | |
| * Within‑subjects, Latin square; difficulty buckets; record order, LLM familiarity, years' experience. | |
| * A **medium effect** (d≈0.5) needs n≈34 for 80% power in a two‑sided paired test; the target n=18–24 is powered for d≈0.6–0.7; if n≤12, emphasise large effects + rich qualitative analysis. | |
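Counterbalancing with a Latin square can be sketched as a cyclic rotation of the task list. Note that this simple construction balances position only, not first-order carry-over; the function names and participant IDs are illustrative.

```python
def latin_square(conditions):
    """Cyclic Latin square: row r is the condition list rotated left by r."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(participants, conditions):
    """Give each participant one row of the square, cycling through the rows."""
    square = latin_square(conditions)
    return {p: square[i % len(square)] for i, p in enumerate(participants)}


orders = assign_orders(["P01", "P02", "P03", "P04"], ["T1", "T2", "T3"])
```

With three tasks, recruiting in multiples of three keeps every order equally represented.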
| --- | |
| ## Appendix A – Summary Table | |
| | Visualization | Opaque Mechanism | Interpretable Representation | Decision Signal (dev-relevant) | Causal Check | | |
| |--------------|------------------|----------------------------|--------------------------------|--------------| | |
| | **Attention** | Multi-head self-attention | Token→token rollout heatmaps + head-role grid | Which context spans steer each generated token; recency vs long-range use | Verify via head mask ablations | | |
| | **Token Size & Confidence** | Softmax over vocab + BPE splits | Token chips: width=bytes, opacity=confidence, entropy sparkline, top-k | Low-confidence identifiers/API calls; multi-split identifiers as risk | Check error rate vs entropy peaks; ablate to flip token | | |
| | **Ablation** | Component causality (heads/layers/FFN) | Toggle masks + unified diff + Δtests/Δlog-prob | Identify critical vs redundant components; localise bug sources | Intrinsic causal by design | | |
| | **Pipeline** | Layerwise transformation | Layer timeline: residual-norm z, entropy shift, latency, (router load) | Where decisions "crystallise"; where errors emerge | Cross-check with layer bypass deltas | | |
| --- | |
| ## Appendix B – Operational Thresholds | |
| | Parameter | Symbol | Value (Initial) | Tuning Method | | |
| |-----------|--------|----------------|---------------| | |
| | Entropy threshold | τ_H | 1.5 nats | Pilot study (n=3); calibrate to ~90% specificity | | |
| | Log-prob delta | τ_Δ | 0.1 | Ablation sensitivity; adjust for model scale | | |
| | Residual-norm outlier | τ_z | 2.0 σ | Corpus statistics from 100 samples | | |
| | Recency bias threshold | - | 70% | Arbitrary; flag if >70% attention on last 5 tokens | | |
| | Top-k heads | k | 20 | Performance constraint; expand if latency permits | | |
| --- | |
| ## Appendix C – Technical Dependencies | |
| ### Backend (Python) | |
| - PyTorch ≥ 2.0 | |
| - transformers ≥ 4.30 | |
| - zarr ≥ 2.14 | |
| - numpy, scipy | |
| - fastapi, uvicorn | |
| ### Frontend (Next.js) | |
| - React ≥18 | |
| - D3.js or Plotly for visualizations | |
| - WebGL for attention heatmaps | |
| - TailwindCSS for styling | |
| ### Storage | |
| - Zarr arrays for tensors (chunked by layer, head) | |
| - JSONL for telemetry | |
| - YAML for replay scripts | |
| --- | |
| ## Appendix D – OSF Pre‑Registration Template (Ready to Copy) | |
| **Title:** Making Transformer Architecture Transparent for Code Generation: A Developer‑Centric Study of Attention, Token Size & Confidence, Ablation, and Pipeline Visualisations | |
| **Principal Investigator:** Gary Boon (Northumbria University) | |
| **Planned Registration Type:** Pre‑Registration (Confirmatory) | |
| ### 1. Research Questions and Hypotheses | |
| **RQ1:** How can we transform opaque architectural mechanisms into interpretable visual representations that reveal how LLMs make code‑generation decisions? | |
| **Sub‑Hypotheses:** | |
| - **H1‑Attn:** Attention+rollout increases correct source identification vs baseline, verified by ablation (OR ≥ 1.8). | |
| - **H2‑Tok:** Entropy×token‑size hotspots predict bug locations (AUC ≥ 0.70) and reduce time‑to‑diagnosis. | |
| - **H3‑Abl:** Ablation tool reduces iterations to a passing solution by ≥20%. | |
| - **H4‑Pipe:** Pipeline summaries improve participants' accuracy at predicting the model's next token and at localising errors. | |
| ### 2. Design | |
| * **Design Type:** Within‑subjects, Latin square counterbalanced. | |
| * **Conditions:** Baseline (code inspection only) vs Glass‑Box Dashboard (with 4 visualizations). | |
| * **Participants:** n = 18–24 software engineers (2–10 years experience). | |
| * **Tasks:** T1 Code completion (5-15 LOC), T2 Bug fixing from failing tests, T3 API usage with documentation. | |
| * **Covariates:** LLM familiarity (1-7 scale), order (A→B vs B→A), programming language proficiency, years of experience. | |
| ### 3. Materials and Stimuli | |
| * **Model:** Code Llama 7B FP16 (specific checkpoint hash recorded). | |
| * **Visualisations:** Attention (heatmap + head grid), Token Size & Confidence (chip bar + entropy sparkline), Ablation (toggle masks + diff), Pipeline (swimlane timeline). | |
| * **Unit‑test harness:** pytest with pre-written test suites. | |
| * **AST/lint tools:** Python `ast` module, ruff, bandit for static analysis. | |
| ### 4. Procedure | |
| 1. **Consent + pre‑survey** (10 min): demographics, LLM use frequency, programming experience. | |
| 2. **Tutorial on dashboard** (15 min): guided walkthrough of each visualization with example. | |
| 3. **Task blocks** (40 min): counterbalanced order (Latin square); 2-3 tasks per condition. | |
| 4. **Post‑task mini‑survey** (5 min): SCS (System Causability Scale), Trust scale, NASA‑TLX. | |
| 5. **Semi-structured interview** (15 min): qualitative feedback on visualizations, workflow integration. | |
| 6. **Final SUS** (5 min): System Usability Scale for dashboard. | |
| **Total time:** ~90 minutes per participant. | |
| ### 5. Planned Analyses | |
| **Quantitative:** | |
| - **Mixed‑effects models:** condition × task + random intercepts for participant/task. | |
| - **Metrics:** Δlog‑prob (ablation impact), tests passed, time‑to‑fix, AUC(Entropy × Token Size hotspot predictor), OR(H1 - source identification accuracy). | |
| - **Software:** R (lme4) or Python (statsmodels). | |
| **Qualitative:** | |
| - **Thematic analysis:** Braun & Clarke (2021) 6-phase approach. | |
| - **Coding:** Two researchers independently code transcripts; resolve disagreements via discussion. | |
| - **Themes:** Mental model formation, trust calibration, workflow integration, visualization utility. | |
| ### 6. Power Analysis | |
| * **Effect size target:** d = 0.5 (medium effect, Cohen's conventions). | |
| * **α = 0.05 (two‑sided), power = 0.8** → n ≈ 34 participants for a paired (within‑subjects) t‑test. | |
| * **Planned n = 18-24** therefore reaches 80% power only for larger effects (d ≈ 0.6–0.7), before any dropouts. | |
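A quick sanity check on the sample size can be sketched with the normal approximation to the paired t‑test, n ≈ ((z_{1−α/2} + z_{power}) / d)². This approximation runs a couple of participants below the exact t‑based figure, so it should be read as a lower bound; the function name is hypothetical, and a dedicated power tool should be used for the registered value.

```python
import math
from statistics import NormalDist


def paired_n_normal_approx(d, alpha=0.05, power=0.80):
    """Approximate pairs needed for a two-sided paired test on effect size d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return math.ceil(((z_alpha + z_beta) / d) ** 2)


n_medium = paired_n_normal_approx(0.5)
```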
| ### 7. Data Management | |
| * **Telemetry:** JSONL event logs + zarr tensor storage. | |
| * **Audio/screen captures:** stored on separate encrypted volume; opt-out available. | |
| * **Anonymization:** Participant IDs (P01-P24); redact file paths, proprietary code. | |
| * **Publication:** Anonymised artifacts (Run ID bundles, telemetry, survey data) published on OSF upon paper acceptance. | |
| ### 8. Ethics and Risk | |
| * **Approval:** Northumbria University Ethics Protocol v1.3 (Interpretability Studies). | |
| * **Risk level:** Minimal. Participants can opt-out anytime; no deception involved. | |
| * **Compensation:** £25 Amazon voucher per participant. | |
| ### 9. Exclusion Criteria | |
| * **Pre-registered:** | |
| - < 2 years professional programming experience | |
| - No Python proficiency (self-reported < 4/7) | |
| - Previous participation in pilot study (n=3) | |
| - Incomplete task completion (<50% of tasks) | |
| ### 10. Timeline | |
| * **Pilot study (n=3):** Week 7 of implementation (threshold tuning). | |
| * **Pre-registration submission:** End of Week 7 (before main study). | |
| * **Main study (n=18-24):** Week 8-10. | |
| * **Analysis & write-up:** Week 11-16. | |
| --- | |
| ## Appendix E – Pilot Pack | |
| ### E1. Task T1 – Code Completion | |
| **Prompt:** "Write a Python function `sanitize_sql_like(pattern: str)` that escapes SQL LIKE wildcards (%, _) and backslashes." | |
| **Ground Truth Outline:** | |
```python
def sanitize_sql_like(pattern: str) -> str:
    # Escape backslashes first so the escapes added below are not doubled.
    pattern = pattern.replace("\\", "\\\\")
    pattern = pattern.replace("%", "\\%")
    pattern = pattern.replace("_", "\\_")
    return pattern
```
| **Unit Tests (`tests/test_sanitize.py`):** | |
```python
from main import sanitize_sql_like
import pytest


def test_escape_percent():
    assert sanitize_sql_like("100%") == "100\\%"


def test_escape_underscore():
    assert sanitize_sql_like("user_name") == "user\\_name"


def test_double_escape():
    assert sanitize_sql_like("C:\\path%") == "C:\\\\path\\%"
```
| ### E2. Task T2 – Bug Fix (Localisation) | |
| **Prompt:** "This function should reverse a string recursively. Find and fix the bug." | |
```python
def reverse_string(s: str) -> str:
    if len(s) == 1:
        return s
    return s[0] + reverse_string(s[1:])
```
| **Expected fix:** `return reverse_string(s[1:]) + s[0]`, with the base case widened to `if len(s) <= 1:` so that `test_empty` passes. | |
| **Unit Tests (`tests/test_reverse.py`):** | |
```python
from main import reverse_string


def test_simple():
    assert reverse_string("abc") == "cba"


def test_empty():
    assert reverse_string("") == ""
```
| ### E3. Mini‑Survey Items (Per Task) | |
| **7-point Likert scale (1=Strongly Disagree, 7=Strongly Agree):** | |
| 1. I could explain why the model produced this output. | |
| 2. I trusted the model's output appropriately. | |
| 3. My workload was high for this task. | |
| 4. The visualisations were useful for this task. | |
| 5. My confidence was well‑calibrated to the code's correctness. | |
| ### E4. Pilot Checklist | |
| - [ ] Latency < 300 ms mean for ≤512 tokens. | |
| - [ ] Entropy threshold τ_H tuned (~1.5 nats). | |
| - [ ] Δlog‑prob threshold τ_Δ tuned (~0.1). | |
| - [ ] Verify unit tests pass/fail recorded correctly. | |
| - [ ] Survey completion rate ≥ 90%. | |
| - [ ] Qualitative feedback indicates visualizations are understandable. | |
| ### E5. Output Artefacts | |
| **Per participant:** | |
| - `run_pack_P01.zip` → Run ID, tensors (zarr), logs (JSONL), test results, survey responses. | |
| - Import into OSF for data availability statement. | |
| **Aggregate:** | |
| - `pilot_summary.csv` → Metrics, thresholds, latency stats. | |
| - `pilot_feedback.md` → Qualitative themes, suggested improvements. | |
| --- | |
| ## References | |
| - **Jain, S., & Wallace, B. C. (2019).** Attention is not Explanation. *NAACL*. | |
| - **Kou, Z., et al. (2024).** Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code? *FSE*. | |
| - **Paltenghi, M., et al. (2022).** Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration. *arXiv*. | |
| - **Zheng, H., et al. (2025).** Attention Heads of Large Language Models: A Survey. *arXiv*. | |
| - **Zhao, H., et al. (2024).** Explainability for Large Language Models: A Survey. *ACM Digital Library*. | |
| - **Braun, V., & Clarke, V. (2021).** Thematic Analysis: A Practical Guide. *SAGE Publications*. | |
| - **Wang, K., et al. (2022).** Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 small. *arXiv*. | |
| --- | |
| ## Document History | |
| | Version | Date | Changes | Author | | |
| |---------|------|---------|--------| | |
| | 1.0 | 2025-11-01 | Initial specification document | Gary Boon | | |
| --- | |
| **End of Specification Document** | |