# RQ1 Mapping: How Each Visualization Addresses Architectural Transparency

**Research Question 1:** "How can we transform opaque architectural mechanisms (multi-head attention, feed-forward networks, mixture-of-experts routing) into interpretable visual representations that reveal how LLMs make code generation decisions?"

**Document Version:** 1.0
**Date:** 2025-11-01
**Author:** Gary Boon, Northumbria University

---

## Executive Summary

This document maps each of the 4 visualizations (Attention, Token Size & Confidence, Ablation, Pipeline) to RQ1, explaining:
1. What opaque mechanism each visualization addresses
2. How it transforms that mechanism into an interpretable representation
3. What code generation decisions it reveals
4. How it extends beyond existing literature
5. Specific research sub-questions for the user study

---

## 1. Attention Visualization (QKV Explorer)

### Opaque Mechanism Addressed

**Multi-head self-attention** - the fundamental mechanism by which transformers weight input tokens when generating each output token.

**Sources of opacity:**
- 32+ heads operating in parallel (Code Llama 7B has 32 heads × 32 layers = 1,024 attention heads)
- High-dimensional attention score matrices (hidden_dim × seq_length)
- Non-interpretable weight distributions across heads
- Unclear semantic specialization of individual heads

### Transformation to Interpretability

**Primary contribution:** Spatial decomposition + interactive querying

1. **Head-level decomposition:** Display each attention head's behavior separately, allowing identification of specialized roles:
   - Syntactic heads focusing on matching brackets, indentation
   - Semantic heads attending to variable definitions, type hints
   - Positional heads capturing code structure (function boundaries, control flow)

2. **Token-to-token attribution:** Interactive heat maps showing which prompt tokens each generated code token attends to, with normalized attention weights (0-1 scale):
   - Rows = generated tokens
   - Columns = prompt + context tokens
   - Heat intensity = attention weight
   - Hover = exact weights + source spans

3. **Attention rollout:** Composition of attention across layers (Abnar & Zuidema-style rollout) to show information flow from input to output:
   ```
   A_rollout = A_L × A_(L-1) × ... × A_1
   ```
   This reveals which input tokens contribute to each output token through the entire network stack.

4. **Head role grid:** Layer × Head matrix with mini-sparklines showing mean attention to token classes:
   - Delimiters (brackets, colons, commas)
   - Identifiers (variable names, function names)
   - Keywords (def, class, if, for)
   - Comments (docstrings)
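A minimal sketch of the rollout computation, using plain Python lists and two toy head-averaged attention matrices (illustrative values, not real model outputs):

```python
def matmul(a, b):
    """Product of two square row-major matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def attention_rollout(attn_layers):
    """Compose per-layer attention: A_rollout = A_L x A_(L-1) x ... x A_1.

    attn_layers[l] is the (head-averaged) seq x seq attention matrix for
    layer l; row i is token i's attention distribution over positions.
    """
    rollout = attn_layers[0]
    for layer_attn in attn_layers[1:]:
        rollout = matmul(layer_attn, rollout)  # later layers compose on the left
    return rollout

# Toy example: two layers over a 2-token sequence.
layer1 = [[0.9, 0.1],
          [0.5, 0.5]]
layer2 = [[0.6, 0.4],
          [0.2, 0.8]]
combined = attention_rollout([layer1, layer2])
# Each row of `combined` is still a distribution over input tokens,
# so it can be read as end-to-end source attribution.
```

Variants of rollout also mix the residual connection into each layer (e.g. averaging A_l with the identity) before composing; the plain product above matches the formula as written.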

### What Code Generation Decisions It Reveals

**Specific insights for developers:**

1. **Identifier resolution:** When model generates `user.name`, which prior prompt tokens did it attend to?
   - Expected: variable declaration `user = User(...)`, type hints `user: User`, docstrings describing user object
   - Misalignment: over-attending to recent tokens (recency bias) instead of declaration site

2. **Syntactic correctness:** Do specific heads focus on bracket matching, indentation patterns, or control flow structure?
   - Example: Head [Layer 5, Head 3] might specialize in matching opening/closing brackets
   - Example: Head [Layer 8, Head 12] might attend to indentation levels for syntactic consistency

3. **Context utilization:** Is the model actually "reading" the prompt context, or over-attending to recent tokens?
   - Recency bias indicator: >70% attention mass on last 5 tokens
   - Long-range dependency: attention to tokens >100 positions back

4. **Error attribution:** When buggy code is generated, can we trace it to misaligned attention?
   - Example: Model generates `user.get_name()` but should be `user.name` → attention shows model attended to API doc snippet instead of variable declaration
   - Example: Model generates incorrect variable name → attention shows model confused two similar identifiers in context
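The recency-bias indicator above (>70% of attention mass on the last 5 tokens) reduces to a one-line check; `has_recency_bias` is a hypothetical helper, sketched over a single generated token's attention row:

```python
def has_recency_bias(attn_row, window=5, threshold=0.70):
    """Flag recency bias: the fraction of this generated token's
    attention mass falling on the last `window` context positions
    exceeds `threshold` (defaults follow the indicator in the text)."""
    recent = sum(attn_row[-window:])
    return recent / sum(attn_row) > threshold

# 90% of the mass on the last five positions -> flagged.
biased = has_recency_bias([0.01] * 10 + [0.18] * 5)
# Uniform attention over 10 positions -> not flagged.
spread = has_recency_bias([0.1] * 10)
```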

### Extension Beyond Existing Literature

**Kou et al. (2024): "Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?"**
- Showed attention misalignment with human programmers
- Used aggregate metrics (averaged across heads/layers)
- Post-hoc analysis (no interactive exploration)
- Passive comparison (developers not in control)

**Your extension:**
- **Interactive head selection:** Developer chooses which head/layer to inspect in real-time
- **Code-specific annotations:** Highlight syntactic elements (keywords, identifiers, operators) with domain-specific color coding
- **Counterfactual queries:** "What if I remove this docstring? How does attention redistribute?"
- **Task-embedded evaluation:** Developers use the tool during actual code review tasks (bug detection, prompt optimization), not just correlation studies

**Paltenghi et al. (2022): "Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration"**
- Eye-tracking study comparing developer attention to model attention
- Focus on code exploration, not generation
- No interactive visualization for developers

**Your extension:**
- **Generative focus:** Attention during code generation, not just comprehension
- **Interactive tool:** Developers manipulate and query attention, not just observe
- **Causal validation:** Attention hypotheses validated via ablation (Section 3)

**Zheng et al. (2025): "Attention Heads of Large Language Models: A Survey"**
- Taxonomy of attention head discovery methods:
  1. Model-free (saliency, gradient-based)
  2. Modeling-required (probing classifiers)
- Primarily for ML researchers analyzing models

**Your positioning:**
- **Model-free + developer-in-the-loop:** No additional training, but leverages human domain expertise for interpretation
- **Novel category:** "Developer-driven interpretability" - non-ML-experts can explore attention patterns and form hypotheses about head roles

### Developer-Facing Research Questions

**RQ1.1: Head Role Discovery**
Can developers identify which attention heads are responsible for syntactic correctness vs semantic coherence?

**Hypothesis H1.1:** Developers using the attention visualization will correctly identify:
- Syntactic heads (bracket matching, indentation) with >70% accuracy
- Semantic heads (identifier resolution, type inference) with >60% accuracy
- Measured by: agreement with ground truth head roles (established via ablation studies)

**RQ1.2: Error Prediction**
Does seeing attention distributions improve developers' ability to predict model errors?

**Hypothesis H1.2:** Developers with attention visualization will:
- Predict buggy outputs 25% faster than baseline
- Increase bug detection accuracy by ≥15 percentage points
- Measured by: time to flag suspicious tokens, precision/recall of bug predictions

**RQ1.3: Attention-Expectation Alignment**
How do developers' attention expectations differ from model attention patterns?

**Hypothesis H1.3:** Developers will report misalignment in:
- >40% of generated tokens (model attends to unexpected sources)
- Especially for API usage and rare identifiers
- Measured by: developer annotations of "surprising" attention patterns + post-task interviews

**RQ1.4: Recency Bias Awareness**
Can developers identify when the model exhibits recency bias (over-attending to recent tokens)?

**Hypothesis H1.4:** With recency bias flags (>70% attention on last 5 tokens), developers will:
- Correctly identify recency bias cases with >80% accuracy
- Adjust prompts to mitigate bias in >50% of cases
- Measured by: flag accuracy vs ground truth, prompt modification patterns

---

## 2. Token Size & Confidence Visualization

### Opaque Mechanism Addressed

**Probability distribution over vocabulary** at each decoding step + **tokenization granularity**

**Sources of opacity:**
- 32K-50K vocab size (Code Llama) making full distribution uninterpretable
- Softmax scores calibrated to model's training distribution, not developer confidence
- Tokenization artifacts:
  - `"user"` tokenized as one token vs `"username"` as two tokens `["user", "name"]`
  - Rare identifiers split into nonsensical subwords: `"pytorch"` → `["py", "tor", "ch"]`
- Hidden relationship between entropy and actual error likelihood

### Transformation to Interpretability

**Primary contribution:** Uncertainty quantification + token granularity exposure

1. **Per-token confidence scores:** Display top-k alternatives with probabilities:
   ```
   "for" at 0.89
   "while" at 0.07
   "if" at 0.03
   ```
   This shows model's uncertainty and plausible alternatives.

2. **Entropy-based uncertainty:** Shannon entropy as proxy for model uncertainty:
   ```
   H = -∑ p_i log(p_i)
   ```
   - High entropy = many plausible alternatives (model is guessing)
   - Low entropy = one clear choice (model is confident)

3. **Tokenization visibility:** Show exact token boundaries (BPE/SentencePiece splits) to reveal when model is uncertain due to subword chunking:
   - Visual: token chips with width proportional to byte length
   - Chip color/opacity reflects confidence (desaturated = low confidence)
   - Example: `get_user_data` might be tokenized as `["get", "_user", "_data"]` (3 tokens) vs `["get_user_data"]` (1 token)

4. **Hallucination risk indicators:** Flag tokens with high entropy + low maximum probability:
   - Entropy ≥ τ_H (e.g., 1.5 nats)
   - Max probability < 0.5
   - This indicates model is "guessing" with no clear preference

5. **Risk hotspot flags:** Identifiers split into ≥3 subwords AND entropy peak:
   - These are statistically more likely to be bugs (to be validated in user study)
   - Example: `process_user_data` → `["process", "_user", "_data"]` with H = 1.8 nats → FLAG
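Both the entropy computation and the hotspot rule can be sketched directly from the definitions above; the threshold defaults are the illustrative values from the text.

```python
import math

def entropy_nats(probs):
    """Shannon entropy H = -sum(p * ln p), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def is_risk_hotspot(subword_pieces, top_k_probs,
                    entropy_threshold=1.5, max_prob_threshold=0.5):
    """Flag an identifier when it is split into >= 3 subwords AND the
    model is 'guessing': entropy at or above the threshold with no
    alternative reaching max_prob_threshold."""
    guessing = (entropy_nats(top_k_probs) >= entropy_threshold
                and max(top_k_probs) < max_prob_threshold)
    return len(subword_pieces) >= 3 and guessing

# A 3-way split with a flat top-k distribution trips the flag.
flagged = is_risk_hotspot(["process", "_user", "_data"],
                          [0.3, 0.25, 0.2, 0.15, 0.1])
```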

### What Code Generation Decisions It Reveals

**Specific insights for developers:**

1. **Variable naming:** When model generates `usr` vs `user`, was this high-confidence choice or arbitrary selection from similar alternatives?
   - Check top-k: if `["usr": 0.51, "user": 0.48]` → model is uncertain
   - Check entropy: if H = 1.2 nats → borderline uncertainty
   - Developer can manually select preferred alternative

2. **API usage:** Does model confidently predict correct method names (e.g., `.append()`) or waver between alternatives (`.add()`, `.push()`, `.insert()`)?
   - Low confidence on API calls → likely hallucination or incorrect usage
   - High confidence on incorrect API → model has learned wrong pattern (training data issue)

3. **Tokenization mismatches:** Does splitting `process_data` into `["process", "_data"]` vs `["process_", "data"]` affect model confidence?
   - Hypothesis: multi-split identifiers correlate with lower confidence
   - Mechanism: model's vocabulary doesn't contain full identifier, so it reconstructs from subwords
   - Developer insight: use simpler identifiers (fewer underscores, camelCase) for better model confidence

4. **Implicit assumptions:** High confidence on incorrect code suggests model has learned wrong patterns:
   - Example: model generates `list.append(x)` with 0.95 confidence, but list is actually a numpy array (should be `np.append(list, x)`)
   - This reveals model's training data bias (more Python lists than numpy arrays in training set)

### Extension Beyond Existing Literature

**Zhao et al. (2024): "Explainability for Large Language Models: A Survey"**
- Covers probability-based explanations but mostly:
  - Aggregate metrics (perplexity, log-likelihood)
  - Not code-specific
  - No tokenization awareness

**Your extension:**
- **Code-aware thresholds:** Calibrate "low confidence" thresholds specifically for code tokens:
  - Keywords (def, class) typically high confidence
  - Identifiers vary (common names high, rare names low)
  - Operators high confidence
  - Different threshold τ_H for each category

- **Tokenization pedagogy:** Educate developers on how BPE affects model's "view" of code:
  - Most code LLM papers (Bistarelli et al., 2025 review) ignore tokenization effects
  - Developers rarely aware that identifier choice affects tokenization
  - Your tool makes this visible → potential prompt engineering insight

- **Alternative exploration:** Let developers click on low-confidence tokens to see *why* alternatives were plausible:
  - Show attention snippet: which context tokens justified each alternative?
  - Link to Attention visualization for deeper investigation

- **Real-time confidence:** Stream confidence scores during generation, not just post-hoc analysis:
  - Developer can interrupt generation if confidence drops below threshold
  - Useful for interactive coding assistants

### Novel Contribution: Tokenization Γ— Confidence Interaction

**Gap in literature:** Most code generation papers ignore tokenization effects. But:
- `variable_name` (snake_case) vs `variableName` (camelCase) tokenized differently → different confidence profiles
- Short vs long identifier names have different entropy characteristics
- Rare API names may be split into nonsensical subwords → low confidence

**Your visualization makes this visible** - potentially novel for code LLM research.

**Hypothesis:** Multi-split identifiers (≥3 subwords) + entropy peaks predict bugs better than entropy alone.

### Developer-Facing Research Questions

**RQ1.5: Confidence-Based Bug Detection**
Can developers use token confidence to identify likely bugs faster than code inspection alone?

**Hypothesis H1.5:** Developers with confidence visualization will:
- Identify bugs 20% faster than baseline
- Increase bug detection precision by ≥10 percentage points
- Measured by: time to identify bug, precision/recall of bug locations

**RQ1.6: Tokenization Awareness**
Does seeing tokenization boundaries change developers' prompt engineering strategies?

**Hypothesis H1.6:** After using token size visualization, developers will:
- Report increased awareness of tokenization (>70% agree in post-survey)
- Adjust identifier naming in prompts (>40% of participants)
- Measured by: survey responses, prompt modification patterns in telemetry

**RQ1.7: Confidence Calibration**
Do high-confidence errors undermine trust more than low-confidence errors?

**Hypothesis H1.7:** Developers will report:
- Lower trust when high-confidence predictions are wrong (≥1 point on 7-point scale)
- Appropriate trust calibration when confidence aligns with correctness
- Measured by: Brier score (calibration metric), trust survey responses

**RQ1.8: Bug-Risk AUC**
Do entropy × token-size hotspot flags predict actual bug locations?

**Hypothesis H1.8 (from spec):** AUC ≥ 0.70 for hotspot predictor vs actual bug locations
- Measured by: ROC curve analysis, ground truth = unit test failures + manual bug annotations

---

## 3. Ablation Visualization

### Opaque Mechanism Addressed

**Causal attribution of model components** - specifically:
- Which attention heads are critical vs redundant?
- Which layers perform feature extraction vs reasoning?
- Which feed-forward networks (FFN) contribute to code-specific decisions?

**Sources of opacity:**
- Distributed computation across 32 layers × 32 heads = 1,024 attention heads (Code Llama 7B)
- Non-linear interactions between components (head X in layer Y may depend on head Z in layer W)
- Unclear redundancy: can model compensate if one head is removed?
- Black-box causality: correlation (attention weights) ≠ causation (actual influence)

### Transformation to Interpretability

**Primary contribution:** Interactive causal intervention + comparative analysis

1. **Selective ablation:** Developer toggles individual heads, entire layers, or FFN blocks off:
   - Head masking: zero out attention weights or set to uniform distribution
   - Layer bypass: skip layer entirely, pass residual stream through unchanged
   - FFN gate clamp: disable feed-forward network in specific layer

2. **Before/after comparison:** Side-by-side display of original output vs ablated output:
   - Unified diff showing changed tokens (color-coded: added/removed/modified)
   - Line-level changes for multi-line code generation
   - Structural changes (AST diff) to show semantic impact

3. **Quantitative impact metrics:**
   - **Token-level change rate:** % tokens that changed after ablation
   - **Semantic similarity:** CodeBLEU, embedding distance (cosine similarity)
   - **Syntactic correctness:** AST parse success (can code be parsed?)
   - **Functional correctness:** Unit tests passed (does code work?)
   - **Static analysis:** ruff/bandit warnings (code quality/security issues)
   - **Δlog-prob:** Change in log-probability of each token

4. **Per-token delta heat:** Visualize Δlog-prob and Δentropy per token:
   - Small multiples showing impact of ablating each of top-k heads
   - Identify most-impactful heads (Δlog-prob ≥ τ_Δ, e.g., 0.1)

5. **Hypothesis testing workflow:**
   - Developer predicts impact before ablation ("I think head [12,5] handles bracket matching")
   - Execute ablation
   - Verify prediction (did brackets break?)
   - Iteratively refine mental model of head roles
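A minimal sketch of the head-masking intervention, operating on nested Python lists in place of framework tensors; a real implementation would instead hook the model's attention modules and re-run generation.

```python
def ablate_head(attn_weights, layer, head, mode="zero"):
    """Return a deep copy of attn_weights with one head ablated.

    attn_weights[layer][head] is a seq x seq attention matrix.
    mode="zero":    the head contributes nothing.
    mode="uniform": the head attends equally everywhere, which keeps
                    outputs in-distribution but uninformative.
    """
    ablated = [[[row[:] for row in head_mat] for head_mat in layer_heads]
               for layer_heads in attn_weights]
    seq_len = len(ablated[layer][head])
    fill = 0.0 if mode == "zero" else 1.0 / seq_len
    ablated[layer][head] = [[fill] * seq_len for _ in range(seq_len)]
    return ablated

# Toy model: 2 layers x 2 heads over a 2-token sequence.
weights = [[[[1.0, 0.0], [0.3, 0.7]] for _ in range(2)] for _ in range(2)]
masked = ablate_head(weights, layer=0, head=1, mode="uniform")
# Only head [0, 1] changes; the original weights are left untouched,
# so before/after outputs can be diffed side by side.
```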

### What Code Generation Decisions It Reveals

**Specific insights for developers:**

1. **Critical heads:** Identify which heads, if removed, break code generation entirely:
   - Example: ablating head [Layer 3, Head 7] causes all bracket matching to fail → this head is critical for syntactic correctness
   - Implication: model relies on specific architectural component for basic syntax

2. **Redundant heads:** Which heads can be removed with minimal impact?
   - Example: ablating head [Layer 25, Head 14] changes only 2% of tokens → this head is redundant
   - Implication: model is over-parameterized (could be pruned for efficiency)

3. **Layer specialization:** Early layers (1-8) handle tokenization/syntax, mid layers (9-20) handle semantics, late layers (21-32) handle coherence?
   - Hypothesis to test via layer bypass ablations
   - Example: bypassing layer 5 breaks indentation; bypassing layer 15 breaks variable scoping

4. **Bug localization:** If ablating head X fixes a bug, that head is likely causing the error:
   - Example: model generates `user.get_name()` (wrong) → ablate head [18,3] → model generates `user.name` (correct)
   - Causal diagnosis: head [18,3] is attending to incorrect API documentation context

### Extension Beyond Existing Literature

**Mechanistic interpretability literature (Wang et al., 2022 on GPT-2 circuits):**
- Focuses on individual mechanisms (e.g., indirect object identification circuit)
- Requires manual circuit discovery by ML researchers (slow, expert-driven)
- Not interactive or developer-facing

**Your extension:**
- **Developer-driven exploration:** Non-experts (software engineers) can perform ablations without ML knowledge
- **Code generation focus:** Ablations tailored to code tasks (syntactic correctness, API usage, variable scoping)
- **Real-time feedback:** Immediate re-generation with ablated model (not batch analysis)
- **Task-oriented ablation:** During bug fixing, developer can ablate to localize error source ("Which component is causing this bug?")

**Bansal et al. (2022): "Rethinking the Role of Scale for In-Context Learning"**
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts

**Your extension:**
- **Interactive ablation:** Developer controls which components to ablate
- **Code-specific metrics:** Unit tests, AST parse, lints (not just perplexity)
- **Hypothesis-driven workflow:** Developer predicts impact before seeing result

### Novel Contribution: Ablation as Debugging Tool

**Gap in literature:** Ablation studies are typically **research tools** (for ML researchers analyzing models), not **developer tools** (for software engineers using models).

**Your contribution:** Reframe ablation as **interactive debugging**:
- "Why did the model generate this bug?" → "Let me turn off components until it works correctly" → identifies faulty component
- This is analogous to debuggers for traditional code (set breakpoints, step through execution)
- But for neural networks: "ablation breakpoints" (turn off heads/layers), "step through architecture" (layer-by-layer pipeline)

**Potential impact:**
- Developers without ML training can perform causal analysis
- Faster bug diagnosis in LLM-generated code
- Insights for model developers (which components are most critical for code generation?)

### Attribution Ground Truth (Methodology)

A source token T_src is "influential" for generated token T_gen if:
1. T_src lies in top-k rollout sources (from Attention Visualization, k=8)
2. Masking the minimal set of heads H that carry attention from T_src → T_gen causes:
   - Δlog-prob ≥ τ_Δ (e.g., 0.1) on T_gen, OR
   - Flip in unit test outcome (pass → fail or vice versa)

This operational definition enables:
- Reproducible measurement of "attribution accuracy"
- Validation of attention-based hypotheses via ablation
- Inter-rater reliability (two researchers apply same criteria)
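The criterion reduces to a small predicate; reading the Δlog-prob condition as a magnitude check is an assumption here, and `tau_delta` carries the illustrative 0.1 from the text.

```python
def is_influential(in_topk_rollout, delta_log_prob, test_flipped,
                   tau_delta=0.1):
    """T_src counts as influential for T_gen when it is a top-k rollout
    source AND masking the carrying heads either shifts T_gen's
    log-prob by at least tau_delta (magnitude check, an assumption)
    or flips a unit-test outcome."""
    return in_topk_rollout and (abs(delta_log_prob) >= tau_delta
                                or test_flipped)

# A large log-prob shift alone is not enough without rollout support.
assert is_influential(True, 0.25, False)
assert not is_influential(False, 0.25, False)
```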

### Developer-Facing Research Questions

**RQ1.9: Ablation-Assisted Debugging**
Can developers without ML expertise successfully use ablation to identify causes of buggy code generation?

**Hypothesis H1.9:** Developers using ablation tool will:
- Correctly identify causal components (head/layer causing bug) in >60% of cases
- Reduce time to diagnose bug by β‰₯25% vs baseline
- Measured by: success rate of causal identification, time to diagnosis

**RQ1.10: Mental Model Formation**
Do developers form accurate mental models of layer/head specialization after using ablation tool?

**Hypothesis H1.10:** After ablation exploration, developers will:
- Correctly categorize heads as syntactic/semantic/positional with >65% accuracy
- Describe layer roles (early=syntax, mid=semantics, late=coherence) with >70% agreement
- Measured by: post-task categorization quiz, qualitative interview themes

**RQ1.11: Iteration Reduction**
Does ablation tool reduce iterations needed to achieve passing solution?

**Hypothesis H1.11 (from spec):** Ablation tool reduces iterations to passing solution by ≥20%
- Measured by: number of prompt modifications + code edits before all unit tests pass

**RQ1.12: Causal vs Descriptive Understanding**
Do developers distinguish between correlation (attention) and causation (ablation)?

**Hypothesis H1.12:** Developers will:
- Request ablation validation for >50% of attention-based hypotheses
- Report understanding that "attention ≠ causation" (>80% agreement in survey)
- Measured by: telemetry (how often developers cross-reference Attention + Ablation), survey responses

---

## 4. Pipeline Visualization

### Opaque Mechanism Addressed

**Layer-by-layer representation transformation** - the "forward pass" through 32 transformer layers where:
- Input embeddings gradually transform into output logits
- Each layer applies: self-attention → FFN → layer norm → residual connection
- Intermediate representations are high-dimensional (hidden_dim = 4096 for Code Llama 7B) and semantically opaque

**Sources of opacity:**
- No visibility into intermediate states (black box from input β†’ output)
- Unclear where "understanding" emerges (early vs late layers?)
- Unknown bottlenecks (which layers struggle most? where does model get confused?)
- Residual connections create complex information flow (not simple feedforward)

### Transformation to Interpretability

**Primary contribution:** Temporal decomposition + interpretable layer-level signals

1. **Layer-by-layer scrubbing:** Timeline UI to "scrub" through layers 0→32, showing how representations evolve:
   - Visualize as swimlane: horizontal axis = layers, vertical axis = tokens
   - Each "swim" represents one token's journey through the architecture
   - Color intensity = uncertainty (entropy) at that layer

2. **Interpretable signals (not raw activations):**
   - **Residual-norm z-scores:** How much each layer changes the representation
     ```
     z_l = (||x_l|| - μ_l) / σ_l
     ```
     - High z → layer is "working hard" (significant transformation)
     - Low z → layer passes information through with minimal change

   - **Entropy shift:** Change in output entropy from pre- to post-layer
     ```
     ΔH_l = H(logits after layer l) - H(logits before layer l)
     ```
     - Negative ΔH → layer reduces uncertainty (good)
     - Positive ΔH → layer increases uncertainty (confusion)

   - **Attention-flow saturation:** % of attention mass concentrated on top-m positions
     ```
     Saturation = ∑(top-m attention weights) / ∑(all attention weights)
     ```
     - High saturation → focused attention (model is certain about sources)
     - Low saturation → diffuse attention (model is uncertain)

   - **Router load (MoE only):** Which experts activate in mixture-of-experts layers
     - Expert IDs + gate weights
     - Imbalance metric (are all experts used equally?)

3. **Swimlane/Timeline view:**
   - Lanes: Tokenizer → Embeddings → Layer 1 → ... → Layer 32 → Logits → Sampler → Post-proc/Tests
   - Rectangle length = time per stage (latency profiling)
   - Color = uncertainty (entropy)
   - Hover = per-stage stats (residual-z, ΔH, saturation, latency)

4. **Bottleneck identification:**
   - Flag layers in top-q percentile (e.g., top 10%) of:
     - Latency (slowest layers)
     - Residual-norm spikes (largest transformations)
     - Entropy jumps (biggest increases in uncertainty)
   - Correlate bottlenecks with sampler behavior (does entropy spike → hallucination?)
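The three intrinsic signals can be sketched directly from their definitions; the inputs below are plain probability lists and scalar norms standing in for real model statistics.

```python
import math

def residual_z(norm, mean_norm, std_norm):
    """Residual-norm z-score: z_l = (||x_l|| - mu_l) / sigma_l."""
    return (norm - mean_norm) / std_norm

def entropy_shift(probs_before, probs_after):
    """Delta H_l: entropy after the layer minus entropy before it;
    negative values mean the layer reduced uncertainty."""
    def h(ps):
        return -sum(p * math.log(p) for p in ps if p > 0)
    return h(probs_after) - h(probs_before)

def saturation(attn_row, m):
    """Fraction of attention mass concentrated on the top-m positions."""
    return sum(sorted(attn_row, reverse=True)[:m]) / sum(attn_row)

# A layer that collapses a near-uniform distribution onto one token:
# strong negative entropy shift, and high saturation on that position.
dh = entropy_shift([0.25] * 4, [0.97, 0.01, 0.01, 0.01])
sat = saturation([0.7, 0.1, 0.1, 0.1], m=1)
```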

### What Code Generation Decisions It Reveals

**Specific insights for developers:**

1. **Emergence of syntax:** At which layer does model "realize" it's generating a function?
   - Likely when indentation pattern appears, `def` keyword generated
   - Measure: residual-norm spike at layer where syntactic structure emerges
   - Example: Layer 5 shows high residual-z when generating `def factorial(n):`

2. **Semantic shift:** Can we observe when model transitions from "reading prompt" (early layers) to "generating code" (late layers)?
   - Early layers: high attention to prompt tokens, low residual-norm
   - Mid layers: residual-norm increases (processing semantics)
   - Late layers: attention shifts to recent generated tokens (auto-regressive generation)

3. **Error propagation:** If model generates bug at token T, can we trace back to which layer introduced the error?
   - Look for entropy spike or residual-norm anomaly in layers before T
   - Example: Model generates wrong variable name at token 50 → entropy jumps at layer 18 → investigate what happened at layer 18

4. **Compute allocation:** Which layers consume most compute? (Implications for model optimization)
   - Latency profiling shows bottleneck layers
   - Pruning candidates: layers with low residual-norm (minimal transformation) + high latency

### Extension Beyond Existing Literature

**Bansal et al. (2022) on in-context learning at 66B scale:**
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts
- Static analysis (not real-time exploration)

**Your extension:**
- **Code-specific annotations:** Label layers with code-relevant milestones:
  - "Layer 8: syntax tree formed"
  - "Layer 20: variable scope resolved"
  - "Layer 28: stylistic formatting applied"
- **Multi-token tracking:** Show pipeline evolution across multiple generated tokens (not just one forward pass)
- **Developer-friendly abstractions:** Avoid technical jargon (hidden states, residual stream) → use "understanding evolution", "decision stages"
- **Comparative pipelines:** Show pipeline for correct vs buggy outputs side-by-side (where do they diverge?)

**Interpretability papers (general):**
- Focus on probing classifiers to test "what does layer X know?"
- Require training additional models (probes)
- Not interactive or real-time

**Your extension:**
- **No additional training:** Use intrinsic signals (residual-norm, entropy)
- **Real-time:** Compute signals during generation (< 10ms overhead)
- **Actionable:** Developer can bypass layers to test hypotheses
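
Both intrinsic signals are cheap to derive from values the forward pass already produces; a minimal sketch, assuming access to the raw next-token logits and to the hidden states entering and leaving a layer:

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in bits) of the next-token distribution,
    computed from raw logits with a numerically stable softmax."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()                 # stabilize exp
    p = np.exp(z) / np.exp(z).sum()
    p = p[p > 0]                    # convention: 0 * log 0 = 0
    return float(-(p * np.log2(p)).sum())

def residual_norm(hidden_in, hidden_out):
    """L2 norm of a layer's change to the residual stream."""
    return float(np.linalg.norm(np.asarray(hidden_out) - np.asarray(hidden_in)))
```

A uniform distribution over 4 tokens gives 2 bits of entropy; a near-deterministic one gives almost 0, so entropy spikes are directly comparable across tokens.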

### Novel Contribution: Layer-Level Taxonomy for Code Generation

**Gap in literature:** No established taxonomy of what each transformer layer does during **code generation** specifically.

- Zheng et al. (2025) survey attention heads, but not layer-level roles
- Interpretability papers focus on language tasks (next-word prediction, sentiment, Q&A)
- Code generation is different: requires syntax, semantics, formatting, executable correctness

**Your contribution:** Empirically identify layer specialization for code:
1. **Layers 1-5: Tokenization + basic syntax**
   - Residual-norm spikes when processing delimiters, keywords
   - Attention focuses on local syntax (brackets, colons)

2. **Layers 6-15: Semantic understanding**
   - Residual-norm increases during identifier resolution
   - Attention to variable declarations, type hints, docstrings
   - Entropy decreases (model becomes more certain about semantics)

3. **Layers 16-25: Reasoning/logic**
   - Residual-norm spikes during control flow generation (if/else, loops)
   - Attention to prompt logic + recent generated code
   - Entropy may increase temporarily (exploring logical alternatives)

4. **Layers 26-32: Fluency/formatting**
   - Low residual-norm (minor refinements)
   - Attention to recent tokens (auto-regressive)
   - Entropy decreases (finalizing token choices)

**If validated, this would be novel for code LLMs and could be Paper 1 contribution.**

### Developer-Facing Research Questions

**RQ1.13: Layer Decision Identification**
Can developers identify at which layer the model "decides" on code structure (e.g., loop vs conditional)?

**Hypothesis H1.13:** Developers using pipeline visualization will:
- Correctly identify decision layer within ±3 layers in >55% of cases
- Report increased understanding of model's "thinking process" (>75% agreement)
- Measured by: layer identification accuracy (ground truth = residual-norm + entropy spike analysis), survey responses

**RQ1.14: Next-Token Prediction Improvement**
Does seeing pipeline evolution improve developers' ability to predict subsequent tokens?

**Hypothesis H1.14 (from spec):** Pipeline summaries improve next-token prediction accuracy
- Developers predict next token after seeing pipeline → compare with baseline (no pipeline)
- Expected improvement: +10-15 percentage points in top-3 accuracy
- Measured by: prediction task (5 examples per participant)
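
Scoring the prediction task reduces to top-k accuracy; a minimal sketch, assuming each trial stores a participant's ranked guesses alongside the token the model actually generated (the data layout is an assumption):

```python
def top_k_accuracy(trials, k=3):
    """Fraction of trials where the true next token appears in the
    participant's top-k ranked guesses.

    trials: list of (ranked_guesses, true_token) pairs.
    """
    hits = sum(1 for guesses, true in trials if true in guesses[:k])
    return hits / len(trials)
```

Comparing this score with and without the pipeline view gives the percentage-point improvement targeted by H1.14.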

**RQ1.15: Error Localization**
Can developers use pipeline visualization to diagnose *where* in the model an error originates?

**Hypothesis H1.15:** Developers will:
- Identify error-causing layer within ±5 layers in >50% of cases
- Reduce time to diagnose error source by ≥20% vs baseline
- Measured by: layer identification accuracy, time to diagnosis

**RQ1.16: Actionable Insights for Prompting**
Can developers use layer knowledge to improve prompts?

**Hypothesis H1.16:** After seeing pipeline, developers will:
- Adjust prompts to provide more context for early layers (syntax/semantics) in >30% of cases
- Report understanding of "what the model needs" (>70% agreement)
- Measured by: prompt modification patterns in telemetry, survey responses

---

## Cross-Cutting Contributions

### 1. Unified Glass-Box Dashboard

**Gap in literature:** Prior work (Kou et al., Paltenghi et al., Zhao et al.) focuses on **single mechanisms** in isolation.

**Your dashboard integrates:**
- **Attention** (spatial attribution)
- **Token Size & Confidence** (probabilistic uncertainty + tokenization)
- **Ablation** (causal attribution)
- **Pipeline** (temporal evolution)

**Developer can triangulate across multiple lenses:**
- Example: "Low confidence + scattered attention + early-layer bottleneck → likely hallucination"
- Example: "High confidence + focused attention, but ablating head X fixes the bug → head X is overriding correct information"

**This holistic view is novel for code generation interpretability.**

### 2. Task-Based Developer Study

**Gap:** Most interpretability papers evaluate on:
- Synthetic tasks (toy models, simple examples)
- Researcher-driven analysis (no end-users)
- Post-hoc metrics (accuracy, perplexity)

**Your study evaluates with:**
- **~10 software engineers** doing realistic code tasks (bug detection, code review, prompt optimization)
- **In-the-loop**: Developers use visualizations during the tasks (not passive observation)
- **Actionable interpretability**: Measure whether visualizations improve task performance (time, accuracy, trust)

**This is HCI-grounded interpretability research**, not just ML analysis.

### 3. Code Generation Domain Specificity

**Gap:** Explainability surveys (Zhao et al.) are domain-agnostic. Code has unique properties:
- **Syntactic correctness is binary** (parsable or not) → enables AST-based metrics
- **Semantic correctness is testable** (unit tests) → enables test-based metrics
- **Developer expertise varies** (junior vs senior) → enables expertise-based analysis
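
The binary syntax check is directly implementable with Python's standard-library `ast` module; a minimal sketch:

```python
import ast

def is_syntactically_valid(code: str) -> bool:
    """Binary syntactic-correctness check: does the snippet parse?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```

This gives a ground-truth label per generated snippet at essentially no cost, which is what makes AST-based metrics practical in the study.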

**Your visualizations tailored to code:**
- **Syntax highlighting** in attention maps (keywords, identifiers, operators color-coded)
- **Tokenization awareness** for identifiers (rare in NLP interpretability)
- **Ablation targeting code-specific heads** (bracket matching, indentation, API usage)
- **Pipeline stages mapped to code generation phases** (syntax → semantics → logic → formatting)

### 4. Interventionist Interpretability

**Gap:** Most explainability tools are **passive** (show model behavior).

Your dashboard is **active**:
- **Ablation allows causal intervention** ("What if I remove this head?")
- **Confidence allows alternative exploration** ("What else could the model have generated?")
- **Pipeline allows temporal investigation** ("Where did the model's understanding emerge?")

**Developers don't just observe: they manipulate and test hypotheses.**

**This is closer to scientist-model interaction (hypothesis-driven) than user-model consumption (passive).**

---

## Literature Positioning Summary

| Your Contribution | Related Work | Gap You Address |
|-------------------|--------------|-----------------|
| **Attention Viz** | Kou et al. (2024) - attention alignment | Interactive, per-head, code-specific, hypothesis-driven |
| **Token Confidence** | Zhao et al. (2024) - prob explanations | Tokenization awareness, code thresholds, bug prediction |
| **Ablation Viz** | Wang et al. (2022) - mechanistic interpretability | Developer-facing, real-time, code metrics (tests/AST) |
| **Pipeline Viz** | Bansal et al. (2022) - layer interventions | Code-specific stages, interpretable signals, interactive |
| **Unified Dashboard** | - | First multi-mechanism glass-box for code LLMs |
| **Developer Study** | Paltenghi et al. (2022) - eye-tracking | Task-based, in-the-loop, actionable metrics |
| **Code Specificity** | - | Syntax/test metrics, tokenization, developer expertise |
| **Interventionist** | - | Ablation, alternatives, hypothesis testing |

---

## Thesis Structure Suggestions

### Chapter 1: Introduction
- **Motivation:** Developers treat LLMs as black boxes → trust issues, debugging difficulties
- **Gap:** Prior work lacks interactive, developer-facing, multi-mechanism dashboards for code
- **Contribution:** First glass-box dashboard integrating 4 interpretability lenses + developer study

### Chapter 2: Literature Review
- **Section 2.1:** Attention in LLMs (Zheng et al., Kou et al.)
- **Section 2.2:** Explainability methods (Zhao et al.)
- **Section 2.3:** Code generation LLMs (Bistarelli et al.)
- **Section 2.4:** Developer-AI interaction (Paltenghi et al.)
- **Section 2.5:** Mechanistic interpretability (Wang et al., Bansal et al.)

### Chapter 3: Methodology (RQ1 Focus)
- **Section 3.1:** Attention Visualization
- **Section 3.2:** Token Size & Confidence Visualization
- **Section 3.3:** Ablation Visualization
- **Section 3.4:** Pipeline Visualization
- **Section 3.5:** Dashboard Integration

### Chapter 4: User Study Design
- **Section 4.1:** Participants (n=18-24 software engineers)
- **Section 4.2:** Tasks (T1, T2, T3)
- **Section 4.3:** Metrics (quantitative + qualitative)
- **Section 4.4:** Protocol (within-subjects, Latin square)

### Chapter 5: Results
- **Section 5.1:** RQ1.1-RQ1.4 (Attention)
- **Section 5.2:** RQ1.5-RQ1.8 (Token Confidence)
- **Section 5.3:** RQ1.9-RQ1.12 (Ablation)
- **Section 5.4:** RQ1.13-RQ1.16 (Pipeline)
- **Section 5.5:** Cross-Cutting Themes

### Chapter 6: Discussion
- **Section 6.1:** Interpretability for Developers (not just researchers)
- **Section 6.2:** Code-Specific Insights (tokenization, syntax, tests)
- **Section 6.3:** Limitations & Future Work

### Chapter 7: Conclusion
- **Summary of Contributions**
- **Implications for Practice** (tool design for developers)
- **Implications for Research** (novel layer taxonomy, ablation as debugging)

---

## ICML Paper 1 Suggestions

**Title:** "Making Transformer Architecture Transparent for Code Generation: A Developer-Centric Study"

**Abstract Structure:**
1. **Problem:** Developers use code LLMs as black boxes → trust/debugging issues
2. **Gap:** Prior interpretability work not developer-facing or code-specific
3. **Solution:** Glass-box dashboard with 4 visualizations (Attention, Token Confidence, Ablation, Pipeline)
4. **Study:** n=18-24 software engineers on 3 code tasks
5. **Results:** (placeholder for actual results)
   - Attention viz improves source identification (H1-Attn)
   - Token confidence flags predict bugs (H2-Tok, AUC ≥ 0.70)
   - Ablation reduces debugging iterations (H3-Abl, -20%)
   - Pipeline improves error localization (H4-Pipe)
6. **Contribution:** First empirical evidence that multi-mechanism interpretability tools improve developer performance on code tasks
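
The H2-Tok result can be scored without any ML library: AUC equals the probability that a randomly chosen buggy token receives a higher flag score than a randomly chosen correct one (the Mann-Whitney identity). A minimal sketch, assuming scalar flag scores and binary bug labels:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the fraction of (buggy, correct) pairs where the buggy token has
    the higher flag score; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Perfect separation gives 1.0 and chance gives 0.5, so the H2-Tok target of ≥ 0.70 sits comfortably between the two.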

**Sections:**
1. Introduction
2. Related Work
3. Dashboard Design (4 visualizations)
4. User Study
5. Results
6. Discussion
7. Conclusion

**Target:** ICML 2026 (submission ~January 2026)

---

**End of RQ1 Mapping Document**