RQ1 Mapping: How Each Visualization Addresses Architectural Transparency
Research Question 1: "How can we transform opaque architectural mechanisms (multi-head attention, feed-forward networks, mixture-of-experts routing) into interpretable visual representations that reveal how LLMs make code generation decisions?"
Document Version: 1.0 Date: 2025-11-01 Author: Gary Boon, Northumbria University
Executive Summary
This document maps each of the 4 visualizations (Attention, Token Size & Confidence, Ablation, Pipeline) to RQ1, explaining:
- What opaque mechanism each visualization addresses
- How it transforms that mechanism into an interpretable representation
- What code generation decisions it reveals
- How it extends beyond existing literature
- Specific research sub-questions for the user study
1. Attention Visualization (QKV Explorer)
Opaque Mechanism Addressed
Multi-head self-attention - the fundamental mechanism by which transformers weight input tokens when generating each output token.
Sources of opacity:
- 32+ heads operating in parallel (Code Llama 7B has 32 heads × 32 layers = 1,024 attention heads)
- High-dimensional attention score matrices (seq_length × seq_length per head)
- Non-interpretable weight distributions across heads
- Unclear semantic specialization of individual heads
Transformation to Interpretability
Primary contribution: Spatial decomposition + interactive querying
Head-level decomposition: Display each attention head's behavior separately, allowing identification of specialized roles:
- Syntactic heads focusing on matching brackets, indentation
- Semantic heads attending to variable definitions, type hints
- Positional heads capturing code structure (function boundaries, control flow)
Token-to-token attribution: Interactive heat maps showing which prompt tokens each generated code token attends to, with normalized attention weights (0-1 scale):
- Rows = generated tokens
- Columns = prompt + context tokens
- Heat intensity = attention weight
- Hover = exact weights + source spans
Attention rollout: Composition of attention across layers (in the style of Abnar & Zuidema's attention rollout) to show information flow from input to output:

A_rollout = A_L × A_(L-1) × ... × A_1

This reveals which input tokens contribute to each output token through the entire network stack.
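A minimal sketch of this computation, assuming head-averaged per-layer attention matrices have already been extracted (e.g., via `output_attentions=True` in a Hugging Face model); the residual-mixing step follows the standard rollout formulation:

```python
import numpy as np

def attention_rollout(attn_layers, add_residual=True):
    """Compose per-layer attention into input-to-output attribution.

    attn_layers: list of (seq, seq) matrices, one per layer (layer 1 first),
                 each already averaged over heads and row-normalized.
    """
    seq = attn_layers[0].shape[0]
    rollout = np.eye(seq)
    for A in attn_layers:
        if add_residual:
            # Mix in the identity to account for the residual connection,
            # then renormalize rows to keep a valid attention distribution.
            A = 0.5 * A + 0.5 * np.eye(seq)
            A = A / A.sum(axis=-1, keepdims=True)
        rollout = A @ rollout
    return rollout  # rollout[i, j] = contribution of input token j to position i
```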
Head role grid: Layer × Head matrix with mini-sparklines showing mean attention to token classes:
- Delimiters (brackets, colons, commas)
- Identifiers (variable names, function names)
- Keywords (def, class, if, for)
- Comments (docstrings)
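A sketch of how the per-head profiles behind these sparklines could be computed for one layer; the token-class labels are assumed to come from a separate lexing pass (e.g., Python's `tokenize` module), and the class names are illustrative:

```python
import numpy as np

# Illustrative token classes; a real implementation would derive these
# labels per position from a lexer over the prompt + generated code.
TOKEN_CLASSES = ("keyword", "identifier", "delimiter", "comment", "other")

def head_class_profile(attn, token_classes):
    """Mean attention mass each head assigns to each token class.

    attn: (num_heads, seq, seq) attention weights for one layer.
    token_classes: sequence of class labels, one per source position.
    Returns: dict mapping class -> (num_heads,) array of mean attention mass.
    """
    labels = np.asarray(token_classes)
    profile = {}
    for cls in TOKEN_CLASSES:
        mask = labels == cls
        if mask.any():
            # Sum mass over source positions in the class, average over queries.
            profile[cls] = attn[:, :, mask].sum(axis=-1).mean(axis=-1)
    return profile
```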
What Code Generation Decisions It Reveals
Specific insights for developers:
Identifier resolution: When the model generates `user.name`, which prior prompt tokens did it attend to?
- Expected: the variable declaration `user = User(...)`, type hints (`user: User`), docstrings describing the user object
- Misalignment: over-attending to recent tokens (recency bias) instead of the declaration site
Syntactic correctness: Do specific heads focus on bracket matching, indentation patterns, or control flow structure?
- Example: Head [Layer 5, Head 3] might specialize in matching opening/closing brackets
- Example: Head [Layer 8, Head 12] might attend to indentation levels for syntactic consistency
Context utilization: Is the model actually "reading" the prompt context, or over-attending to recent tokens?
- Recency bias indicator: >70% attention mass on last 5 tokens
- Long-range dependency: attention to tokens >100 positions back
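A minimal check for the recency bias indicator above, assuming one row of the token-to-token attention matrix (a generated token's attention over all prior positions):

```python
import numpy as np

def recency_mass(attn_row, window=5):
    """Fraction of attention mass on the last `window` source positions."""
    attn_row = np.asarray(attn_row, dtype=float)
    return float(attn_row[-window:].sum() / attn_row.sum())

def has_recency_bias(attn_row, window=5, threshold=0.70):
    """Flag per the document's indicator: >70% mass on the last 5 tokens."""
    return recency_mass(attn_row, window) > threshold
```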
Error attribution: When buggy code is generated, can we trace it to misaligned attention?
- Example: Model generates `user.get_name()` but should be `user.name` → attention shows the model attended to an API doc snippet instead of the variable declaration
- Example: Model generates an incorrect variable name → attention shows the model confused two similar identifiers in context
Extension Beyond Existing Literature
Kou et al. (2024): "Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?"
- Showed attention misalignment with human programmers
- Used aggregate metrics (averaged across heads/layers)
- Post-hoc analysis (no interactive exploration)
- Passive comparison (developers not in control)
Your extension:
- Interactive head selection: Developer chooses which head/layer to inspect in real-time
- Code-specific annotations: Highlight syntactic elements (keywords, identifiers, operators) with domain-specific color coding
- Counterfactual queries: "What if I remove this docstring? How does attention redistribute?"
- Task-embedded evaluation: Developers use the tool during actual code review tasks (bug detection, prompt optimization), not just correlation studies
Paltenghi et al. (2022): "Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration"
- Eye-tracking study comparing developer attention to model attention
- Focus on code exploration, not generation
- No interactive visualization for developers
Your extension:
- Generative focus: Attention during code generation, not just comprehension
- Interactive tool: Developers manipulate and query attention, not just observe
- Causal validation: Attention hypotheses validated via ablation (Section 3)
Zheng et al. (2025): "Attention Heads of Large Language Models: A Survey"
- Taxonomy of attention head discovery methods:
- Model-free (saliency, gradient-based)
- Modeling-required (probing classifiers)
- Primarily for ML researchers analyzing models
Your positioning:
- Model-free + developer-in-the-loop: No additional training, but leverages human domain expertise for interpretation
- Novel category: "Developer-driven interpretability" - non-ML-experts can explore attention patterns and form hypotheses about head roles
Developer-Facing Research Questions
RQ1.1: Head Role Discovery Can developers identify which attention heads are responsible for syntactic correctness vs semantic coherence?
Hypothesis H1.1: Developers using the attention visualization will correctly identify:
- Syntactic heads (bracket matching, indentation) with >70% accuracy
- Semantic heads (identifier resolution, type inference) with >60% accuracy
- Measured by: agreement with ground truth head roles (established via ablation studies)
RQ1.2: Error Prediction Does seeing attention distributions improve developers' ability to predict model errors?
Hypothesis H1.2: Developers with attention visualization will:
- Predict buggy outputs 25% faster than baseline
- Increase bug detection accuracy by ≥15 percentage points
- Measured by: time to flag suspicious tokens, precision/recall of bug predictions
RQ1.3: Attention-Expectation Alignment How do developers' attention expectations differ from model attention patterns?
Hypothesis H1.3: Developers will report misalignment in:
- ≥40% of generated tokens (model attends to unexpected sources)
- Especially for API usage and rare identifiers
- Measured by: developer annotations of "surprising" attention patterns + post-task interviews
RQ1.4: Recency Bias Awareness Can developers identify when the model exhibits recency bias (over-attending to recent tokens)?
Hypothesis H1.4: With recency bias flags (>70% attention on last 5 tokens), developers will:
- Correctly identify recency bias cases with >80% accuracy
- Adjust prompts to mitigate bias in >50% of cases
- Measured by: flag accuracy vs ground truth, prompt modification patterns
2. Token Size & Confidence Visualization
Opaque Mechanism Addressed
Probability distribution over vocabulary at each decoding step + tokenization granularity
Sources of opacity:
- Vocabulary of 32K-50K tokens (32K for Code Llama), making the full distribution uninterpretable
- Softmax scores calibrated to model's training distribution, not developer confidence
- Tokenization artifacts:
"user"tokenized as one token vs"username"as two tokens["user", "name"]- Rare identifiers split into nonsensical subwords:
"pytorch"β["py", "tor", "ch"]
- Hidden relationship between entropy and actual error likelihood
Transformation to Interpretability
Primary contribution: Uncertainty quantification + token granularity exposure
Per-token confidence scores: Display top-k alternatives with probabilities:
"for" at 0.89 "while" at 0.07 "if" at 0.03This shows model's uncertainty and plausible alternatives.
Entropy-based uncertainty: Shannon entropy as a proxy for model uncertainty:

H = -Σ_i p_i log(p_i)

- High entropy → many plausible alternatives (model is guessing)
- Low entropy → one clear choice (model is confident)
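A minimal sketch of both signals for one decoding step, assuming raw next-token logits from the model; the returned token IDs would be decoded with the tokenizer for display:

```python
import torch
import torch.nn.functional as F

def token_uncertainty(logits, k=3):
    """Shannon entropy (in nats) and top-k alternatives for one step.

    logits: (vocab_size,) raw next-token scores.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    top = torch.topk(probs, k)
    # Pairs of (token_id, probability), e.g., ("for", 0.89) after decoding.
    alternatives = list(zip(top.indices.tolist(), top.values.tolist()))
    return entropy, alternatives
```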
Tokenization visibility: Show exact token boundaries (BPE/SentencePiece splits) to reveal when model is uncertain due to subword chunking:
- Visual: token chips with width proportional to byte length
- Chip color/opacity reflects confidence (desaturated = low confidence)
- Example: `get_user_data` might be tokenized as `["get", "_user", "_data"]` (3 tokens) vs `["get_user_data"]` (1 token)
Hallucination risk indicators: Flag tokens with high entropy + low maximum probability:
- Entropy ≥ τ_H (e.g., 1.5 nats)
- Max probability < 0.5
- This indicates model is "guessing" with no clear preference
Risk hotspot flags: Identifiers split into ≥3 subwords AND an entropy peak:
- These are statistically more likely to be bugs (to be validated in user study)
- Example: `process_user_data` → `["process", "_user", "_data"]` with H = 1.8 nats → FLAG
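A sketch of the combined flag, with the document's thresholds (τ_H = 1.5 nats, ≥3 subwords, max probability < 0.5) passed in as illustrative defaults:

```python
def is_risk_hotspot(subword_count, entropy, max_prob,
                    tau_h=1.5, min_splits=3, max_p=0.5):
    """Combine the two heuristics above (thresholds are illustrative).

    - hallucination risk: entropy >= tau_h AND max probability < max_p
    - tokenization hotspot: >= min_splits subwords AND an entropy peak
    """
    hallucination_risk = entropy >= tau_h and max_prob < max_p
    tokenization_hotspot = subword_count >= min_splits and entropy >= tau_h
    return hallucination_risk or tokenization_hotspot

# Example from above: process_user_data -> 3 subwords, H = 1.8 nats -> FLAG
assert is_risk_hotspot(subword_count=3, entropy=1.8, max_prob=0.4)
```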
What Code Generation Decisions It Reveals
Specific insights for developers:
Variable naming: When the model generates `usr` vs `user`, was this a high-confidence choice or an arbitrary selection from similar alternatives?
- Check top-k: if `["usr": 0.51, "user": 0.48]` → the model is uncertain
- Check entropy: if H = 1.2 nats → borderline uncertainty
- Developer can manually select the preferred alternative
API usage: Does the model confidently predict correct method names (e.g., `.append()`) or waver between alternatives (`.add()`, `.push()`, `.insert()`)?
- Low confidence on API calls → likely hallucination or incorrect usage
- High confidence on an incorrect API → the model has learned a wrong pattern (training data issue)
Tokenization mismatches: Does splitting `process_data` into `["process", "_data"]` vs `["process_", "data"]` affect model confidence?
process_datainto["process", "_data"]vs["process_", "data"]affect model confidence?- Hypothesis: multi-split identifiers correlate with lower confidence
- Mechanism: model's vocabulary doesn't contain full identifier, so it reconstructs from subwords
- Developer insight: use simpler identifiers (fewer underscores, camelCase) for better model confidence
Implicit assumptions: High confidence on incorrect code suggests the model has learned wrong patterns:
- Example: model generates `list.append(x)` with 0.95 confidence, but `list` is actually a numpy array (should be `np.append(list, x)`)
- This reveals the model's training data bias (more Python lists than numpy arrays in the training set)
Extension Beyond Existing Literature
Zhao et al. (2024): "Explainability for Large Language Models: A Survey"
- Covers probability-based explanations but mostly:
- Aggregate metrics (perplexity, log-likelihood)
- Not code-specific
- No tokenization awareness
Your extension:
Code-aware thresholds: Calibrate "low confidence" thresholds specifically for code tokens:
- Keywords (def, class) typically high confidence
- Identifiers vary (common names high, rare names low)
- Operators high confidence
- A different threshold τ_H for each category
Tokenization pedagogy: Educate developers on how BPE affects model's "view" of code:
- Most code LLM papers (Bistarelli et al., 2025 review) ignore tokenization effects
- Developers rarely aware that identifier choice affects tokenization
- Your tool makes this visible → a potential prompt engineering insight
Alternative exploration: Let developers click on low-confidence tokens to see why alternatives were plausible:
- Show attention snippet: which context tokens justified each alternative?
- Link to Attention visualization for deeper investigation
Real-time confidence: Stream confidence scores during generation, not just post-hoc analysis:
- Developer can interrupt generation if confidence drops below threshold
- Useful for interactive coding assistants
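A minimal sketch of such a guard, assuming a Hugging Face-style causal LM; the loop re-runs the full forward pass each step for simplicity (no KV cache), and `min_conf` is an illustrative threshold:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_with_confidence_guard(model, tokenizer, prompt,
                                   max_new_tokens=64, min_conf=0.30):
    """Greedy decoding that halts when top-1 probability drops below min_conf."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[0, -1]          # next-token scores
        probs = F.softmax(logits, dim=-1)
        conf, next_id = probs.max(dim=-1)
        if conf.item() < min_conf:
            break  # hand control back to the developer for inspection
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```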
Novel Contribution: Tokenization × Confidence Interaction
Gap in literature: Most code generation papers ignore tokenization effects. But:
- `variable_name` (snake_case) vs `variableName` (camelCase) are tokenized differently → different confidence profiles
- Short vs long identifier names have different entropy characteristics
- Rare API names may be split into nonsensical subwords → low confidence
Your visualization makes this visible - potentially novel for code LLM research.
Hypothesis: Multi-split identifiers (≥3 subwords) + entropy peaks predict bugs better than entropy alone.
Developer-Facing Research Questions
RQ1.5: Confidence-Based Bug Detection Can developers use token confidence to identify likely bugs faster than code inspection alone?
Hypothesis H1.5: Developers with confidence visualization will:
- Identify bugs 20% faster than baseline
- Increase bug detection precision by ≥10 percentage points
- Measured by: time to identify bug, precision/recall of bug locations
RQ1.6: Tokenization Awareness Does seeing tokenization boundaries change developers' prompt engineering strategies?
Hypothesis H1.6: After using token size visualization, developers will:
- Report increased awareness of tokenization (>70% agree in post-survey)
- Adjust identifier naming in prompts (>40% of participants)
- Measured by: survey responses, prompt modification patterns in telemetry
RQ1.7: Confidence Calibration Do high-confidence errors undermine trust more than low-confidence errors?
Hypothesis H1.7: Developers will report:
- Lower trust when high-confidence predictions are wrong (≥1 point on 7-point scale)
- Appropriate trust calibration when confidence aligns with correctness
- Measured by: Brier score (calibration metric), trust survey responses
RQ1.8: Bug-Risk AUC Do entropy × token-size hotspot flags predict actual bug locations?
Hypothesis H1.8 (from spec): AUC ≥ 0.70 for hotspot predictor vs actual bug locations
- Measured by: ROC curve analysis, ground truth = unit test failures + manual bug annotations
3. Ablation Visualization
Opaque Mechanism Addressed
Causal attribution of model components - specifically:
- Which attention heads are critical vs redundant?
- Which layers perform feature extraction vs reasoning?
- Which feed-forward networks (FFN) contribute to code-specific decisions?
Sources of opacity:
- Distributed computation across 32 layers × 32 heads = 1,024 attention heads (Code Llama 7B)
- Non-linear interactions between components (head X in layer Y may depend on head Z in layer W)
- Unclear redundancy: can model compensate if one head is removed?
- Black-box causality: correlation (attention weights) ≠ causation (actual influence)
Transformation to Interpretability
Primary contribution: Interactive causal intervention + comparative analysis
Selective ablation: Developer toggles individual heads, entire layers, or FFN blocks off:
- Head masking: zero out attention weights or set to uniform distribution
- Layer bypass: skip layer entirely, pass residual stream through unchanged
- FFN gate clamp: disable feed-forward network in specific layer
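A sketch of head masking via a forward pre-hook, assuming a Llama-style model from `transformers` in which per-head outputs are concatenated along the last dimension before `o_proj`; layer bypass and FFN clamping could be hooked analogously:

```python
import torch

def ablate_head(model, layer_idx, head_idx):
    """Zero one head's contribution by masking its slice of the concatenated
    head outputs that feed o_proj. Returns a hook handle; call .remove()
    on it to restore the original model."""
    attn = model.model.layers[layer_idx].self_attn
    start = head_idx * attn.head_dim
    end = start + attn.head_dim

    def mask_head(module, args):
        hidden = args[0].clone()
        hidden[..., start:end] = 0.0  # head masking: zero this head's output
        return (hidden,) + args[1:]

    return attn.o_proj.register_forward_pre_hook(mask_head)

# Usage: handle = ablate_head(model, layer_idx=12, head_idx=5)
#        ... re-generate and compare outputs ...
#        handle.remove()
```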
Before/after comparison: Side-by-side display of original output vs ablated output:
- Unified diff showing changed tokens (color-coded: added/removed/modified)
- Line-level changes for multi-line code generation
- Structural changes (AST diff) to show semantic impact
Quantitative impact metrics:
- Token-level change rate: % tokens that changed after ablation
- Semantic similarity: CodeBLEU, embedding distance (cosine similarity)
- Syntactic correctness: AST parse success (can code be parsed?)
- Functional correctness: Unit tests passed (does code work?)
- Static analysis: ruff/bandit warnings (code quality/security issues)
- Δlog-prob: Change in log-probability of each token
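Minimal sketches of three of these metrics (token-level change rate, AST parse success, Δlog-prob), assuming generated token IDs and per-step logits have been collected from the original and ablated runs:

```python
import ast
import torch
import torch.nn.functional as F

def token_change_rate(orig_ids, ablated_ids):
    """Fraction of positions whose token changed (up to the shorter length)."""
    n = min(len(orig_ids), len(ablated_ids))
    return sum(a != b for a, b in zip(orig_ids[:n], ablated_ids[:n])) / max(n, 1)

def parses(code):
    """Syntactic correctness: does the generated Python parse?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def delta_log_prob(orig_logits, ablated_logits, token_ids):
    """Per-token drop in log-probability of the originally generated tokens.

    orig_logits, ablated_logits: (steps, vocab) logits at each decoding step.
    token_ids: (steps,) tokens the unablated model generated.
    """
    lp_orig = F.log_softmax(orig_logits, dim=-1)
    lp_abl = F.log_softmax(ablated_logits, dim=-1)
    idx = torch.as_tensor(token_ids).unsqueeze(-1)
    return (lp_orig.gather(-1, idx) - lp_abl.gather(-1, idx)).squeeze(-1)
```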
Per-token delta heat: Visualize Δlog-prob and Δentropy per token:
- Small multiples showing impact of ablating each of top-k heads
- Identify most-impactful heads (Δlog-prob ≥ τ_Δ, e.g., 0.1)
Hypothesis testing workflow:
- Developer predicts impact before ablation ("I think head [12,5] handles bracket matching")
- Execute ablation
- Verify prediction (did brackets break?)
- Iteratively refine mental model of head roles
What Code Generation Decisions It Reveals
Specific insights for developers:
Critical heads: Identify which heads, if removed, break code generation entirely:
- Example: ablating head [Layer 3, Head 7] causes all bracket matching to fail β this head is critical for syntactic correctness
- Implication: model relies on specific architectural component for basic syntax
Redundant heads: Which heads can be removed with minimal impact?
- Example: ablating head [Layer 25, Head 14] changes only 2% of tokens β this head is redundant
- Implication: model is over-parameterized (could be pruned for efficiency)
Layer specialization: Early layers (1-8) handle tokenization/syntax, mid layers (9-20) handle semantics, late layers (21-32) handle coherence?
- Hypothesis to test via layer bypass ablations
- Example: bypassing layer 5 breaks indentation; bypassing layer 15 breaks variable scoping
Bug localization: If ablating head X fixes a bug, that head is likely causing the error:
- Example: model generates `user.get_name()` (wrong) → ablate head [18,3] → model generates `user.name` (correct)
- Causal diagnosis: head [18,3] is attending to incorrect API documentation context
Extension Beyond Existing Literature
Mechanistic interpretability literature (Wang et al., 2022 on GPT-2 circuits):
- Focuses on individual mechanisms (e.g., indirect object identification circuit)
- Requires manual circuit discovery by ML researchers (slow, expert-driven)
- Not interactive or developer-facing
Your extension:
- Developer-driven exploration: Non-experts (software engineers) can perform ablations without ML knowledge
- Code generation focus: Ablations tailored to code tasks (syntactic correctness, API usage, variable scoping)
- Real-time feedback: Immediate re-generation with ablated model (not batch analysis)
- Task-oriented ablation: During bug fixing, developer can ablate to localize error source ("Which component is causing this bug?")
Bansal et al. (2022): "Rethinking the Role of Scale for In-Context Learning"
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts
Your extension:
- Interactive ablation: Developer controls which components to ablate
- Code-specific metrics: Unit tests, AST parse, lints (not just perplexity)
- Hypothesis-driven workflow: Developer predicts impact before seeing result
Novel Contribution: Ablation as Debugging Tool
Gap in literature: Ablation studies are typically research tools (for ML researchers analyzing models), not developer tools (for software engineers using models).
Your contribution: Reframe ablation as interactive debugging:
- "Why did the model generate this bug?" β "Let me turn off components until it works correctly" β identifies faulty component
- This is analogous to debuggers for traditional code (set breakpoints, step through execution)
- But for neural networks: "ablation breakpoints" (turn off heads/layers), "step through architecture" (layer-by-layer pipeline)
Potential impact:
- Developers without ML training can perform causal analysis
- Faster bug diagnosis in LLM-generated code
- Insights for model developers (which components are most critical for code generation?)
Attribution Ground Truth (Methodology)
A source token T_src is "influential" for generated token T_gen if:
- T_src lies in top-k rollout sources (from Attention Visualization, k=8)
- Masking the minimal set of heads H that carry attention from T_src → T_gen causes:
- Δlog-prob ≥ τ_Δ (e.g., 0.1) on T_gen, OR
- Flip in unit test outcome (pass → fail or vice versa)
This operational definition enables:
- Reproducible measurement of "attribution accuracy"
- Validation of attention-based hypotheses via ablation
- Inter-rater reliability (two researchers apply same criteria)
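A sketch of this criterion for a single (T_src, T_gen) pair, assuming the rollout attributions, the Δlog-prob under head masking, and the unit-test outcome have already been computed:

```python
import numpy as np

def is_influential(rollout_row, src_pos, d_log_prob, test_flipped,
                   k=8, tau_delta=0.1):
    """Operational definition above: T_src must be a top-k rollout source,
    and masking the connecting heads must move log-prob by >= tau_delta
    or flip the unit-test outcome.

    rollout_row: (seq,) rollout attributions for the generated token T_gen.
    src_pos: index of the candidate source token T_src.
    """
    top_k_sources = np.argsort(rollout_row)[-k:]
    if src_pos not in top_k_sources:
        return False
    return d_log_prob >= tau_delta or test_flipped
```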
Developer-Facing Research Questions
RQ1.9: Ablation-Assisted Debugging Can developers without ML expertise successfully use ablation to identify causes of buggy code generation?
Hypothesis H1.9: Developers using ablation tool will:
- Correctly identify causal components (head/layer causing bug) in >60% of cases
- Reduce time to diagnose bug by ≥25% vs baseline
- Measured by: success rate of causal identification, time to diagnosis
RQ1.10: Mental Model Formation Do developers form accurate mental models of layer/head specialization after using ablation tool?
Hypothesis H1.10: After ablation exploration, developers will:
- Correctly categorize heads as syntactic/semantic/positional with >65% accuracy
- Describe layer roles (early=syntax, mid=semantics, late=coherence) with >70% agreement
- Measured by: post-task categorization quiz, qualitative interview themes
RQ1.11: Iteration Reduction Does ablation tool reduce iterations needed to achieve passing solution?
Hypothesis H1.11 (from spec): Ablation tool reduces iterations to passing solution by ≥20%
- Measured by: number of prompt modifications + code edits before all unit tests pass
RQ1.12: Causal vs Descriptive Understanding Do developers distinguish between correlation (attention) and causation (ablation)?
Hypothesis H1.12: Developers will:
- Request ablation validation for >50% of attention-based hypotheses
- Report understanding that "attention ≠ causation" (>80% agreement in survey)
- Measured by: telemetry (how often developers cross-reference Attention + Ablation), survey responses
4. Pipeline Visualization
Opaque Mechanism Addressed
Layer-by-layer representation transformation - the "forward pass" through 32 transformer layers where:
- Input embeddings gradually transform into output logits
- Each layer applies: self-attention β FFN β layer norm β residual connection
- Intermediate representations are high-dimensional (hidden_dim = 4096 for Code Llama 7B) and semantically opaque
Sources of opacity:
- No visibility into intermediate states (black box from input → output)
- Unclear where "understanding" emerges (early vs late layers?)
- Unknown bottlenecks (which layers struggle most? where does model get confused?)
- Residual connections create complex information flow (not simple feedforward)
Transformation to Interpretability
Primary contribution: Temporal decomposition + interpretable layer-level signals
Layer-by-layer scrubbing: Timeline UI to "scrub" through layers 0→32, showing how representations evolve:
- Visualize as swimlane: horizontal axis = layers, vertical axis = tokens
- Each "swim" represents one token's journey through the architecture
- Color intensity = uncertainty (entropy) at that layer
Interpretable signals (not raw activations; the first two are sketched after this list):
Residual-norm z-scores: How much each layer changes the representation:

z_l = (||x_l|| - μ_l) / σ_l

- High z → layer is "working hard" (significant transformation)
- Low z → layer passes information through with minimal change
Entropy shift: Change in output entropy from pre- to post-layer:

ΔH_l = H(logits after layer l) - H(logits before layer l)

- Negative ΔH → layer reduces uncertainty (good)
- Positive ΔH → layer increases uncertainty (confusion)
Attention-flow saturation: % of attention mass concentrated on the top-m positions:

Saturation = Σ(top-m attention weights) / Σ(all attention weights)

- High saturation → focused attention (model is certain about sources)
- Low saturation → diffuse attention (model is uncertain)
Router load (MoE only): Which experts activate in mixture-of-experts layers
- Expert IDs + gate weights
- Imbalance metric (are all experts used equally?)
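A minimal sketch of the first two signals for one forward pass, assuming a Hugging Face causal LM; the entropy shift uses a logit-lens-style projection of intermediate hidden states through the final norm and `lm_head` (an interpretive approximation), and the z-scores are normalized across the layers of one pass rather than against per-layer corpus statistics (μ_l, σ_l), which is a simplification:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_signals(model, input_ids):
    """Residual-norm z-scores and entropy shifts at the last position.

    input_ids: (1, seq) token IDs for one prompt/generation prefix.
    """
    out = model(input_ids, output_hidden_states=True)
    hidden = out.hidden_states                       # (num_layers + 1) tensors
    norms = torch.stack([h[0, -1].norm() for h in hidden])
    z = (norms - norms.mean()) / norms.std()         # residual-norm z-scores

    def entropy(h):
        # Logit lens: project the intermediate state through norm + lm_head.
        logits = model.lm_head(model.model.norm(h[0, -1]))
        p = F.softmax(logits, dim=-1)
        return -(p * p.clamp_min(1e-12).log()).sum()

    H = torch.stack([entropy(h) for h in hidden])
    delta_H = H[1:] - H[:-1]                         # ΔH_l per layer
    return z, delta_H
```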
Swimlane/Timeline view:
- Lanes: Tokenizer → Embeddings → Layer 1 → ... → Layer 32 → Logits → Sampler → Post-proc/Tests
- Rectangle length = time per stage (latency profiling)
- Color = uncertainty (entropy)
- Hover = per-stage stats (residual-z, ΔH, saturation, latency)
Bottleneck identification:
- Flag layers in the top-q percentile (e.g., top 10%) of:
- Latency (slowest layers)
- Residual-norm spikes (largest transformations)
- Entropy jumps (biggest increases in uncertainty)
- Correlate bottlenecks with sampler behavior (does an entropy spike → hallucination?)
What Code Generation Decisions It Reveals
Specific insights for developers:
Emergence of syntax: At which layer does the model "realize" it's generating a function?
- Likely when the indentation pattern appears or the `def` keyword is generated
- Measure: residual-norm spike at the layer where syntactic structure emerges
- Example: Layer 5 shows high residual-z when generating `def factorial(n):`
Semantic shift: Can we observe when model transitions from "reading prompt" (early layers) to "generating code" (late layers)?
- Early layers: high attention to prompt tokens, low residual-norm
- Mid layers: residual-norm increases (processing semantics)
- Late layers: attention shifts to recent generated tokens (auto-regressive generation)
Error propagation: If model generates bug at token T, can we trace back to which layer introduced the error?
- Look for entropy spike or residual-norm anomaly in layers before T
- Example: Model generates wrong variable name at token 50 → entropy jumps at layer 18 → investigate what happened at layer 18
Compute allocation: Which layers consume most compute? (Implications for model optimization)
- Latency profiling shows bottleneck layers
- Pruning candidates: layers with low residual-norm (minimal transformation) + high latency
Extension Beyond Existing Literature
Bansal et al. (2022) on in-context learning at 66B scale:
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts
- Static analysis (not real-time exploration)
Your extension:
- Code-specific annotations: Label layers with code-relevant milestones:
- "Layer 8: syntax tree formed"
- "Layer 20: variable scope resolved"
- "Layer 28: stylistic formatting applied"
- Multi-token tracking: Show pipeline evolution across multiple generated tokens (not just one forward pass)
- Developer-friendly abstractions: Avoid technical jargon (hidden states, residual stream) → use "understanding evolution", "decision stages"
- Comparative pipelines: Show pipeline for correct vs buggy outputs side-by-side (where do they diverge?)
Interpretability papers (general):
- Focus on probing classifiers to test "what does layer X know?"
- Require training additional models (probes)
- Not interactive or real-time
Your extension:
- No additional training: Use intrinsic signals (residual-norm, entropy)
- Real-time: Compute signals during generation (< 10ms overhead)
- Actionable: Developer can bypass layers to test hypotheses
Novel Contribution: Layer-Level Taxonomy for Code Generation
Gap in literature: No established taxonomy of what each transformer layer does during code generation specifically.
- Zheng et al. (2025) survey attention heads, but not layer-level roles
- Interpretability papers focus on language tasks (next-word prediction, sentiment, Q&A)
- Code generation is different: requires syntax, semantics, formatting, executable correctness
Your contribution: Empirically identify layer specialization for code:
Layers 1-5: Tokenization + basic syntax
- Residual-norm spikes when processing delimiters, keywords
- Attention focuses on local syntax (brackets, colons)
Layers 6-15: Semantic understanding
- Residual-norm increases during identifier resolution
- Attention to variable declarations, type hints, docstrings
- Entropy decreases (model becomes more certain about semantics)
Layers 16-25: Reasoning/logic
- Residual-norm spikes during control flow generation (if/else, loops)
- Attention to prompt logic + recent generated code
- Entropy may increase temporarily (exploring logical alternatives)
Layers 26-32: Fluency/formatting
- Low residual-norm (minor refinements)
- Attention to recent tokens (auto-regressive)
- Entropy decreases (finalizing token choices)
If validated, this would be novel for code LLMs and could be Paper 1 contribution.
Developer-Facing Research Questions
RQ1.13: Layer Decision Identification Can developers identify at which layer the model "decides" on code structure (e.g., loop vs conditional)?
Hypothesis H1.13: Developers using pipeline visualization will:
- Correctly identify decision layer within ±3 layers in >55% of cases
- Report increased understanding of model's "thinking process" (>75% agreement)
- Measured by: layer identification accuracy (ground truth = residual-norm + entropy spike analysis), survey responses
RQ1.14: Next-Token Prediction Improvement Does seeing pipeline evolution improve developers' ability to predict subsequent tokens?
Hypothesis H1.14 (from spec): Pipeline summaries improve next-token prediction accuracy
- Developers predict next token after seeing pipeline → compare with baseline (no pipeline)
- Expected improvement: +10-15 percentage points in top-3 accuracy
- Measured by: prediction task (5 examples per participant)
RQ1.15: Error Localization Can developers use pipeline visualization to diagnose where in the model an error originates?
Hypothesis H1.15: Developers will:
- Identify error-causing layer within ±5 layers in >50% of cases
- Reduce time to diagnose error source by ≥20% vs baseline
- Measured by: layer identification accuracy, time to diagnosis
RQ1.16: Actionable Insights for Prompting Can developers use layer knowledge to improve prompts?
Hypothesis H1.16: After seeing pipeline, developers will:
- Adjust prompts to provide more context for early layers (syntax/semantics) in >30% of cases
- Report understanding of "what the model needs" (>70% agreement)
- Measured by: prompt modification patterns in telemetry, survey responses
Cross-Cutting Contributions
1. Unified Glass-Box Dashboard
Gap in literature: Prior work (Kou et al., Paltenghi et al., Zhao et al.) focuses on single mechanisms in isolation.
Your dashboard integrates:
- Attention (spatial attribution)
- Token Size & Confidence (probabilistic uncertainty + tokenization)
- Ablation (causal attribution)
- Pipeline (temporal evolution)
Developer can triangulate across multiple lenses:
- Example: "Low confidence + scattered attention + early-layer bottleneck β likely hallucination"
- Example: "High confidence + focused attention + but ablating head X fixes bug β head X is overriding correct information"
This holistic view is novel for code generation interpretability.
2. Task-Based Developer Study
Gap: Most interpretability papers evaluate on:
- Synthetic tasks (toy models, simple examples)
- Researcher-driven analysis (no end-users)
- Post-hoc metrics (accuracy, perplexity)
Your study evaluates with:
- 18-24 software engineers doing realistic code tasks (bug detection, code review, prompt optimization)
- In-the-loop: Developers use visualizations during task (not passive observation)
- Actionable interpretability: Measure whether visualizations improve task performance (time, accuracy, trust)
This is HCI-grounded interpretability research, not just ML analysis.
3. Code Generation Domain Specificity
Gap: Explainability surveys (Zhao et al.) are domain-agnostic. Code has unique properties:
- Syntactic correctness is binary (parsable or not) → enables AST-based metrics
- Semantic correctness is testable (unit tests) → enables test-based metrics
- Developer expertise varies (junior vs senior) → enables expertise-based analysis
Your visualizations tailored to code:
- Syntax highlighting in attention maps (keywords, identifiers, operators color-coded)
- Tokenization awareness for identifiers (rare in NLP interpretability)
- Ablation targeting code-specific heads (bracket matching, indentation, API usage)
- Pipeline stages mapped to code generation phases (syntax → semantics → logic → formatting)
4. Interventionist Interpretability
Gap: Most explainability tools are passive (show model behavior).
Your dashboard is active:
- Ablation allows causal intervention ("What if I remove this head?")
- Confidence allows alternative exploration ("What else could the model have generated?")
- Pipeline allows temporal investigation ("Where did the model's understanding emerge?")
Developers don't just observe - they manipulate and test hypotheses.
This is closer to scientist-model interaction (hypothesis-driven) than user-model consumption (passive).
Literature Positioning Summary
| Your Contribution | Related Work | Gap You Address |
|---|---|---|
| Attention Viz | Kou et al. (2024) - attention alignment | Interactive, per-head, code-specific, hypothesis-driven |
| Token Confidence | Zhao et al. (2024) - prob explanations | Tokenization awareness, code thresholds, bug prediction |
| Ablation Viz | Wang et al. (2022) - mechanistic interpretability | Developer-facing, real-time, code metrics (tests/AST) |
| Pipeline Viz | Bansal et al. (2022) - layer interventions | Code-specific stages, interpretable signals, interactive |
| Unified Dashboard | - | First multi-mechanism glass-box for code LLMs |
| Developer Study | Paltenghi et al. (2022) - eye-tracking | Task-based, in-the-loop, actionable metrics |
| Code Specificity | - | Syntax/test metrics, tokenization, developer expertise |
| Interventionist | - | Ablation, alternatives, hypothesis testing |
Thesis Structure Suggestions
Chapter 1: Introduction
- Motivation: Developers treat LLMs as black boxes → trust issues, debugging difficulties
- Gap: Prior work lacks interactive, developer-facing, multi-mechanism dashboards for code
- Contribution: First glass-box dashboard integrating 4 interpretability lenses + developer study
Chapter 2: Literature Review
- Section 2.1: Attention in LLMs (Zheng et al., Kou et al.)
- Section 2.2: Explainability methods (Zhao et al.)
- Section 2.3: Code generation LLMs (Bistarelli et al.)
- Section 2.4: Developer-AI interaction (Paltenghi et al.)
- Section 2.5: Mechanistic interpretability (Wang et al., Bansal et al.)
Chapter 3: Methodology (RQ1 Focus)
- Section 3.1: Attention Visualization
- Section 3.2: Token Size & Confidence Visualization
- Section 3.3: Ablation Visualization
- Section 3.4: Pipeline Visualization
- Section 3.5: Dashboard Integration
Chapter 4: User Study Design
- Section 4.1: Participants (n=18-24 software engineers)
- Section 4.2: Tasks (T1, T2, T3)
- Section 4.3: Metrics (quantitative + qualitative)
- Section 4.4: Protocol (within-subjects, Latin square)
Chapter 5: Results
- Section 5.1: RQ1.1-RQ1.4 (Attention)
- Section 5.2: RQ1.5-RQ1.8 (Token Confidence)
- Section 5.3: RQ1.9-RQ1.12 (Ablation)
- Section 5.4: RQ1.13-RQ1.16 (Pipeline)
- Section 5.5: Cross-Cutting Themes
Chapter 6: Discussion
- Section 6.1: Interpretability for Developers (not just researchers)
- Section 6.2: Code-Specific Insights (tokenization, syntax, tests)
- Section 6.3: Limitations & Future Work
Chapter 7: Conclusion
- Summary of Contributions
- Implications for Practice (tool design for developers)
- Implications for Research (novel layer taxonomy, ablation as debugging)
ICML Paper 1 Suggestions
Title: "Making Transformer Architecture Transparent for Code Generation: A Developer-Centric Study"
Abstract Structure:
- Problem: Developers use code LLMs as black boxes → trust/debugging issues
- Gap: Prior interpretability work not developer-facing or code-specific
- Solution: Glass-box dashboard with 4 visualizations (Attention, Token Confidence, Ablation, Pipeline)
- Study: n=18-24 software engineers on 3 code tasks
- Results: (placeholder for actual results)
- Attention viz improves source identification (H1-Attn)
- Token confidence flags predict bugs (H2-Tok, AUC ≥ 0.70)
- Ablation reduces debugging iterations (H3-Abl, -20%)
- Pipeline improves error localization (H4-Pipe)
- Contribution: First empirical evidence that multi-mechanism interpretability tools improve developer performance on code tasks
Sections:
- Introduction
- Related Work
- Dashboard Design (4 visualizations)
- User Study
- Results
- Discussion
- Conclusion
Target: ICML 2026 (submission ~January 2026)
End of RQ1 Mapping Document