RQ1 Mapping: How Each Visualization Addresses Architectural Transparency
Research Question 1: "How can we transform opaque architectural mechanisms (multi-head attention, feed-forward networks, mixture-of-experts routing) into interpretable visual representations that reveal how LLMs make code generation decisions?"
Document Version: 1.0 Date: 2025-11-01 Author: Gary Boon, Northumbria University
Executive Summary
This document maps each of the 4 visualizations (Attention, Token Size & Confidence, Ablation, Pipeline) to RQ1, explaining:
- What opaque mechanism each visualization addresses
- How it transforms that mechanism into an interpretable representation
- What code generation decisions it reveals
- How it extends beyond existing literature
- Specific research sub-questions for the user study
1. Attention Visualization (QKV Explorer)
Opaque Mechanism Addressed
Multi-head self-attention - the fundamental mechanism by which transformers weight input tokens when generating each output token.
Sources of opacity:
- 32+ heads operating in parallel (Code Llama 7B has 32 heads × 32 layers = 1,024 attention heads)
- High-dimensional attention score matrices (seq_length × seq_length per head)
- Non-interpretable weight distributions across heads
- Unclear semantic specialization of individual heads
Transformation to Interpretability
Primary contribution: Spatial decomposition + interactive querying
Head-level decomposition: Display each attention head's behavior separately, allowing identification of specialized roles:
- Syntactic heads focusing on matching brackets, indentation
- Semantic heads attending to variable definitions, type hints
- Positional heads capturing code structure (function boundaries, control flow)
Token-to-token attribution: Interactive heat maps showing which prompt tokens each generated code token attends to, with normalized attention weights (0-1 scale):
- Rows = generated tokens
- Columns = prompt + context tokens
- Heat intensity = attention weight
- Hover = exact weights + source spans
Attention rollout: Composition of attention across layers (in the style of Abnar & Zuidema's attention rollout) to show information flow from input to output:

A_rollout = A_L × A_(L-1) × ... × A_1

This reveals which input tokens contribute to each output token through the entire network stack.
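A minimal sketch of this computation, assuming head-averaged per-layer attention matrices have already been extracted (e.g., via `output_attentions=True` in a Hugging Face model); the residual-mixing step follows the standard rollout formulation:

```python
import numpy as np

def attention_rollout(attn_layers, add_residual=True):
    """Compose per-layer attention into input-to-output attribution.

    attn_layers: list of (seq, seq) matrices, one per layer (layer 1 first),
                 each already averaged over heads and row-normalized.
    """
    seq = attn_layers[0].shape[0]
    rollout = np.eye(seq)
    for A in attn_layers:
        if add_residual:
            # Mix in the identity to account for the residual connection,
            # then renormalize rows to keep a valid attention distribution.
            A = 0.5 * A + 0.5 * np.eye(seq)
            A = A / A.sum(axis=-1, keepdims=True)
        rollout = A @ rollout
    return rollout  # rollout[i, j] = contribution of input token j to position i
```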
Head role grid: Layer × Head matrix with mini-sparklines showing mean attention to token classes:
- Delimiters (brackets, colons, commas)
- Identifiers (variable names, function names)
- Keywords (def, class, if, for)
- Comments (docstrings)
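A sketch of how the per-head profiles behind these sparklines could be computed for one layer; the token-class labels are assumed to come from a separate lexing pass (e.g., Python's `tokenize` module), and the class names are illustrative:

```python
import numpy as np

# Illustrative token classes; a real implementation would derive these
# labels per position from a lexer over the prompt + generated code.
TOKEN_CLASSES = ("keyword", "identifier", "delimiter", "comment", "other")

def head_class_profile(attn, token_classes):
    """Mean attention mass each head assigns to each token class.

    attn: (num_heads, seq, seq) attention weights for one layer.
    token_classes: sequence of class labels, one per source position.
    Returns: dict mapping class -> (num_heads,) array of mean attention mass.
    """
    labels = np.asarray(token_classes)
    profile = {}
    for cls in TOKEN_CLASSES:
        mask = labels == cls
        if mask.any():
            # Sum mass over source positions in the class, average over queries.
            profile[cls] = attn[:, :, mask].sum(axis=-1).mean(axis=-1)
    return profile
```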
What Code Generation Decisions It Reveals
Specific insights for developers:
Identifier resolution: When the model generates `user.name`, which prior prompt tokens did it attend to?
- Expected: the variable declaration `user = User(...)`, type hints (`user: User`), docstrings describing the user object
- Misalignment: over-attending to recent tokens (recency bias) instead of the declaration site
Syntactic correctness: Do specific heads focus on bracket matching, indentation patterns, or control flow structure?
- Example: Head [Layer 5, Head 3] might specialize in matching opening/closing brackets
- Example: Head [Layer 8, Head 12] might attend to indentation levels for syntactic consistency
Context utilization: Is the model actually "reading" the prompt context, or over-attending to recent tokens?
- Recency bias indicator: >70% attention mass on last 5 tokens
- Long-range dependency: attention to tokens >100 positions back
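A minimal check for the recency bias indicator above, assuming one row of the token-to-token attention matrix (a generated token's attention over all prior positions):

```python
import numpy as np

def recency_mass(attn_row, window=5):
    """Fraction of attention mass on the last `window` source positions."""
    attn_row = np.asarray(attn_row, dtype=float)
    return float(attn_row[-window:].sum() / attn_row.sum())

def has_recency_bias(attn_row, window=5, threshold=0.70):
    """Flag per the document's indicator: >70% mass on the last 5 tokens."""
    return recency_mass(attn_row, window) > threshold
```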
Error attribution: When buggy code is generated, can we trace it to misaligned attention?
- Example: Model generates `user.get_name()` but should be `user.name` → attention shows the model attended to an API doc snippet instead of the variable declaration
- Example: Model generates an incorrect variable name → attention shows the model confused two similar identifiers in context
Extension Beyond Existing Literature
Kou et al. (2024): "Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?"
- Showed attention misalignment with human programmers
- Used aggregate metrics (averaged across heads/layers)
- Post-hoc analysis (no interactive exploration)
- Passive comparison (developers not in control)
Your extension:
- Interactive head selection: Developer chooses which head/layer to inspect in real-time
- Code-specific annotations: Highlight syntactic elements (keywords, identifiers, operators) with domain-specific color coding
- Counterfactual queries: "What if I remove this docstring? How does attention redistribute?"
- Task-embedded evaluation: Developers use the tool during actual code review tasks (bug detection, prompt optimization), not just correlation studies
Paltenghi et al. (2022): "Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration"
- Eye-tracking study comparing developer attention to model attention
- Focus on code exploration, not generation
- No interactive visualization for developers
Your extension:
- Generative focus: Attention during code generation, not just comprehension
- Interactive tool: Developers manipulate and query attention, not just observe
- Causal validation: Attention hypotheses validated via ablation (Section 3)
Zheng et al. (2025): "Attention Heads of Large Language Models: A Survey"
- Taxonomy of attention head discovery methods:
- Model-free (saliency, gradient-based)
- Modeling-required (probing classifiers)
- Primarily for ML researchers analyzing models
Your positioning:
- Model-free + developer-in-the-loop: No additional training, but leverages human domain expertise for interpretation
- Novel category: "Developer-driven interpretability" - non-ML-experts can explore attention patterns and form hypotheses about head roles
Developer-Facing Research Questions
RQ1.1: Head Role Discovery Can developers identify which attention heads are responsible for syntactic correctness vs semantic coherence?
Hypothesis H1.1: Developers using the attention visualization will correctly identify:
- Syntactic heads (bracket matching, indentation) with >70% accuracy
- Semantic heads (identifier resolution, type inference) with >60% accuracy
- Measured by: agreement with ground truth head roles (established via ablation studies)
RQ1.2: Error Prediction Does seeing attention distributions improve developers' ability to predict model errors?
Hypothesis H1.2: Developers with attention visualization will:
- Predict buggy outputs 25% faster than baseline
- Increase bug detection accuracy by ≥15 percentage points
- Measured by: time to flag suspicious tokens, precision/recall of bug predictions
RQ1.3: Attention-Expectation Alignment How do developers' attention expectations differ from model attention patterns?
Hypothesis H1.3: Developers will report misalignment in:
- ≥40% of generated tokens (model attends to unexpected sources)
- Especially for API usage and rare identifiers
- Measured by: developer annotations of "surprising" attention patterns + post-task interviews
RQ1.4: Recency Bias Awareness Can developers identify when the model exhibits recency bias (over-attending to recent tokens)?
Hypothesis H1.4: With recency bias flags (>70% attention on last 5 tokens), developers will:
- Correctly identify recency bias cases with >80% accuracy
- Adjust prompts to mitigate bias in >50% of cases
- Measured by: flag accuracy vs ground truth, prompt modification patterns
2. Token Size & Confidence Visualization
Opaque Mechanism Addressed
Probability distribution over vocabulary at each decoding step + tokenization granularity
Sources of opacity:
- Vocabulary of 32K-50K tokens (32K for Code Llama), making the full distribution uninterpretable
- Softmax scores calibrated to model's training distribution, not developer confidence
- Tokenization artifacts:
"user"tokenized as one token vs"username"as two tokens["user", "name"]- Rare identifiers split into nonsensical subwords:
"pytorch"β["py", "tor", "ch"]
- Hidden relationship between entropy and actual error likelihood
Transformation to Interpretability
Primary contribution: Uncertainty quantification + token granularity exposure
Per-token confidence scores: Display top-k alternatives with probabilities:
"for" at 0.89 "while" at 0.07 "if" at 0.03This shows model's uncertainty and plausible alternatives.
Entropy-based uncertainty: Shannon entropy as a proxy for model uncertainty:

H = -Σ_i p_i log(p_i)

- High entropy → many plausible alternatives (model is guessing)
- Low entropy → one clear choice (model is confident)
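A minimal sketch of both signals for one decoding step, assuming raw next-token logits from the model; the returned token IDs would be decoded with the tokenizer for display:

```python
import torch
import torch.nn.functional as F

def token_uncertainty(logits, k=3):
    """Shannon entropy (in nats) and top-k alternatives for one step.

    logits: (vocab_size,) raw next-token scores.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    top = torch.topk(probs, k)
    # Pairs of (token_id, probability), e.g., ("for", 0.89) after decoding.
    alternatives = list(zip(top.indices.tolist(), top.values.tolist()))
    return entropy, alternatives
```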
Tokenization visibility: Show exact token boundaries (BPE/SentencePiece splits) to reveal when model is uncertain due to subword chunking:
- Visual: token chips with width proportional to byte length
- Chip color/opacity reflects confidence (desaturated = low confidence)
- Example: `get_user_data` might be tokenized as `["get", "_user", "_data"]` (3 tokens) vs `["get_user_data"]` (1 token)
Hallucination risk indicators: Flag tokens with high entropy + low maximum probability:
- Entropy ≥ τ_H (e.g., 1.5 nats)
- Max probability < 0.5
- This indicates model is "guessing" with no clear preference
Risk hotspot flags: Identifiers split into ≥3 subwords AND an entropy peak:
- These are statistically more likely to be bugs (to be validated in user study)
- Example: `process_user_data` → `["process", "_user", "_data"]` with H = 1.8 nats → FLAG
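A sketch of the combined flag, with the document's thresholds (τ_H = 1.5 nats, ≥3 subwords, max probability < 0.5) passed in as illustrative defaults:

```python
def is_risk_hotspot(subword_count, entropy, max_prob,
                    tau_h=1.5, min_splits=3, max_p=0.5):
    """Combine the two heuristics above (thresholds are illustrative).

    - hallucination risk: entropy >= tau_h AND max probability < max_p
    - tokenization hotspot: >= min_splits subwords AND an entropy peak
    """
    hallucination_risk = entropy >= tau_h and max_prob < max_p
    tokenization_hotspot = subword_count >= min_splits and entropy >= tau_h
    return hallucination_risk or tokenization_hotspot

# Example from above: process_user_data -> 3 subwords, H = 1.8 nats -> FLAG
assert is_risk_hotspot(subword_count=3, entropy=1.8, max_prob=0.4)
```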
What Code Generation Decisions It Reveals
Specific insights for developers:
Variable naming: When the model generates `usr` vs `user`, was this a high-confidence choice or an arbitrary selection from similar alternatives?
- Check top-k: if `["usr": 0.51, "user": 0.48]` → the model is uncertain
- Check entropy: if H = 1.2 nats → borderline uncertainty
- Developer can manually select the preferred alternative
API usage: Does the model confidently predict correct method names (e.g., `.append()`) or waver between alternatives (`.add()`, `.push()`, `.insert()`)?
- Low confidence on API calls → likely hallucination or incorrect usage
- High confidence on an incorrect API → the model has learned a wrong pattern (training data issue)
Tokenization mismatches: Does splitting `process_data` into `["process", "_data"]` vs `["process_", "data"]` affect model confidence?
process_datainto["process", "_data"]vs["process_", "data"]affect model confidence?- Hypothesis: multi-split identifiers correlate with lower confidence
- Mechanism: model's vocabulary doesn't contain full identifier, so it reconstructs from subwords
- Developer insight: use simpler identifiers (fewer underscores, camelCase) for better model confidence
Implicit assumptions: High confidence on incorrect code suggests the model has learned wrong patterns:
- Example: model generates `list.append(x)` with 0.95 confidence, but `list` is actually a numpy array (should be `np.append(list, x)`)
- This reveals the model's training data bias (more Python lists than numpy arrays in the training set)
Extension Beyond Existing Literature
Zhao et al. (2024): "Explainability for Large Language Models: A Survey"
- Covers probability-based explanations but mostly:
- Aggregate metrics (perplexity, log-likelihood)
- Not code-specific
- No tokenization awareness
Your extension:
Code-aware thresholds: Calibrate "low confidence" thresholds specifically for code tokens:
- Keywords (def, class) typically high confidence
- Identifiers vary (common names high, rare names low)
- Operators high confidence
- A different threshold τ_H for each category
Tokenization pedagogy: Educate developers on how BPE affects model's "view" of code:
- Most code LLM papers (Bistarelli et al., 2025 review) ignore tokenization effects
- Developers rarely aware that identifier choice affects tokenization
- Your tool makes this visible → a potential prompt engineering insight
Alternative exploration: Let developers click on low-confidence tokens to see why alternatives were plausible:
- Show attention snippet: which context tokens justified each alternative?
- Link to Attention visualization for deeper investigation
Real-time confidence: Stream confidence scores during generation, not just post-hoc analysis:
- Developer can interrupt generation if confidence drops below threshold
- Useful for interactive coding assistants
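A minimal sketch of such a guard, assuming a Hugging Face-style causal LM; the loop re-runs the full forward pass each step for simplicity (no KV cache), and `min_conf` is an illustrative threshold:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_with_confidence_guard(model, tokenizer, prompt,
                                   max_new_tokens=64, min_conf=0.30):
    """Greedy decoding that halts when top-1 probability drops below min_conf."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[0, -1]          # next-token scores
        probs = F.softmax(logits, dim=-1)
        conf, next_id = probs.max(dim=-1)
        if conf.item() < min_conf:
            break  # hand control back to the developer for inspection
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```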
Novel Contribution: Tokenization × Confidence Interaction
Gap in literature: Most code generation papers ignore tokenization effects. But:
- `variable_name` (snake_case) vs `variableName` (camelCase) are tokenized differently → different confidence profiles
- Short vs long identifier names have different entropy characteristics
- Rare API names may be split into nonsensical subwords → low confidence
Your visualization makes this visible - potentially novel for code LLM research.
Hypothesis: Multi-split identifiers (≥3 subwords) + entropy peaks predict bugs better than entropy alone.
Developer-Facing Research Questions
RQ1.5: Confidence-Based Bug Detection Can developers use token confidence to identify likely bugs faster than code inspection alone?
Hypothesis H1.5: Developers with confidence visualization will:
- Identify bugs 20% faster than baseline
- Increase bug detection precision by ≥10 percentage points
- Measured by: time to identify bug, precision/recall of bug locations
RQ1.6: Tokenization Awareness Does seeing tokenization boundaries change developers' prompt engineering strategies?
Hypothesis H1.6: After using token size visualization, developers will:
- Report increased awareness of tokenization (>70% agree in post-survey)
- Adjust identifier naming in prompts (>40% of participants)
- Measured by: survey responses, prompt modification patterns in telemetry
RQ1.7: Confidence Calibration Do high-confidence errors undermine trust more than low-confidence errors?
Hypothesis H1.7: Developers will report:
- Lower trust when high-confidence predictions are wrong (≥1 point on 7-point scale)
- Appropriate trust calibration when confidence aligns with correctness
- Measured by: Brier score (calibration metric), trust survey responses
RQ1.8: Bug-Risk AUC Do entropy × token-size hotspot flags predict actual bug locations?
Hypothesis H1.8 (from spec): AUC ≥ 0.70 for hotspot predictor vs actual bug locations
- Measured by: ROC curve analysis, ground truth = unit test failures + manual bug annotations
3. Ablation Visualization
Opaque Mechanism Addressed
Causal attribution of model components - specifically:
- Which attention heads are critical vs redundant?
- Which layers perform feature extraction vs reasoning?
- Which feed-forward networks (FFN) contribute to code-specific decisions?
Sources of opacity:
- Distributed computation across 32 layers × 32 heads = 1,024 attention heads (Code Llama 7B)
- Non-linear interactions between components (head X in layer Y may depend on head Z in layer W)
- Unclear redundancy: can model compensate if one head is removed?
- Black-box causality: correlation (attention weights) ≠ causation (actual influence)
Transformation to Interpretability
Primary contribution: Interactive causal intervention + comparative analysis
Selective ablation: Developer toggles individual heads, entire layers, or FFN blocks off:
- Head masking: zero out attention weights or set to uniform distribution
- Layer bypass: skip layer entirely, pass residual stream through unchanged
- FFN gate clamp: disable feed-forward network in specific layer
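A sketch of head masking via a forward pre-hook, assuming a Llama-style model from `transformers` in which per-head outputs are concatenated along the last dimension before `o_proj`; layer bypass and FFN clamping could be hooked analogously:

```python
import torch

def ablate_head(model, layer_idx, head_idx):
    """Zero one head's contribution by masking its slice of the concatenated
    head outputs that feed o_proj. Returns a hook handle; call .remove()
    on it to restore the original model."""
    attn = model.model.layers[layer_idx].self_attn
    start = head_idx * attn.head_dim
    end = start + attn.head_dim

    def mask_head(module, args):
        hidden = args[0].clone()
        hidden[..., start:end] = 0.0  # head masking: zero this head's output
        return (hidden,) + args[1:]

    return attn.o_proj.register_forward_pre_hook(mask_head)

# Usage: handle = ablate_head(model, layer_idx=12, head_idx=5)
#        ... re-generate and compare outputs ...
#        handle.remove()
```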
Before/after comparison: Side-by-side display of original output vs ablated output:
- Unified diff showing changed tokens (color-coded: added/removed/modified)
- Line-level changes for multi-line code generation
- Structural changes (AST diff) to show semantic impact
Quantitative impact metrics:
- Token-level change rate: % tokens that changed after ablation
- Semantic similarity: CodeBLEU, embedding distance (cosine similarity)
- Syntactic correctness: AST parse success (can code be parsed?)
- Functional correctness: Unit tests passed (does code work?)
- Static analysis: ruff/bandit warnings (code quality/security issues)
- Δlog-prob: Change in log-probability of each token
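Minimal sketches of three of these metrics (token-level change rate, AST parse success, Δlog-prob), assuming generated token IDs and per-step logits have been collected from the original and ablated runs:

```python
import ast
import torch
import torch.nn.functional as F

def token_change_rate(orig_ids, ablated_ids):
    """Fraction of positions whose token changed (up to the shorter length)."""
    n = min(len(orig_ids), len(ablated_ids))
    return sum(a != b for a, b in zip(orig_ids[:n], ablated_ids[:n])) / max(n, 1)

def parses(code):
    """Syntactic correctness: does the generated Python parse?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def delta_log_prob(orig_logits, ablated_logits, token_ids):
    """Per-token drop in log-probability of the originally generated tokens.

    orig_logits, ablated_logits: (steps, vocab) logits at each decoding step.
    token_ids: (steps,) tokens the unablated model generated.
    """
    lp_orig = F.log_softmax(orig_logits, dim=-1)
    lp_abl = F.log_softmax(ablated_logits, dim=-1)
    idx = torch.as_tensor(token_ids).unsqueeze(-1)
    return (lp_orig.gather(-1, idx) - lp_abl.gather(-1, idx)).squeeze(-1)
```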
Per-token delta heat: Visualize Δlog-prob and Δentropy per token:
- Small multiples showing impact of ablating each of top-k heads
- Identify most-impactful heads (Δlog-prob ≥ τ_Δ, e.g., 0.1)
Hypothesis testing workflow:
- Developer predicts impact before ablation ("I think head [12,5] handles bracket matching")
- Execute ablation
- Verify prediction (did brackets break?)
- Iteratively refine mental model of head roles
What Code Generation Decisions It Reveals
Specific insights for developers:
Critical heads: Identify which heads, if removed, break code generation entirely:
- Example: ablating head [Layer 3, Head 7] causes all bracket matching to fail β this head is critical for syntactic correctness
- Implication: model relies on specific architectural component for basic syntax
Redundant heads: Which heads can be removed with minimal impact?
- Example: ablating head [Layer 25, Head 14] changes only 2% of tokens β this head is redundant
- Implication: model is over-parameterized (could be pruned for efficiency)
Layer specialization: Early layers (1-8) handle tokenization/syntax, mid layers (9-20) handle semantics, late layers (21-32) handle coherence?
- Hypothesis to test via layer bypass ablations
- Example: bypassing layer 5 breaks indentation; bypassing layer 15 breaks variable scoping
Bug localization: If ablating head X fixes a bug, that head is likely causing the error:
- Example: model generates `user.get_name()` (wrong) → ablate head [18,3] → model generates `user.name` (correct)
- Causal diagnosis: head [18,3] is attending to incorrect API documentation context
Extension Beyond Existing Literature
Mechanistic interpretability literature (Wang et al., 2022 on GPT-2 circuits):
- Focuses on individual mechanisms (e.g., indirect object identification circuit)
- Requires manual circuit discovery by ML researchers (slow, expert-driven)
- Not interactive or developer-facing
Your extension:
- Developer-driven exploration: Non-experts (software engineers) can perform ablations without ML knowledge
- Code generation focus: Ablations tailored to code tasks (syntactic correctness, API usage, variable scoping)
- Real-time feedback: Immediate re-generation with ablated model (not batch analysis)
- Task-oriented ablation: During bug fixing, developer can ablate to localize error source ("Which component is causing this bug?")
Bansal et al. (2022): "Rethinking the Role of Scale for In-Context Learning"
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts
Your extension:
- Interactive ablation: Developer controls which components to ablate
- Code-specific metrics: Unit tests, AST parse, lints (not just perplexity)
- Hypothesis-driven workflow: Developer predicts impact before seeing result
Novel Contribution: Ablation as Debugging Tool
Gap in literature: Ablation studies are typically research tools (for ML researchers analyzing models), not developer tools (for software engineers using models).
Your contribution: Reframe ablation as interactive debugging:
- "Why did the model generate this bug?" β "Let me turn off components until it works correctly" β identifies faulty component
- This is analogous to debuggers for traditional code (set breakpoints, step through execution)
- But for neural networks: "ablation breakpoints" (turn off heads/layers), "step through architecture" (layer-by-layer pipeline)
Potential impact:
- Developers without ML training can perform causal analysis
- Faster bug diagnosis in LLM-generated code
- Insights for model developers (which components are most critical for code generation?)
Attribution Ground Truth (Methodology)
A source token T_src is "influential" for generated token T_gen if:
- T_src lies in top-k rollout sources (from Attention Visualization, k=8)
- Masking the minimal set of heads H that carry attention from T_src → T_gen causes:
- Δlog-prob ≥ τ_Δ (e.g., 0.1) on T_gen, OR
- Flip in unit test outcome (pass → fail or vice versa)
This operational definition enables:
- Reproducible measurement of "attribution accuracy"
- Validation of attention-based hypotheses via ablation
- Inter-rater reliability (two researchers apply same criteria)
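A sketch of this criterion for a single (T_src, T_gen) pair, assuming the rollout attributions, the Δlog-prob under head masking, and the unit-test outcome have already been computed:

```python
import numpy as np

def is_influential(rollout_row, src_pos, d_log_prob, test_flipped,
                   k=8, tau_delta=0.1):
    """Operational definition above: T_src must be a top-k rollout source,
    and masking the connecting heads must move log-prob by >= tau_delta
    or flip the unit-test outcome.

    rollout_row: (seq,) rollout attributions for the generated token T_gen.
    src_pos: index of the candidate source token T_src.
    """
    top_k_sources = np.argsort(rollout_row)[-k:]
    if src_pos not in top_k_sources:
        return False
    return d_log_prob >= tau_delta or test_flipped
```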
Developer-Facing Research Questions
RQ1.9: Ablation-Assisted Debugging Can developers without ML expertise successfully use ablation to identify causes of buggy code generation?
Hypothesis H1.9: Developers using ablation tool will:
- Correctly identify causal components (head/layer causing bug) in >60% of cases
- Reduce time to diagnose bug by ≥25% vs baseline
- Measured by: success rate of causal identification, time to diagnosis
RQ1.10: Mental Model Formation Do developers form accurate mental models of layer/head specialization after using ablation tool?
Hypothesis H1.10: After ablation exploration, developers will:
- Correctly categorize heads as syntactic/semantic/positional with >65% accuracy
- Describe layer roles (early=syntax, mid=semantics, late=coherence) with >70% agreement
- Measured by: post-task categorization quiz, qualitative interview themes
RQ1.11: Iteration Reduction Does ablation tool reduce iterations needed to achieve passing solution?
Hypothesis H1.11 (from spec): Ablation tool reduces iterations to passing solution by ≥20%
- Measured by: number of prompt modifications + code edits before all unit tests pass
RQ1.12: Causal vs Descriptive Understanding Do developers distinguish between correlation (attention) and causation (ablation)?
Hypothesis H1.12: Developers will:
- Request ablation validation for >50% of attention-based hypotheses
- Report understanding that "attention ≠ causation" (>80% agreement in survey)
- Measured by: telemetry (how often developers cross-reference Attention + Ablation), survey responses
4. Pipeline Visualization
Opaque Mechanism Addressed
Layer-by-layer representation transformation - the "forward pass" through 32 transformer layers where:
- Input embeddings gradually transform into output logits
- Each layer applies: self-attention β FFN β layer norm β residual connection
- Intermediate representations are high-dimensional (hidden_dim = 4096 for Code Llama 7B) and semantically opaque
Sources of opacity:
- No visibility into intermediate states (black box from input → output)
- Unclear where "understanding" emerges (early vs late layers?)
- Unknown bottlenecks (which layers struggle most? where does model get confused?)
- Residual connections create complex information flow (not simple feedforward)
Transformation to Interpretability
Primary contribution: Temporal decomposition + interpretable layer-level signals
Layer-by-layer scrubbing: Timeline UI to "scrub" through layers 0→32, showing how representations evolve:
- Visualize as swimlane: horizontal axis = layers, vertical axis = tokens
- Each "swim" represents one token's journey through the architecture
- Color intensity = uncertainty (entropy) at that layer
Interpretable signals (not raw activations; the first two are sketched after this list):
Residual-norm z-scores: How much each layer changes the representation:

z_l = (||x_l|| - μ_l) / σ_l

- High z → layer is "working hard" (significant transformation)
- Low z → layer passes information through with minimal change
Entropy shift: Change in output entropy from pre- to post-layer:

ΔH_l = H(logits after layer l) - H(logits before layer l)

- Negative ΔH → layer reduces uncertainty (good)
- Positive ΔH → layer increases uncertainty (confusion)
Attention-flow saturation: % of attention mass concentrated on the top-m positions:

Saturation = Σ(top-m attention weights) / Σ(all attention weights)

- High saturation → focused attention (model is certain about sources)
- Low saturation → diffuse attention (model is uncertain)
Router load (MoE only): Which experts activate in mixture-of-experts layers
- Expert IDs + gate weights
- Imbalance metric (are all experts used equally?)
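A minimal sketch of the first two signals for one forward pass, assuming a Hugging Face causal LM; the entropy shift uses a logit-lens-style projection of intermediate hidden states through the final norm and `lm_head` (an interpretive approximation), and the z-scores are normalized across the layers of one pass rather than against per-layer corpus statistics (μ_l, σ_l), which is a simplification:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_signals(model, input_ids):
    """Residual-norm z-scores and entropy shifts at the last position.

    input_ids: (1, seq) token IDs for one prompt/generation prefix.
    """
    out = model(input_ids, output_hidden_states=True)
    hidden = out.hidden_states                       # (num_layers + 1) tensors
    norms = torch.stack([h[0, -1].norm() for h in hidden])
    z = (norms - norms.mean()) / norms.std()         # residual-norm z-scores

    def entropy(h):
        # Logit lens: project the intermediate state through norm + lm_head.
        logits = model.lm_head(model.model.norm(h[0, -1]))
        p = F.softmax(logits, dim=-1)
        return -(p * p.clamp_min(1e-12).log()).sum()

    H = torch.stack([entropy(h) for h in hidden])
    delta_H = H[1:] - H[:-1]                         # ΔH_l per layer
    return z, delta_H
```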
Swimlane/Timeline view:
- Lanes: Tokenizer → Embeddings → Layer 1 → ... → Layer 32 → Logits → Sampler → Post-proc/Tests
- Rectangle length = time per stage (latency profiling)
- Color = uncertainty (entropy)
- Hover = per-stage stats (residual-z, ΔH, saturation, latency)
Bottleneck identification:
- Flag layers in the top-q percentile (e.g., top 10%) of:
- Latency (slowest layers)
- Residual-norm spikes (largest transformations)
- Entropy jumps (biggest increases in uncertainty)
- Correlate bottlenecks with sampler behavior (does an entropy spike → hallucination?)
What Code Generation Decisions It Reveals
Specific insights for developers:
Emergence of syntax: At which layer does the model "realize" it's generating a function?
- Likely when the indentation pattern appears or the `def` keyword is generated
- Measure: residual-norm spike at the layer where syntactic structure emerges
- Example: Layer 5 shows high residual-z when generating `def factorial(n):`
Semantic shift: Can we observe when model transitions from "reading prompt" (early layers) to "generating code" (late layers)?
- Early layers: high attention to prompt tokens, low residual-norm
- Mid layers: residual-norm increases (processing semantics)
- Late layers: attention shifts to recent generated tokens (auto-regressive generation)
Error propagation: If model generates bug at token T, can we trace back to which layer introduced the error?
- Look for entropy spike or residual-norm anomaly in layers before T
- Example: Model generates wrong variable name at token 50 → entropy jumps at layer 18 → investigate what happened at layer 18
Compute allocation: Which layers consume most compute? (Implications for model optimization)
- Latency profiling shows bottleneck layers
- Pruning candidates: layers with low residual-norm (minimal transformation) + high latency
Extension Beyond Existing Literature
Bansal et al. (2022) on in-context learning at 66B scale:
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts
- Static analysis (not real-time exploration)
Your extension:
- Code-specific annotations: Label layers with code-relevant milestones:
- "Layer 8: syntax tree formed"
- "Layer 20: variable scope resolved"
- "Layer 28: stylistic formatting applied"
- Multi-token tracking: Show pipeline evolution across multiple generated tokens (not just one forward pass)
- Developer-friendly abstractions: Avoid technical jargon (hidden states, residual stream) → use "understanding evolution", "decision stages"
- Comparative pipelines: Show pipeline for correct vs buggy outputs side-by-side (where do they diverge?)
Interpretability papers (general):
- Focus on probing classifiers to test "what does layer X know?"
- Require training additional models (probes)
- Not interactive or real-time
Your extension:
- No additional training: Use intrinsic signals (residual-norm, entropy)
- Real-time: Compute signals during generation (< 10ms overhead)
- Actionable: Developer can bypass layers to test hypotheses
Novel Contribution: Layer-Level Taxonomy for Code Generation
Gap in literature: No established taxonomy of what each transformer layer does during code generation specifically.
- Zheng et al. (2025) survey attention heads, but not layer-level roles
- Interpretability papers focus on language tasks (next-word prediction, sentiment, Q&A)
- Code generation is different: requires syntax, semantics, formatting, executable correctness
Your contribution: Empirically identify layer specialization for code:
Layers 1-5: Tokenization + basic syntax
- Residual-norm spikes when processing delimiters, keywords
- Attention focuses on local syntax (brackets, colons)
Layers 6-15: Semantic understanding
- Residual-norm increases during identifier resolution
- Attention to variable declarations, type hints, docstrings
- Entropy decreases (model becomes more certain about semantics)
Layers 16-25: Reasoning/logic
- Residual-norm spikes during control flow generation (if/else, loops)
- Attention to prompt logic + recent generated code
- Entropy may increase temporarily (exploring logical alternatives)
Layers 26-32: Fluency/formatting
- Low residual-norm (minor refinements)
- Attention to recent tokens (auto-regressive)
- Entropy decreases (finalizing token choices)
If validated, this would be novel for code LLMs and could be Paper 1 contribution.
Developer-Facing Research Questions
RQ1.13: Layer Decision Identification Can developers identify at which layer the model "decides" on code structure (e.g., loop vs conditional)?
Hypothesis H1.13: Developers using pipeline visualization will:
- Correctly identify decision layer within ±3 layers in >55% of cases
- Report increased understanding of model's "thinking process" (>75% agreement)
- Measured by: layer identification accuracy (ground truth = residual-norm + entropy spike analysis), survey responses
RQ1.14: Next-Token Prediction Improvement Does seeing pipeline evolution improve developers' ability to predict subsequent tokens?
Hypothesis H1.14 (from spec): Pipeline summaries improve next-token prediction accuracy
- Developers predict next token after seeing pipeline → compare with baseline (no pipeline)
- Expected improvement: +10-15 percentage points in top-3 accuracy
- Measured by: prediction task (5 examples per participant)
RQ1.15: Error Localization Can developers use pipeline visualization to diagnose where in the model an error originates?
Hypothesis H1.15: Developers will:
- Identify error-causing layer within ±5 layers in >50% of cases
- Reduce time to diagnose error source by ≥20% vs baseline
- Measured by: layer identification accuracy, time to diagnosis
RQ1.16: Actionable Insights for Prompting Can developers use layer knowledge to improve prompts?
Hypothesis H1.16: After seeing pipeline, developers will:
- Adjust prompts to provide more context for early layers (syntax/semantics) in >30% of cases
- Report understanding of "what the model needs" (>70% agreement)
- Measured by: prompt modification patterns in telemetry, survey responses
Cross-Cutting Contributions
1. Unified Glass-Box Dashboard
Gap in literature: Prior work (Kou et al., Paltenghi et al., Zhao et al.) focuses on single mechanisms in isolation.
Your dashboard integrates:
- Attention (spatial attribution)
- Token Size & Confidence (probabilistic uncertainty + tokenization)
- Ablation (causal attribution)
- Pipeline (temporal evolution)
Developer can triangulate across multiple lenses:
- Example: "Low confidence + scattered attention + early-layer bottleneck β likely hallucination"
- Example: "High confidence + focused attention + but ablating head X fixes bug β head X is overriding correct information"
This holistic view is novel for code generation interpretability.
2. Task-Based Developer Study
Gap: Most interpretability papers evaluate on:
- Synthetic tasks (toy models, simple examples)
- Researcher-driven analysis (no end-users)
- Post-hoc metrics (accuracy, perplexity)
Your study evaluates with:
- 18-24 software engineers doing realistic code tasks (bug detection, code review, prompt optimization)
- In-the-loop: Developers use visualizations during task (not passive observation)
- Actionable interpretability: Measure whether visualizations improve task performance (time, accuracy, trust)
This is HCI-grounded interpretability research, not just ML analysis.
3. Code Generation Domain Specificity
Gap: Explainability surveys (Zhao et al.) are domain-agnostic. Code has unique properties:
- Syntactic correctness is binary (parsable or not) → enables AST-based metrics
- Semantic correctness is testable (unit tests) → enables test-based metrics
- Developer expertise varies (junior vs senior) → enables expertise-based analysis
Your visualizations tailored to code:
- Syntax highlighting in attention maps (keywords, identifiers, operators color-coded)
- Tokenization awareness for identifiers (rare in NLP interpretability)
- Ablation targeting code-specific heads (bracket matching, indentation, API usage)
- Pipeline stages mapped to code generation phases (syntax → semantics → logic → formatting)
4. Interventionist Interpretability
Gap: Most explainability tools are passive (show model behavior).
Your dashboard is active:
- Ablation allows causal intervention ("What if I remove this head?")
- Confidence allows alternative exploration ("What else could the model have generated?")
- Pipeline allows temporal investigation ("Where did the model's understanding emerge?")
Developers don't just observe - they manipulate and test hypotheses.
This is closer to scientist-model interaction (hypothesis-driven) than user-model consumption (passive).
Literature Positioning Summary
| Your Contribution | Related Work | Gap You Address |
|---|---|---|
| Attention Viz | Kou et al. (2024) - attention alignment | Interactive, per-head, code-specific, hypothesis-driven |
| Token Confidence | Zhao et al. (2024) - prob explanations | Tokenization awareness, code thresholds, bug prediction |
| Ablation Viz | Wang et al. (2022) - mechanistic interpretability | Developer-facing, real-time, code metrics (tests/AST) |
| Pipeline Viz | Bansal et al. (2022) - layer interventions | Code-specific stages, interpretable signals, interactive |
| Unified Dashboard | - | First multi-mechanism glass-box for code LLMs |
| Developer Study | Paltenghi et al. (2022) - eye-tracking | Task-based, in-the-loop, actionable metrics |
| Code Specificity | - | Syntax/test metrics, tokenization, developer expertise |
| Interventionist | - | Ablation, alternatives, hypothesis testing |
Thesis Structure Suggestions
Chapter 1: Introduction
- Motivation: Developers treat LLMs as black boxes → trust issues, debugging difficulties
- Gap: Prior work lacks interactive, developer-facing, multi-mechanism dashboards for code
- Contribution: First glass-box dashboard integrating 4 interpretability lenses + developer study
Chapter 2: Literature Review
- Section 2.1: Attention in LLMs (Zheng et al., Kou et al.)
- Section 2.2: Explainability methods (Zhao et al.)
- Section 2.3: Code generation LLMs (Bistarelli et al.)
- Section 2.4: Developer-AI interaction (Paltenghi et al.)
- Section 2.5: Mechanistic interpretability (Wang et al., Bansal et al.)
Chapter 3: Methodology (RQ1 Focus)
- Section 3.1: Attention Visualization
- Section 3.2: Token Size & Confidence Visualization
- Section 3.3: Ablation Visualization
- Section 3.4: Pipeline Visualization
- Section 3.5: Dashboard Integration
Chapter 4: User Study Design
- Section 4.1: Participants (n=18-24 software engineers)
- Section 4.2: Tasks (T1, T2, T3)
- Section 4.3: Metrics (quantitative + qualitative)
- Section 4.4: Protocol (within-subjects, Latin square)
Chapter 5: Results
- Section 5.1: RQ1.1-RQ1.4 (Attention)
- Section 5.2: RQ1.5-RQ1.8 (Token Confidence)
- Section 5.3: RQ1.9-RQ1.12 (Ablation)
- Section 5.4: RQ1.13-RQ1.16 (Pipeline)
- Section 5.5: Cross-Cutting Themes
Chapter 6: Discussion
- Section 6.1: Interpretability for Developers (not just researchers)
- Section 6.2: Code-Specific Insights (tokenization, syntax, tests)
- Section 6.3: Limitations & Future Work
Chapter 7: Conclusion
- Summary of Contributions
- Implications for Practice (tool design for developers)
- Implications for Research (novel layer taxonomy, ablation as debugging)
ICML Paper 1 Suggestions
Title: "Making Transformer Architecture Transparent for Code Generation: A Developer-Centric Study"
Abstract Structure:
- Problem: Developers use code LLMs as black boxes → trust/debugging issues
- Gap: Prior interpretability work not developer-facing or code-specific
- Solution: Glass-box dashboard with 4 visualizations (Attention, Token Confidence, Ablation, Pipeline)
- Study: n=18-24 software engineers on 3 code tasks
- Results: (placeholder for actual results)
- Attention viz improves source identification (H1-Attn)
- Token confidence flags predict bugs (H2-Tok, AUC ≥ 0.70)
- Ablation reduces debugging iterations (H3-Abl, -20%)
- Pipeline improves error localization (H4-Pipe)
- Contribution: First empirical evidence that multi-mechanism interpretability tools improve developer performance on code tasks
Sections:
- Introduction
- Related Work
- Dashboard Design (4 visualizations)
- User Study
- Results
- Discussion
- Conclusion
Target: ICML 2026 (submission ~January 2026)
End of RQ1 Mapping Document