# RQ1 Mapping: How Each Visualization Addresses Architectural Transparency
**Research Question 1:** "How can we transform opaque architectural mechanisms (multi-head attention, feed-forward networks, mixture-of-experts routing) into interpretable visual representations that reveal how LLMs make code generation decisions?"
**Document Version:** 1.0
**Date:** 2025-11-01
**Author:** Gary Boon, Northumbria University
---
## Executive Summary
This document maps each of the 4 visualizations (Attention, Token Size & Confidence, Ablation, Pipeline) to RQ1, explaining:
1. What opaque mechanism each visualization addresses
2. How it transforms that mechanism into an interpretable representation
3. What code generation decisions it reveals
4. How it extends beyond existing literature
5. Specific research sub-questions for the user study
---
## 1. Attention Visualization (QKV Explorer)
### Opaque Mechanism Addressed
**Multi-head self-attention** - the fundamental mechanism by which transformers weight input tokens when generating each output token.
**Sources of opacity:**
- 32+ heads operating in parallel (Code Llama 7B has 32 heads × 32 layers = 1,024 attention heads)
- High-dimensional attention score matrices (seq_length × seq_length per head)
- Non-interpretable weight distributions across heads
- Unclear semantic specialization of individual heads
### Transformation to Interpretability
**Primary contribution:** Spatial decomposition + interactive querying
1. **Head-level decomposition:** Display each attention head's behavior separately, allowing identification of specialized roles:
- Syntactic heads focusing on matching brackets, indentation
- Semantic heads attending to variable definitions, type hints
- Positional heads capturing code structure (function boundaries, control flow)
2. **Token-to-token attribution:** Interactive heat maps showing which prompt tokens each generated code token attends to, with normalized attention weights (0-1 scale):
- Rows = generated tokens
- Columns = prompt + context tokens
- Heat intensity = attention weight
- Hover = exact weights + source spans
3. **Attention rollout:** Composition of attention across layers (Kovaleva-style) to show information flow from input to output:
```
A_rollout = A_L × A_(L-1) × ... × A_1
```
This reveals which input tokens contribute to each output token through the entire network stack.
4. **Head role grid:** Layer × Head matrix with mini-sparklines showing mean attention to token classes:
- Delimiters (brackets, colons, commas)
- Identifiers (variable names, function names)
- Keywords (def, class, if, for)
- Comments (docstrings)
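The rollout product above can be sketched directly in NumPy. This is a minimal illustration, not the tool's implementation: averaging over heads per layer and the 0.5 identity mix that approximates residual connections are assumptions borrowed from common rollout recipes.

```python
import numpy as np

def attention_rollout(layer_attentions):
    """Compose per-layer attention maps A_L x ... x A_1 into
    input-to-output attributions.

    layer_attentions: list of (seq_len, seq_len) row-stochastic
    matrices, one per layer (assumed already averaged over heads).
    Residual connections are approximated by mixing in the identity,
    then re-normalizing each row.
    """
    seq_len = layer_attentions[0].shape[0]
    rollout = np.eye(seq_len)
    for attn in layer_attentions:
        mixed = 0.5 * attn + 0.5 * np.eye(seq_len)
        mixed /= mixed.sum(axis=-1, keepdims=True)
        rollout = mixed @ rollout  # later layers compose on the left
    return rollout
```

Row i of the result estimates how much each input token contributed to output position i through the entire stack, which is what the heat map surfaces.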
### What Code Generation Decisions It Reveals
**Specific insights for developers:**
1. **Identifier resolution:** When model generates `user.name`, which prior prompt tokens did it attend to?
- Expected: variable declaration `user = User(...)`, type hints `user: User`, docstrings describing user object
- Misalignment: over-attending to recent tokens (recency bias) instead of declaration site
2. **Syntactic correctness:** Do specific heads focus on bracket matching, indentation patterns, or control flow structure?
- Example: Head [Layer 5, Head 3] might specialize in matching opening/closing brackets
- Example: Head [Layer 8, Head 12] might attend to indentation levels for syntactic consistency
3. **Context utilization:** Is the model actually "reading" the prompt context, or over-attending to recent tokens?
- Recency bias indicator: >70% attention mass on last 5 tokens
- Long-range dependency: attention to tokens >100 positions back
4. **Error attribution:** When buggy code is generated, can we trace it to misaligned attention?
- Example: Model generates `user.get_name()` but should be `user.name` → attention shows model attended to API doc snippet instead of variable declaration
- Example: Model generates incorrect variable name β attention shows model confused two similar identifiers in context
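The recency-bias indicator above (most of the attention mass on the last few tokens) reduces to a one-line check. A sketch, assuming a single normalized attention row per generated token; the 5-token window and 70% threshold are the defaults stated above, kept configurable.

```python
import numpy as np

def recency_bias(attn_row, window=5, threshold=0.70):
    """Return (flagged, recent_mass) for one generated token's
    attention row over all prior positions (assumed to sum to 1)."""
    recent_mass = float(np.sum(attn_row[-window:]))
    return recent_mass > threshold, recent_mass
```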
### Extension Beyond Existing Literature
**Kou et al. (2024): "Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?"**
- Showed attention misalignment with human programmers
- Used aggregate metrics (averaged across heads/layers)
- Post-hoc analysis (no interactive exploration)
- Passive comparison (developers not in control)
**Your extension:**
- **Interactive head selection:** Developer chooses which head/layer to inspect in real-time
- **Code-specific annotations:** Highlight syntactic elements (keywords, identifiers, operators) with domain-specific color coding
- **Counterfactual queries:** "What if I remove this docstring? How does attention redistribute?"
- **Task-embedded evaluation:** Developers use the tool during actual code review tasks (bug detection, prompt optimization), not just correlation studies
**Paltenghi et al. (2022): "Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration"**
- Eye-tracking study comparing developer attention to model attention
- Focus on code exploration, not generation
- No interactive visualization for developers
**Your extension:**
- **Generative focus:** Attention during code generation, not just comprehension
- **Interactive tool:** Developers manipulate and query attention, not just observe
- **Causal validation:** Attention hypotheses validated via ablation (Section 3)
**Zheng et al. (2025): "Attention Heads of Large Language Models: A Survey"**
- Taxonomy of attention head discovery methods:
1. Model-free (saliency, gradient-based)
2. Modeling-required (probing classifiers)
- Primarily for ML researchers analyzing models
**Your positioning:**
- **Model-free + developer-in-the-loop:** No additional training, but leverages human domain expertise for interpretation
- **Novel category:** "Developer-driven interpretability" - non-ML-experts can explore attention patterns and form hypotheses about head roles
### Developer-Facing Research Questions
**RQ1.1: Head Role Discovery**
Can developers identify which attention heads are responsible for syntactic correctness vs semantic coherence?
**Hypothesis H1.1:** Developers using the attention visualization will correctly identify:
- Syntactic heads (bracket matching, indentation) with >70% accuracy
- Semantic heads (identifier resolution, type inference) with >60% accuracy
- Measured by: agreement with ground truth head roles (established via ablation studies)
**RQ1.2: Error Prediction**
Does seeing attention distributions improve developers' ability to predict model errors?
**Hypothesis H1.2:** Developers with attention visualization will:
- Predict buggy outputs 25% faster than baseline
- Increase bug detection accuracy by ≥15 percentage points
- Measured by: time to flag suspicious tokens, precision/recall of bug predictions
**RQ1.3: Attention-Expectation Alignment**
How do developers' attention expectations differ from model attention patterns?
**Hypothesis H1.3:** Developers will report misalignment in:
- >40% of generated tokens (model attends to unexpected sources)
- Especially for API usage and rare identifiers
- Measured by: developer annotations of "surprising" attention patterns + post-task interviews
**RQ1.4: Recency Bias Awareness**
Can developers identify when the model exhibits recency bias (over-attending to recent tokens)?
**Hypothesis H1.4:** With recency bias flags (>70% attention on last 5 tokens), developers will:
- Correctly identify recency bias cases with >80% accuracy
- Adjust prompts to mitigate bias in >50% of cases
- Measured by: flag accuracy vs ground truth, prompt modification patterns
---
## 2. Token Size & Confidence Visualization
### Opaque Mechanism Addressed
**Probability distribution over vocabulary** at each decoding step + **tokenization granularity**
**Sources of opacity:**
- ~32K-token vocabulary (Code Llama; up to ~50K in other code models), making the full distribution uninterpretable
- Softmax scores calibrated to model's training distribution, not developer confidence
- Tokenization artifacts:
- `"user"` tokenized as one token vs `"username"` as two tokens `["user", "name"]`
- Rare identifiers split into nonsensical subwords: `"pytorch"` → `["py", "tor", "ch"]`
- Hidden relationship between entropy and actual error likelihood
### Transformation to Interpretability
**Primary contribution:** Uncertainty quantification + token granularity exposure
1. **Per-token confidence scores:** Display top-k alternatives with probabilities:
```
"for" at 0.89
"while" at 0.07
"if" at 0.03
```
This shows model's uncertainty and plausible alternatives.
2. **Entropy-based uncertainty:** Shannon entropy as proxy for model uncertainty:
```
H = -Σ p_i log(p_i)
```
- High entropy = many plausible alternatives (model is guessing)
- Low entropy = one clear choice (model is confident)
3. **Tokenization visibility:** Show exact token boundaries (BPE/SentencePiece splits) to reveal when model is uncertain due to subword chunking:
- Visual: token chips with width proportional to byte length
- Chip color/opacity reflects confidence (desaturated = low confidence)
- Example: `get_user_data` might be tokenized as `["get", "_user", "_data"]` (3 tokens) vs `["get_user_data"]` (1 token)
4. **Hallucination risk indicators:** Flag tokens with high entropy + low maximum probability:
- Entropy ≥ τ_H (e.g., 1.5 nats)
- Max probability < 0.5
- This indicates model is "guessing" with no clear preference
5. **Risk hotspot flags:** Identifiers split into ≥3 subwords AND entropy peak:
- These are statistically more likely to be bugs (to be validated in user study)
- Example: `process_user_data` → `["process", "_user", "_data"]` with H = 1.8 nats → FLAG
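The per-token signals above can be combined into one small scoring routine. A sketch, assuming raw logits for a single decoding step and a pre-computed subword count per identifier; the τ_H = 1.5 nats and ≥3-subword thresholds follow the values given above.

```python
import numpy as np

def top_k_and_entropy(logits, vocab, k=3):
    """Top-k alternatives with probabilities, plus Shannon entropy (nats)."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1][:k]
    entropy = float(-np.sum(probs * np.log(probs + 1e-12)))
    return [(vocab[i], float(probs[i])) for i in order], entropy

def risk_hotspot(subword_count, entropy, tau_h=1.5, min_splits=3):
    """Flag identifiers split into >= min_splits subwords that coincide
    with an entropy peak (H >= tau_h)."""
    return subword_count >= min_splits and entropy >= tau_h
```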
### What Code Generation Decisions It Reveals
**Specific insights for developers:**
1. **Variable naming:** When model generates `usr` vs `user`, was this high-confidence choice or arbitrary selection from similar alternatives?
- Check top-k: if `["usr": 0.51, "user": 0.48]` → model is uncertain
- Check entropy: if H = 1.2 nats → borderline uncertainty
- Developer can manually select preferred alternative
2. **API usage:** Does model confidently predict correct method names (e.g., `.append()`) or waver between alternatives (`.add()`, `.push()`, `.insert()`)?
- Low confidence on API calls → likely hallucination or incorrect usage
- High confidence on incorrect API → model has learned wrong pattern (training data issue)
3. **Tokenization mismatches:** Does splitting `process_data` into `["process", "_data"]` vs `["process_", "data"]` affect model confidence?
- Hypothesis: multi-split identifiers correlate with lower confidence
- Mechanism: model's vocabulary doesn't contain full identifier, so it reconstructs from subwords
- Developer insight: use simpler identifiers (fewer underscores, camelCase) for better model confidence
4. **Implicit assumptions:** High confidence on incorrect code suggests model has learned wrong patterns:
- Example: model generates `list.append(x)` with 0.95 confidence, but list is actually a numpy array (should be `np.append(list, x)`)
- This reveals model's training data bias (more Python lists than numpy arrays in training set)
### Extension Beyond Existing Literature
**Zhao et al. (2024): "Explainability for Large Language Models: A Survey"**
- Covers probability-based explanations but mostly:
- Aggregate metrics (perplexity, log-likelihood)
- Not code-specific
- No tokenization awareness
**Your extension:**
- **Code-aware thresholds:** Calibrate "low confidence" thresholds specifically for code tokens:
- Keywords (def, class) typically high confidence
- Identifiers vary (common names high, rare names low)
- Operators high confidence
- Different threshold τ_H for each category
- **Tokenization pedagogy:** Educate developers on how BPE affects model's "view" of code:
- Most code LLM papers (Bistarelli et al., 2025 review) ignore tokenization effects
- Developers rarely aware that identifier choice affects tokenization
- Your tool makes this visible → potential prompt engineering insight
- **Alternative exploration:** Let developers click on low-confidence tokens to see *why* alternatives were plausible:
- Show attention snippet: which context tokens justified each alternative?
- Link to Attention visualization for deeper investigation
- **Real-time confidence:** Stream confidence scores during generation, not just post-hoc analysis:
- Developer can interrupt generation if confidence drops below threshold
- Useful for interactive coding assistants
### Novel Contribution: Tokenization Γ Confidence Interaction
**Gap in literature:** Most code generation papers ignore tokenization effects. But:
- `variable_name` (snake_case) vs `variableName` (camelCase) tokenized differently → different confidence profiles
- Short vs long identifier names have different entropy characteristics
- Rare API names may be split into nonsensical subwords → low confidence
**Your visualization makes this visible** - potentially novel for code LLM research.
**Hypothesis:** Multi-split identifiers (≥3 subwords) + entropy peaks predict bugs better than entropy alone.
### Developer-Facing Research Questions
**RQ1.5: Confidence-Based Bug Detection**
Can developers use token confidence to identify likely bugs faster than code inspection alone?
**Hypothesis H1.5:** Developers with confidence visualization will:
- Identify bugs 20% faster than baseline
- Increase bug detection precision by ≥10 percentage points
- Measured by: time to identify bug, precision/recall of bug locations
**RQ1.6: Tokenization Awareness**
Does seeing tokenization boundaries change developers' prompt engineering strategies?
**Hypothesis H1.6:** After using token size visualization, developers will:
- Report increased awareness of tokenization (>70% agree in post-survey)
- Adjust identifier naming in prompts (>40% of participants)
- Measured by: survey responses, prompt modification patterns in telemetry
**RQ1.7: Confidence Calibration**
Do high-confidence errors undermine trust more than low-confidence errors?
**Hypothesis H1.7:** Developers will report:
- Lower trust when high-confidence predictions are wrong (≥1 point on 7-point scale)
- Appropriate trust calibration when confidence aligns with correctness
- Measured by: Brier score (calibration metric), trust survey responses
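The Brier score named as the calibration metric above is simple to compute. A sketch, assuming per-token max-probability confidences paired with binary correctness labels (1 = token judged correct).

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between confidence and correctness;
    0 is perfect, lower is better calibrated."""
    pairs = list(zip(confidences, outcomes))
    return sum((c - o) ** 2 for c, o in pairs) / len(pairs)
```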
**RQ1.8: Bug-Risk AUC**
Do entropy × token-size hotspot flags predict actual bug locations?
**Hypothesis H1.8 (from spec):** AUC ≥ 0.70 for hotspot predictor vs actual bug locations
- Measured by: ROC curve analysis, ground truth = unit test failures + manual bug annotations
---
## 3. Ablation Visualization
### Opaque Mechanism Addressed
**Causal attribution of model components** - specifically:
- Which attention heads are critical vs redundant?
- Which layers perform feature extraction vs reasoning?
- Which feed-forward networks (FFN) contribute to code-specific decisions?
**Sources of opacity:**
- Distributed computation across 32 layers × 32 heads = 1,024 attention heads (Code Llama 7B)
- Non-linear interactions between components (head X in layer Y may depend on head Z in layer W)
- Unclear redundancy: can model compensate if one head is removed?
- Black-box causality: correlation (attention weights) ≠ causation (actual influence)
### Transformation to Interpretability
**Primary contribution:** Interactive causal intervention + comparative analysis
1. **Selective ablation:** Developer toggles individual heads, entire layers, or FFN blocks off:
- Head masking: zero out attention weights or set to uniform distribution
- Layer bypass: skip layer entirely, pass residual stream through unchanged
- FFN gate clamp: disable feed-forward network in specific layer
2. **Before/after comparison:** Side-by-side display of original output vs ablated output:
- Unified diff showing changed tokens (color-coded: added/removed/modified)
- Line-level changes for multi-line code generation
- Structural changes (AST diff) to show semantic impact
3. **Quantitative impact metrics:**
- **Token-level change rate:** % tokens that changed after ablation
- **Semantic similarity:** CodeBLEU, embedding distance (cosine similarity)
- **Syntactic correctness:** AST parse success (can code be parsed?)
- **Functional correctness:** Unit tests passed (does code work?)
- **Static analysis:** ruff/bandit warnings (code quality/security issues)
- **Δlog-prob:** Change in log-probability of each token
4. **Per-token delta heat:** Visualize Δlog-prob and Δentropy per token:
- Small multiples showing impact of ablating each of top-k heads
- Identify most-impactful heads (Δlog-prob ≥ τ_Δ, e.g., 0.1)
5. **Hypothesis testing workflow:**
- Developer predicts impact before ablation ("I think head [12,5] handles bracket matching")
- Execute ablation
- Verify prediction (did brackets break?)
- Iteratively refine mental model of head roles
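In NumPy terms, the two head-masking options and the Δlog-prob check look roughly like this. This is a sketch of the arithmetic only; in a real run the mask would be applied inside the model's attention modules (e.g., via forward hooks), which is out of scope for this snippet.

```python
import numpy as np

def ablate_head(attn, mode="uniform"):
    """Mask one head's (seq_len, seq_len) attention map: zero it out,
    or replace every row with a uniform distribution (stays stochastic)."""
    if mode == "zero":
        return np.zeros_like(attn)
    return np.full_like(attn, 1.0 / attn.shape[-1])

def impactful_tokens(logp_before, logp_after, tau_delta=0.1):
    """Indices of generated tokens whose log-prob dropped by at least
    tau_delta after ablation (the per-token Δlog-prob criterion)."""
    delta = np.asarray(logp_before) - np.asarray(logp_after)
    return np.nonzero(delta >= tau_delta)[0]
```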
### What Code Generation Decisions It Reveals
**Specific insights for developers:**
1. **Critical heads:** Identify which heads, if removed, break code generation entirely:
- Example: ablating head [Layer 3, Head 7] causes all bracket matching to fail → this head is critical for syntactic correctness
- Implication: model relies on specific architectural component for basic syntax
2. **Redundant heads:** Which heads can be removed with minimal impact?
- Example: ablating head [Layer 25, Head 14] changes only 2% of tokens → this head is redundant
- Implication: model is over-parameterized (could be pruned for efficiency)
3. **Layer specialization:** Early layers (1-8) handle tokenization/syntax, mid layers (9-20) handle semantics, late layers (21-32) handle coherence?
- Hypothesis to test via layer bypass ablations
- Example: bypassing layer 5 breaks indentation; bypassing layer 15 breaks variable scoping
4. **Bug localization:** If ablating head X fixes a bug, that head is likely causing the error:
- Example: model generates `user.get_name()` (wrong) → ablate head [18,3] → model generates `user.name` (correct)
- Causal diagnosis: head [18,3] is attending to incorrect API documentation context
### Extension Beyond Existing Literature
**Mechanistic interpretability literature (Wang et al., 2022 on GPT-2 circuits):**
- Focuses on individual mechanisms (e.g., indirect object identification circuit)
- Requires manual circuit discovery by ML researchers (slow, expert-driven)
- Not interactive or developer-facing
**Your extension:**
- **Developer-driven exploration:** Non-experts (software engineers) can perform ablations without ML knowledge
- **Code generation focus:** Ablations tailored to code tasks (syntactic correctness, API usage, variable scoping)
- **Real-time feedback:** Immediate re-generation with ablated model (not batch analysis)
- **Task-oriented ablation:** During bug fixing, developer can ablate to localize error source ("Which component is causing this bug?")
**Bansal et al. (2022): "Rethinking the Role of Scale for In-Context Learning"**
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts
**Your extension:**
- **Interactive ablation:** Developer controls which components to ablate
- **Code-specific metrics:** Unit tests, AST parse, lints (not just perplexity)
- **Hypothesis-driven workflow:** Developer predicts impact before seeing result
### Novel Contribution: Ablation as Debugging Tool
**Gap in literature:** Ablation studies are typically **research tools** (for ML researchers analyzing models), not **developer tools** (for software engineers using models).
**Your contribution:** Reframe ablation as **interactive debugging**:
- "Why did the model generate this bug?" → "Let me turn off components until it works correctly" → identifies faulty component
- This is analogous to debuggers for traditional code (set breakpoints, step through execution)
- But for neural networks: "ablation breakpoints" (turn off heads/layers), "step through architecture" (layer-by-layer pipeline)
**Potential impact:**
- Developers without ML training can perform causal analysis
- Faster bug diagnosis in LLM-generated code
- Insights for model developers (which components are most critical for code generation?)
### Attribution Ground Truth (Methodology)
A source token T_src is "influential" for generated token T_gen if:
1. T_src lies in top-k rollout sources (from Attention Visualization, k=8)
2. Masking the minimal set of heads H that carry attention from T_src → T_gen causes:
- Δlog-prob ≥ τ_Δ (e.g., 0.1) on T_gen, OR
- Flip in unit test outcome (pass → fail or vice versa)
This operational definition enables:
- Reproducible measurement of "attribution accuracy"
- Validation of attention-based hypotheses via ablation
- Inter-rater reliability (two researchers apply same criteria)
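The operational definition above translates directly into a predicate. A sketch, with rollout rank 0-indexed and the head-masking effect summarized by two pre-computed inputs (the Δlog-prob on T_gen and whether any unit test flipped).

```python
def is_influential(rollout_rank, delta_log_prob, test_flipped,
                   k=8, tau_delta=0.1):
    """T_src is influential for T_gen iff it is a top-k rollout source
    AND masking the carrying heads shifts T_gen's log-prob by at least
    tau_delta or flips a unit test outcome."""
    return rollout_rank < k and (delta_log_prob >= tau_delta or test_flipped)
```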
### Developer-Facing Research Questions
**RQ1.9: Ablation-Assisted Debugging**
Can developers without ML expertise successfully use ablation to identify causes of buggy code generation?
**Hypothesis H1.9:** Developers using ablation tool will:
- Correctly identify causal components (head/layer causing bug) in >60% of cases
- Reduce time to diagnose bug by ≥25% vs baseline
- Measured by: success rate of causal identification, time to diagnosis
**RQ1.10: Mental Model Formation**
Do developers form accurate mental models of layer/head specialization after using ablation tool?
**Hypothesis H1.10:** After ablation exploration, developers will:
- Correctly categorize heads as syntactic/semantic/positional with >65% accuracy
- Describe layer roles (early=syntax, mid=semantics, late=coherence) with >70% agreement
- Measured by: post-task categorization quiz, qualitative interview themes
**RQ1.11: Iteration Reduction**
Does ablation tool reduce iterations needed to achieve passing solution?
**Hypothesis H1.11 (from spec):** Ablation tool reduces iterations to passing solution by ≥20%
- Measured by: number of prompt modifications + code edits before all unit tests pass
**RQ1.12: Causal vs Descriptive Understanding**
Do developers distinguish between correlation (attention) and causation (ablation)?
**Hypothesis H1.12:** Developers will:
- Request ablation validation for >50% of attention-based hypotheses
- Report understanding that "attention ≠ causation" (>80% agreement in survey)
- Measured by: telemetry (how often developers cross-reference Attention + Ablation), survey responses
---
## 4. Pipeline Visualization
### Opaque Mechanism Addressed
**Layer-by-layer representation transformation** - the "forward pass" through 32 transformer layers where:
- Input embeddings gradually transform into output logits
- Each layer applies a self-attention sub-layer and an FFN sub-layer, each wrapped in layer norm and a residual connection
- Intermediate representations are high-dimensional (hidden_dim = 4096 for Code Llama 7B) and semantically opaque
**Sources of opacity:**
- No visibility into intermediate states (black box from input → output)
- Unclear where "understanding" emerges (early vs late layers?)
- Unknown bottlenecks (which layers struggle most? where does model get confused?)
- Residual connections create complex information flow (not simple feedforward)
### Transformation to Interpretability
**Primary contribution:** Temporal decomposition + interpretable layer-level signals
1. **Layer-by-layer scrubbing:** Timeline UI to "scrub" through layers 0→32, showing how representations evolve:
- Visualize as swimlane: horizontal axis = layers, vertical axis = tokens
- Each "swim" represents one token's journey through the architecture
- Color intensity = uncertainty (entropy) at that layer
2. **Interpretable signals (not raw activations):**
- **Residual-norm z-scores:** How much each layer changes the representation
```
z_l = (||x_l|| - μ_l) / σ_l
```
- High z → layer is "working hard" (significant transformation)
- Low z → layer passes information through with minimal change
- **Entropy shift:** Change in output entropy from pre- to post-layer
```
ΔH_l = H(logits after layer l) - H(logits before layer l)
```
- Negative ΔH → layer reduces uncertainty (good)
- Positive ΔH → layer increases uncertainty (confusion)
- **Attention-flow saturation:** % of attention mass concentrated on top-m positions
```
Saturation = Σ(top-m attention weights) / Σ(all attention weights)
```
- High saturation → focused attention (model is certain about sources)
- Low saturation → diffuse attention (model is uncertain)
- **Router load (MoE only):** Which experts activate in mixture-of-experts layers
- Expert IDs + gate weights
- Imbalance metric (are all experts used equally?)
3. **Swimlane/Timeline view:**
- Lanes: Tokenizer → Embeddings → Layer 1 → ... → Layer 32 → Logits → Sampler → Post-proc/Tests
- Rectangle length = time per stage (latency profiling)
- Color = uncertainty (entropy)
- Hover = per-stage stats (residual-z, ΔH, saturation, latency)
4. **Bottleneck identification:**
- Flag layers in top-q percentile (e.g., top 10%) of:
- Latency (slowest layers)
- Residual-norm spikes (largest transformations)
- Entropy jumps (biggest increases in uncertainty)
- Correlate bottlenecks with sampler behavior (does entropy spike → hallucination?)
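The three intrinsic signals defined above each reduce to a few lines. A sketch, assuming per-layer representation norms, pre/post-layer entropies, and a single attention row; μ_l and σ_l are taken as pre-computed calibration statistics.

```python
import numpy as np

def residual_z(norms, mu, sigma):
    """Residual-norm z-scores per layer: z_l = (||x_l|| - mu_l) / sigma_l."""
    return (np.asarray(norms) - np.asarray(mu)) / np.asarray(sigma)

def entropy_shift(h_before, h_after):
    """ΔH_l per layer; negative values mean the layer reduced uncertainty."""
    return np.asarray(h_after) - np.asarray(h_before)

def saturation(attn_row, m=5):
    """Fraction of one row's attention mass on its top-m positions."""
    row = np.sort(np.asarray(attn_row, dtype=float))[::-1]
    return float(row[:m].sum() / row.sum())
```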
### What Code Generation Decisions It Reveals
**Specific insights for developers:**
1. **Emergence of syntax:** At which layer does model "realize" it's generating a function?
- Likely when indentation pattern appears, `def` keyword generated
- Measure: residual-norm spike at layer where syntactic structure emerges
- Example: Layer 5 shows high residual-z when generating `def factorial(n):`
2. **Semantic shift:** Can we observe when model transitions from "reading prompt" (early layers) to "generating code" (late layers)?
- Early layers: high attention to prompt tokens, low residual-norm
- Mid layers: residual-norm increases (processing semantics)
- Late layers: attention shifts to recent generated tokens (auto-regressive generation)
3. **Error propagation:** If model generates bug at token T, can we trace back to which layer introduced the error?
- Look for entropy spike or residual-norm anomaly in layers before T
- Example: Model generates wrong variable name at token 50 → entropy jumps at layer 18 → investigate what happened at layer 18
4. **Compute allocation:** Which layers consume most compute? (Implications for model optimization)
- Latency profiling shows bottleneck layers
- Pruning candidates: layers with low residual-norm (minimal transformation) + high latency
### Extension Beyond Existing Literature
**Bansal et al. (2022) on in-context learning at 66B scale:**
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts
- Static analysis (not real-time exploration)
**Your extension:**
- **Code-specific annotations:** Label layers with code-relevant milestones:
- "Layer 8: syntax tree formed"
- "Layer 20: variable scope resolved"
- "Layer 28: stylistic formatting applied"
- **Multi-token tracking:** Show pipeline evolution across multiple generated tokens (not just one forward pass)
- **Developer-friendly abstractions:** Avoid technical jargon (hidden states, residual stream) → use "understanding evolution", "decision stages"
- **Comparative pipelines:** Show pipeline for correct vs buggy outputs side-by-side (where do they diverge?)
**Interpretability papers (general):**
- Focus on probing classifiers to test "what does layer X know?"
- Require training additional models (probes)
- Not interactive or real-time
**Your extension:**
- **No additional training:** Use intrinsic signals (residual-norm, entropy)
- **Real-time:** Compute signals during generation (< 10ms overhead)
- **Actionable:** Developer can bypass layers to test hypotheses
### Novel Contribution: Layer-Level Taxonomy for Code Generation
**Gap in literature:** No established taxonomy of what each transformer layer does during **code generation** specifically.
- Zheng et al. (2025) survey attention heads, but not layer-level roles
- Interpretability papers focus on language tasks (next-word prediction, sentiment, Q&A)
- Code generation is different: requires syntax, semantics, formatting, executable correctness
**Your contribution:** Empirically identify layer specialization for code:
1. **Layers 1-5: Tokenization + basic syntax**
- Residual-norm spikes when processing delimiters, keywords
- Attention focuses on local syntax (brackets, colons)
2. **Layers 6-15: Semantic understanding**
- Residual-norm increases during identifier resolution
- Attention to variable declarations, type hints, docstrings
- Entropy decreases (model becomes more certain about semantics)
3. **Layers 16-25: Reasoning/logic**
- Residual-norm spikes during control flow generation (if/else, loops)
- Attention to prompt logic + recent generated code
- Entropy may increase temporarily (exploring logical alternatives)
4. **Layers 26-32: Fluency/formatting**
- Low residual-norm (minor refinements)
- Attention to recent tokens (auto-regressive)
- Entropy decreases (finalizing token choices)
**If validated, this would be novel for code LLMs and could be Paper 1 contribution.**
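One way to test the taxonomy empirically is to summarize per-layer residual norms by hypothesized stage. A sketch for a 32-layer model, hardcoding the stage boundaries proposed above (the boundaries are hypotheses, not validated findings):

```python
import numpy as np

# Hypothesized stage boundaries from the taxonomy above (1-indexed layers).
STAGES = {
    "syntax":     range(1, 6),    # layers 1-5
    "semantics":  range(6, 16),   # layers 6-15
    "reasoning":  range(16, 26),  # layers 16-25
    "formatting": range(26, 33),  # layers 26-32
}

def stage_profile(per_layer_norms):
    """Mean residual norm per hypothesized stage.

    per_layer_norms: dict {layer_index: norm} for a 32-layer model.
    """
    return {name: float(np.mean([per_layer_norms[l] for l in layers
                                 if l in per_layer_norms]))
            for name, layers in STAGES.items()}
```

Comparing these profiles across token categories (delimiters vs identifiers vs control-flow keywords) would be the direct test of the claimed specialization.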
### Developer-Facing Research Questions
**RQ1.13: Layer Decision Identification**
Can developers identify at which layer the model "decides" on code structure (e.g., loop vs conditional)?
**Hypothesis H1.13:** Developers using pipeline visualization will:
- Correctly identify decision layer within ±3 layers in >55% of cases
- Report increased understanding of model's "thinking process" (>75% agreement)
- Measured by: layer identification accuracy (ground truth = residual-norm + entropy spike analysis), survey responses
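The "residual-norm + entropy spike" ground truth could be operationalized as a simple per-layer z-score analysis (an illustrative stand-in, not a validated procedure; the threshold is a placeholder):

```python
import numpy as np

def decision_layer(res_norms, entropies, z_thresh=1.5):
    """Pick the layer whose combined residual-norm spike and entropy-drop
    z-score is largest. Returns (layer index, whether it clears threshold).
    """
    r = np.asarray(res_norms, dtype=float)
    e = np.asarray(entropies, dtype=float)
    rz = (r - r.mean()) / (r.std() + 1e-9)   # standardized norm spike
    ez = -(np.diff(e, prepend=e[0]))         # entropy drop at each layer
    ez = (ez - ez.mean()) / (ez.std() + 1e-9)
    score = rz + ez
    layer = int(np.argmax(score))
    return layer, bool(score[layer] > z_thresh)
```

Developer guesses would then be scored against this automatically derived layer, within the ±3-layer tolerance.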
**RQ1.14: Next-Token Prediction Improvement**
Does seeing pipeline evolution improve developers' ability to predict subsequent tokens?
**Hypothesis H1.14 (from spec):** Pipeline summaries improve next-token prediction accuracy
- Developers predict next token after seeing pipeline → compare with baseline (no pipeline)
- Expected improvement: +10-15 percentage points in top-3 accuracy
- Measured by: prediction task (5 examples per participant)
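Scoring the prediction task reduces to a small top-k accuracy function; a sketch (argument names are illustrative):

```python
def top_k_accuracy(predictions, targets, k=3):
    """Fraction of trials where the true next token appears in the
    participant's top-k guesses.

    predictions: list of guess lists, one per trial (best guess first)
    targets: the actual next token for each trial
    """
    hits = sum(1 for guesses, target in zip(predictions, targets)
               if target in guesses[:k])
    return hits / len(targets)
```

The hypothesized +10-15 point improvement is the difference in this score between the pipeline and no-pipeline conditions.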
**RQ1.15: Error Localization**
Can developers use pipeline visualization to diagnose *where* in the model an error originates?
**Hypothesis H1.15:** Developers will:
- Identify error-causing layer within ±5 layers in >50% of cases
- Reduce time to diagnose error source by ≥20% vs baseline
- Measured by: layer identification accuracy, time to diagnosis
**RQ1.16: Actionable Insights for Prompting**
Can developers use layer knowledge to improve prompts?
**Hypothesis H1.16:** After seeing pipeline, developers will:
- Adjust prompts to provide more context for early layers (syntax/semantics) in >30% of cases
- Report understanding of "what the model needs" (>70% agreement)
- Measured by: prompt modification patterns in telemetry, survey responses
---
## Cross-Cutting Contributions
### 1. Unified Glass-Box Dashboard
**Gap in literature:** Prior work (Kou et al., Paltenghi et al., Zhao et al.) focuses on **single mechanisms** in isolation.
**Your dashboard integrates:**
- **Attention** (spatial attribution)
- **Token Size & Confidence** (probabilistic uncertainty + tokenization)
- **Ablation** (causal attribution)
- **Pipeline** (temporal evolution)
**Developer can triangulate across multiple lenses:**
- Example: "Low confidence + scattered attention + early-layer bottleneck → likely hallucination"
- Example: "High confidence + focused attention + but ablating head X fixes bug β head X is overriding correct information"
**This holistic view is novel for code generation interpretability.**
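The triangulation examples above could be encoded as explicit first-pass rules in the dashboard; a toy sketch in which the signal names and cutoffs are assumptions, not measured values:

```python
def triangulate(confidence, attention_focus, bottleneck_layer, n_layers=32):
    """Toy rule-of-thumb combining lenses, mirroring the examples above.

    confidence:       mean token probability in [0, 1]
    attention_focus:  concentration of attention mass in [0, 1]
    bottleneck_layer: layer with the dominant residual-norm spike
    All thresholds are illustrative placeholders.
    """
    if (confidence < 0.4 and attention_focus < 0.3
            and bottleneck_layer < n_layers // 4):
        return ("possible hallucination: low confidence, scattered "
                "attention, early-layer bottleneck")
    if confidence > 0.8 and attention_focus > 0.7:
        return ("confident and focused: inspect ablation to test for "
                "overriding heads")
    return "no single-lens diagnosis: compare lenses interactively"
```

Whether such fixed rules generalize is itself an empirical question the study can probe.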
### 2. Task-Based Developer Study
**Gap:** Most interpretability papers evaluate on:
- Synthetic tasks (toy models, simple examples)
- Researcher-driven analysis (no end-users)
- Post-hoc metrics (accuracy, perplexity)
**Your study evaluates with:**
- **n=18-24 software engineers** doing realistic code tasks (bug detection, code review, prompt optimization)
- **In-the-loop**: Developers use visualizations during task (not passive observation)
- **Actionable interpretability**: Measure whether visualizations improve task performance (time, accuracy, trust)
**This is HCI-grounded interpretability research**, not just ML analysis.
### 3. Code Generation Domain Specificity
**Gap:** Explainability surveys (Zhao et al.) are domain-agnostic. Code has unique properties:
- **Syntactic correctness is binary** (parsable or not) → enables AST-based metrics
- **Semantic correctness is testable** (unit tests) → enables test-based metrics
- **Developer expertise varies** (junior vs senior) → enables expertise-based analysis
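The first two properties are cheap to measure for Python targets using only the standard library; a sketch (a real study should run `passes_tests` in a sandboxed subprocess, not `exec` in the host process):

```python
import ast

def is_syntactically_valid(code: str) -> bool:
    """Binary syntactic correctness via Python's own parser."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def passes_tests(code: str, tests: str) -> bool:
    """Semantic correctness: run generated code plus its unit tests
    in a scratch namespace."""
    namespace = {}
    try:
        exec(code, namespace)
        exec(tests, namespace)
        return True
    except Exception:
        return False
```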
**Your visualizations tailored to code:**
- **Syntax highlighting** in attention maps (keywords, identifiers, operators color-coded)
- **Tokenization awareness** for identifiers (rare in NLP interpretability)
- **Ablation targeting code-specific heads** (bracket matching, indentation, API usage)
- **Pipeline stages mapped to code generation phases** (syntax → semantics → logic → formatting)
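Tokenization awareness matters because identifiers routinely fragment into several subword tokens. A greedy longest-match splitter sketches the effect (a stand-in for real BPE; the toy vocabulary is hypothetical):

```python
def fragment(identifier, vocab):
    """Greedy longest-match split over a subword vocabulary, falling
    back to single characters — illustrates how an out-of-vocabulary
    identifier fragments into several tokens."""
    pieces, i = [], 0
    while i < len(identifier):
        for j in range(len(identifier), i, -1):
            if identifier[i:j] in vocab or j == i + 1:
                pieces.append(identifier[i:j])
                i = j
                break
    return pieces
```

A name like `get_user_name` splitting into five pieces means the confidence view must aggregate per-token probabilities back to one identifier, which NLP interpretability tools rarely do.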
### 4. Interventionist Interpretability
**Gap:** Most explainability tools are **passive** (show model behavior).
**Your dashboard is *active*:**
- **Ablation allows causal intervention** ("What if I remove this head?")
- **Confidence allows alternative exploration** ("What else could the model have generated?")
- **Pipeline allows temporal investigation** ("Where did the model's understanding emerge?")
**Developers don't just observe: they manipulate and test hypotheses.**
**This is closer to scientist-model interaction (hypothesis-driven) than user-model consumption (passive).**
---
## Literature Positioning Summary
| Your Contribution | Related Work | Gap You Address |
|-------------------|--------------|-----------------|
| **Attention Viz** | Kou et al. (2024) - attention alignment | Interactive, per-head, code-specific, hypothesis-driven |
| **Token Confidence** | Zhao et al. (2024) - prob explanations | Tokenization awareness, code thresholds, bug prediction |
| **Ablation Viz** | Wang et al. (2022) - mechanistic interpretability | Developer-facing, real-time, code metrics (tests/AST) |
| **Pipeline Viz** | Bansal et al. (2022) - layer interventions | Code-specific stages, interpretable signals, interactive |
| **Unified Dashboard** | - | First multi-mechanism glass-box for code LLMs |
| **Developer Study** | Paltenghi et al. (2022) - eye-tracking | Task-based, in-the-loop, actionable metrics |
| **Code Specificity** | - | Syntax/test metrics, tokenization, developer expertise |
| **Interventionist** | - | Ablation, alternatives, hypothesis testing |
---
## Thesis Structure Suggestions
### Chapter 1: Introduction
- **Motivation:** Developers treat LLMs as black boxes → trust issues, debugging difficulties
- **Gap:** Prior work lacks interactive, developer-facing, multi-mechanism dashboards for code
- **Contribution:** First glass-box dashboard integrating 4 interpretability lenses + developer study
### Chapter 2: Literature Review
- **Section 2.1:** Attention in LLMs (Zheng et al., Kou et al.)
- **Section 2.2:** Explainability methods (Zhao et al.)
- **Section 2.3:** Code generation LLMs (Bistarelli et al.)
- **Section 2.4:** Developer-AI interaction (Paltenghi et al.)
- **Section 2.5:** Mechanistic interpretability (Wang et al., Bansal et al.)
### Chapter 3: Methodology (RQ1 Focus)
- **Section 3.1:** Attention Visualization
- **Section 3.2:** Token Size & Confidence Visualization
- **Section 3.3:** Ablation Visualization
- **Section 3.4:** Pipeline Visualization
- **Section 3.5:** Dashboard Integration
### Chapter 4: User Study Design
- **Section 4.1:** Participants (n=18-24 software engineers)
- **Section 4.2:** Tasks (T1, T2, T3)
- **Section 4.3:** Metrics (quantitative + qualitative)
- **Section 4.4:** Protocol (within-subjects, Latin square)
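For the within-subjects protocol, a cyclic Latin square is enough to balance the serial position of the visualization conditions across participants (a Williams design would additionally balance first-order carryover); a sketch:

```python
def latin_square(conditions):
    """Cyclic Latin square: row k is the condition list rotated by k,
    so each condition appears exactly once in each serial position."""
    n = len(conditions)
    return [[conditions[(i + k) % n] for i in range(n)] for k in range(n)]
```

With n=18-24 participants and a small number of conditions, each row of the square is assigned to several participants in rotation.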
### Chapter 5: Results
- **Section 5.1:** RQ1.1-RQ1.4 (Attention)
- **Section 5.2:** RQ1.5-RQ1.8 (Token Confidence)
- **Section 5.3:** RQ1.9-RQ1.12 (Ablation)
- **Section 5.4:** RQ1.13-RQ1.16 (Pipeline)
- **Section 5.5:** Cross-Cutting Themes
### Chapter 6: Discussion
- **Section 6.1:** Interpretability for Developers (not just researchers)
- **Section 6.2:** Code-Specific Insights (tokenization, syntax, tests)
- **Section 6.3:** Limitations & Future Work
### Chapter 7: Conclusion
- **Summary of Contributions**
- **Implications for Practice** (tool design for developers)
- **Implications for Research** (novel layer taxonomy, ablation as debugging)
---
## ICML Paper 1 Suggestions
**Title:** "Making Transformer Architecture Transparent for Code Generation: A Developer-Centric Study"
**Abstract Structure:**
1. **Problem:** Developers use code LLMs as black boxes → trust/debugging issues
2. **Gap:** Prior interpretability work not developer-facing or code-specific
3. **Solution:** Glass-box dashboard with 4 visualizations (Attention, Token Confidence, Ablation, Pipeline)
4. **Study:** n=18-24 software engineers on 3 code tasks
5. **Results:** (placeholder for actual results)
- Attention viz improves source identification (H1-Attn)
- Token confidence flags predict bugs (H2-Tok, AUC ≥ 0.70)
- Ablation reduces debugging iterations (H3-Abl, -20%)
- Pipeline improves error localization (H4-Pipe)
6. **Contribution:** First empirical evidence that multi-mechanism interpretability tools improve developer performance on code tasks
**Sections:**
1. Introduction
2. Related Work
3. Dashboard Design (4 visualizations)
4. User Study
5. Results
6. Discussion
7. Conclusion
**Target:** ICML 2026 (submission ~January 2026)
---
**End of RQ1 Mapping Document**