# RQ1 Mapping: How Each Visualization Addresses Architectural Transparency

**Research Question 1:** "How can we transform opaque architectural mechanisms (multi-head attention, feed-forward networks, mixture-of-experts routing) into interpretable visual representations that reveal how LLMs make code generation decisions?"

**Document Version:** 1.0
**Date:** 2025-11-01
**Author:** Gary Boon, Northumbria University

---

## Executive Summary

This document maps each of the 4 visualizations (Attention, Token Size & Confidence, Ablation, Pipeline) to RQ1, explaining:
1. What opaque mechanism each visualization addresses
2. How it transforms that mechanism into an interpretable representation
3. What code generation decisions it reveals
4. How it extends beyond existing literature
5. Specific research sub-questions for the user study

---

## 1. Attention Visualization (QKV Explorer)

### Opaque Mechanism Addressed

**Multi-head self-attention** - the fundamental mechanism by which transformers weight input tokens when generating each output token.

**Sources of opacity:**
- 32+ heads operating in parallel (Code Llama 7B has 32 heads × 32 layers = 1,024 attention heads)
- High-dimensional attention score matrices (hidden_dim × seq_length)
- Non-interpretable weight distributions across heads
- Unclear semantic specialization of individual heads

### Transformation to Interpretability

**Primary contribution:** Spatial decomposition + interactive querying

1. **Head-level decomposition:** Display each attention head's behavior separately, allowing identification of specialized roles:
   - Syntactic heads focusing on matching brackets, indentation
   - Semantic heads attending to variable definitions, type hints
   - Positional heads capturing code structure (function boundaries, control flow)

2. **Token-to-token attribution:** Interactive heat maps showing which prompt tokens each generated code token attends to, with normalized attention weights (0-1 scale):
   - Rows = generated tokens
   - Columns = prompt + context tokens
   - Heat intensity = attention weight
   - Hover = exact weights + source spans

3. **Attention rollout:** Composition of attention across layers (Abnar & Zuidema-style rollout) to show information flow from input to output:
   ```
   A_rollout = A_L × A_(L-1) × ... × A_1
   ```
   This reveals which input tokens contribute to each output token through the entire network stack.

4. **Head role grid:** Layer × Head matrix with mini-sparklines showing mean attention to token classes:
   - Delimiters (brackets, colons, commas)
   - Identifiers (variable names, function names)
   - Keywords (def, class, if, for)
   - Comments (docstrings)
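A minimal sketch of the rollout computation, using plain Python lists and two toy head-averaged attention matrices (illustrative values, not real model outputs):

```python
def matmul(a, b):
    """Product of two square row-major matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def attention_rollout(attn_layers):
    """Compose per-layer attention: A_rollout = A_L x A_(L-1) x ... x A_1.

    attn_layers[l] is the (head-averaged) seq x seq attention matrix for
    layer l; row i is token i's attention distribution over positions.
    """
    rollout = attn_layers[0]
    for layer_attn in attn_layers[1:]:
        rollout = matmul(layer_attn, rollout)  # later layers compose on the left
    return rollout

# Toy example: two layers over a 2-token sequence.
layer1 = [[0.9, 0.1],
          [0.5, 0.5]]
layer2 = [[0.6, 0.4],
          [0.2, 0.8]]
combined = attention_rollout([layer1, layer2])
# Each row of `combined` is still a distribution over input tokens,
# so it can be read as end-to-end source attribution.
```

Variants of rollout also mix the residual connection into each layer (e.g. averaging A_l with the identity) before composing; the plain product above matches the formula as written.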

### What Code Generation Decisions It Reveals

**Specific insights for developers:**

1. **Identifier resolution:** When model generates `user.name`, which prior prompt tokens did it attend to?
   - Expected: variable declaration `user = User(...)`, type hints `user: User`, docstrings describing user object
   - Misalignment: over-attending to recent tokens (recency bias) instead of declaration site

2. **Syntactic correctness:** Do specific heads focus on bracket matching, indentation patterns, or control flow structure?
   - Example: Head [Layer 5, Head 3] might specialize in matching opening/closing brackets
   - Example: Head [Layer 8, Head 12] might attend to indentation levels for syntactic consistency

3. **Context utilization:** Is the model actually "reading" the prompt context, or over-attending to recent tokens?
   - Recency bias indicator: >70% attention mass on last 5 tokens
   - Long-range dependency: attention to tokens >100 positions back

4. **Error attribution:** When buggy code is generated, can we trace it to misaligned attention?
   - Example: Model generates `user.get_name()` but should be `user.name` → attention shows model attended to API doc snippet instead of variable declaration
   - Example: Model generates incorrect variable name → attention shows model confused two similar identifiers in context
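The recency-bias indicator above (>70% of attention mass on the last 5 tokens) reduces to a one-line check; `has_recency_bias` is a hypothetical helper, sketched over a single generated token's attention row:

```python
def has_recency_bias(attn_row, window=5, threshold=0.70):
    """Flag recency bias: the fraction of this generated token's
    attention mass falling on the last `window` context positions
    exceeds `threshold` (defaults follow the indicator in the text)."""
    recent = sum(attn_row[-window:])
    return recent / sum(attn_row) > threshold

# 90% of the mass on the last five positions -> flagged.
biased = has_recency_bias([0.01] * 10 + [0.18] * 5)
# Uniform attention over 10 positions -> not flagged.
spread = has_recency_bias([0.1] * 10)
```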

### Extension Beyond Existing Literature

**Kou et al. (2024): "Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?"**
- Showed attention misalignment with human programmers
- Used aggregate metrics (averaged across heads/layers)
- Post-hoc analysis (no interactive exploration)
- Passive comparison (developers not in control)

**Your extension:**
- **Interactive head selection:** Developer chooses which head/layer to inspect in real-time
- **Code-specific annotations:** Highlight syntactic elements (keywords, identifiers, operators) with domain-specific color coding
- **Counterfactual queries:** "What if I remove this docstring? How does attention redistribute?"
- **Task-embedded evaluation:** Developers use the tool during actual code review tasks (bug detection, prompt optimization), not just correlation studies

**Paltenghi et al. (2022): "Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration"**
- Eye-tracking study comparing developer attention to model attention
- Focus on code exploration, not generation
- No interactive visualization for developers

**Your extension:**
- **Generative focus:** Attention during code generation, not just comprehension
- **Interactive tool:** Developers manipulate and query attention, not just observe
- **Causal validation:** Attention hypotheses validated via ablation (Section 3)

**Zheng et al. (2025): "Attention Heads of Large Language Models: A Survey"**
- Taxonomy of attention head discovery methods:
  1. Model-free (saliency, gradient-based)
  2. Modeling-required (probing classifiers)
- Primarily for ML researchers analyzing models

**Your positioning:**
- **Model-free + developer-in-the-loop:** No additional training, but leverages human domain expertise for interpretation
- **Novel category:** "Developer-driven interpretability" - non-ML-experts can explore attention patterns and form hypotheses about head roles

### Developer-Facing Research Questions

**RQ1.1: Head Role Discovery**
Can developers identify which attention heads are responsible for syntactic correctness vs semantic coherence?

**Hypothesis H1.1:** Developers using the attention visualization will correctly identify:
- Syntactic heads (bracket matching, indentation) with >70% accuracy
- Semantic heads (identifier resolution, type inference) with >60% accuracy
- Measured by: agreement with ground truth head roles (established via ablation studies)

**RQ1.2: Error Prediction**
Does seeing attention distributions improve developers' ability to predict model errors?

**Hypothesis H1.2:** Developers with attention visualization will:
- Predict buggy outputs 25% faster than baseline
- Increase bug detection accuracy by ≥15 percentage points
- Measured by: time to flag suspicious tokens, precision/recall of bug predictions

**RQ1.3: Attention-Expectation Alignment**
How do developers' attention expectations differ from model attention patterns?

**Hypothesis H1.3:** Developers will report misalignment in:
- >40% of generated tokens (model attends to unexpected sources)
- Especially for API usage and rare identifiers
- Measured by: developer annotations of "surprising" attention patterns + post-task interviews

**RQ1.4: Recency Bias Awareness**
Can developers identify when the model exhibits recency bias (over-attending to recent tokens)?

**Hypothesis H1.4:** With recency bias flags (>70% attention on last 5 tokens), developers will:
- Correctly identify recency bias cases with >80% accuracy
- Adjust prompts to mitigate bias in >50% of cases
- Measured by: flag accuracy vs ground truth, prompt modification patterns

---

## 2. Token Size & Confidence Visualization

### Opaque Mechanism Addressed

**Probability distribution over vocabulary** at each decoding step + **tokenization granularity**

**Sources of opacity:**
- 32K-50K vocab size (Code Llama) making full distribution uninterpretable
- Softmax scores calibrated to model's training distribution, not developer confidence
- Tokenization artifacts:
  - `"user"` tokenized as one token vs `"username"` as two tokens `["user", "name"]`
  - Rare identifiers split into nonsensical subwords: `"pytorch"` → `["py", "tor", "ch"]`
- Hidden relationship between entropy and actual error likelihood

### Transformation to Interpretability

**Primary contribution:** Uncertainty quantification + token granularity exposure

1. **Per-token confidence scores:** Display top-k alternatives with probabilities:
   ```
   "for" at 0.89
   "while" at 0.07
   "if" at 0.03
   ```
   This shows model's uncertainty and plausible alternatives.

2. **Entropy-based uncertainty:** Shannon entropy as proxy for model uncertainty:
   ```
   H = -∑ p_i log(p_i)
   ```
   - High entropy = many plausible alternatives (model is guessing)
   - Low entropy = one clear choice (model is confident)

3. **Tokenization visibility:** Show exact token boundaries (BPE/SentencePiece splits) to reveal when model is uncertain due to subword chunking:
   - Visual: token chips with width proportional to byte length
   - Chip color/opacity reflects confidence (desaturated = low confidence)
   - Example: `get_user_data` might be tokenized as `["get", "_user", "_data"]` (3 tokens) vs `["get_user_data"]` (1 token)

4. **Hallucination risk indicators:** Flag tokens with high entropy + low maximum probability:
   - Entropy ≥ τ_H (e.g., 1.5 nats)
   - Max probability < 0.5
   - This indicates model is "guessing" with no clear preference

5. **Risk hotspot flags:** Identifiers split into ≥3 subwords AND entropy peak:
   - These are statistically more likely to be bugs (to be validated in user study)
   - Example: `process_user_data` → `["process", "_user", "_data"]` with H = 1.8 nats → FLAG
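Both the entropy computation and the hotspot rule can be sketched directly from the definitions above; the threshold defaults are the illustrative values from the text.

```python
import math

def entropy_nats(probs):
    """Shannon entropy H = -sum(p * ln p), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def is_risk_hotspot(subword_pieces, top_k_probs,
                    entropy_threshold=1.5, max_prob_threshold=0.5):
    """Flag an identifier when it is split into >= 3 subwords AND the
    model is 'guessing': entropy at or above the threshold with no
    alternative reaching max_prob_threshold."""
    guessing = (entropy_nats(top_k_probs) >= entropy_threshold
                and max(top_k_probs) < max_prob_threshold)
    return len(subword_pieces) >= 3 and guessing

# A 3-way split with a flat top-k distribution trips the flag.
flagged = is_risk_hotspot(["process", "_user", "_data"],
                          [0.3, 0.25, 0.2, 0.15, 0.1])
```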

### What Code Generation Decisions It Reveals

**Specific insights for developers:**

1. **Variable naming:** When model generates `usr` vs `user`, was this high-confidence choice or arbitrary selection from similar alternatives?
   - Check top-k: if `["usr": 0.51, "user": 0.48]` → model is uncertain
   - Check entropy: if H = 1.2 nats → borderline uncertainty
   - Developer can manually select preferred alternative

2. **API usage:** Does model confidently predict correct method names (e.g., `.append()`) or waver between alternatives (`.add()`, `.push()`, `.insert()`)?
   - Low confidence on API calls → likely hallucination or incorrect usage
   - High confidence on incorrect API → model has learned wrong pattern (training data issue)

3. **Tokenization mismatches:** Does splitting `process_data` into `["process", "_data"]` vs `["process_", "data"]` affect model confidence?
   - Hypothesis: multi-split identifiers correlate with lower confidence
   - Mechanism: model's vocabulary doesn't contain full identifier, so it reconstructs from subwords
   - Developer insight: use simpler identifiers (fewer underscores, camelCase) for better model confidence

4. **Implicit assumptions:** High confidence on incorrect code suggests model has learned wrong patterns:
   - Example: model generates `list.append(x)` with 0.95 confidence, but list is actually a numpy array (should be `np.append(list, x)`)
   - This reveals model's training data bias (more Python lists than numpy arrays in training set)

### Extension Beyond Existing Literature

**Zhao et al. (2024): "Explainability for Large Language Models: A Survey"**
- Covers probability-based explanations but mostly:
  - Aggregate metrics (perplexity, log-likelihood)
  - Not code-specific
  - No tokenization awareness

**Your extension:**
- **Code-aware thresholds:** Calibrate "low confidence" thresholds specifically for code tokens:
  - Keywords (def, class) typically high confidence
  - Identifiers vary (common names high, rare names low)
  - Operators high confidence
  - Different threshold τ_H for each category

- **Tokenization pedagogy:** Educate developers on how BPE affects model's "view" of code:
  - Most code LLM papers (Bistarelli et al., 2025 review) ignore tokenization effects
  - Developers rarely aware that identifier choice affects tokenization
  - Your tool makes this visible → potential prompt engineering insight

- **Alternative exploration:** Let developers click on low-confidence tokens to see *why* alternatives were plausible:
  - Show attention snippet: which context tokens justified each alternative?
  - Link to Attention visualization for deeper investigation

- **Real-time confidence:** Stream confidence scores during generation, not just post-hoc analysis:
  - Developer can interrupt generation if confidence drops below threshold
  - Useful for interactive coding assistants

### Novel Contribution: Tokenization Γ— Confidence Interaction

**Gap in literature:** Most code generation papers ignore tokenization effects. But:
- `variable_name` (snake_case) vs `variableName` (camelCase) tokenized differently → different confidence profiles
- Short vs long identifier names have different entropy characteristics
- Rare API names may be split into nonsensical subwords → low confidence

**Your visualization makes this visible** - potentially novel for code LLM research.

**Hypothesis:** Multi-split identifiers (≥3 subwords) + entropy peaks predict bugs better than entropy alone.

### Developer-Facing Research Questions

**RQ1.5: Confidence-Based Bug Detection**
Can developers use token confidence to identify likely bugs faster than code inspection alone?

**Hypothesis H1.5:** Developers with confidence visualization will:
- Identify bugs 20% faster than baseline
- Increase bug detection precision by ≥10 percentage points
- Measured by: time to identify bug, precision/recall of bug locations

**RQ1.6: Tokenization Awareness**
Does seeing tokenization boundaries change developers' prompt engineering strategies?

**Hypothesis H1.6:** After using token size visualization, developers will:
- Report increased awareness of tokenization (>70% agree in post-survey)
- Adjust identifier naming in prompts (>40% of participants)
- Measured by: survey responses, prompt modification patterns in telemetry

**RQ1.7: Confidence Calibration**
Do high-confidence errors undermine trust more than low-confidence errors?

**Hypothesis H1.7:** Developers will report:
- Lower trust when high-confidence predictions are wrong (≥1 point on 7-point scale)
- Appropriate trust calibration when confidence aligns with correctness
- Measured by: Brier score (calibration metric), trust survey responses

**RQ1.8: Bug-Risk AUC**
Do entropy × token-size hotspot flags predict actual bug locations?

**Hypothesis H1.8 (from spec):** AUC ≥ 0.70 for hotspot predictor vs actual bug locations
- Measured by: ROC curve analysis, ground truth = unit test failures + manual bug annotations

---

## 3. Ablation Visualization

### Opaque Mechanism Addressed

**Causal attribution of model components** - specifically:
- Which attention heads are critical vs redundant?
- Which layers perform feature extraction vs reasoning?
- Which feed-forward networks (FFN) contribute to code-specific decisions?

**Sources of opacity:**
- Distributed computation across 32 layers × 32 heads = 1,024 attention heads (Code Llama 7B)
- Non-linear interactions between components (head X in layer Y may depend on head Z in layer W)
- Unclear redundancy: can model compensate if one head is removed?
- Black-box causality: correlation (attention weights) ≠ causation (actual influence)

### Transformation to Interpretability

**Primary contribution:** Interactive causal intervention + comparative analysis

1. **Selective ablation:** Developer toggles individual heads, entire layers, or FFN blocks off:
   - Head masking: zero out attention weights or set to uniform distribution
   - Layer bypass: skip layer entirely, pass residual stream through unchanged
   - FFN gate clamp: disable feed-forward network in specific layer

2. **Before/after comparison:** Side-by-side display of original output vs ablated output:
   - Unified diff showing changed tokens (color-coded: added/removed/modified)
   - Line-level changes for multi-line code generation
   - Structural changes (AST diff) to show semantic impact

3. **Quantitative impact metrics:**
   - **Token-level change rate:** % tokens that changed after ablation
   - **Semantic similarity:** CodeBLEU, embedding distance (cosine similarity)
   - **Syntactic correctness:** AST parse success (can code be parsed?)
   - **Functional correctness:** Unit tests passed (does code work?)
   - **Static analysis:** ruff/bandit warnings (code quality/security issues)
   - **Δlog-prob:** Change in log-probability of each token

4. **Per-token delta heat:** Visualize Δlog-prob and Δentropy per token:
   - Small multiples showing impact of ablating each of top-k heads
   - Identify most-impactful heads (Δlog-prob ≥ τ_Δ, e.g., 0.1)

5. **Hypothesis testing workflow:**
   - Developer predicts impact before ablation ("I think head [12,5] handles bracket matching")
   - Execute ablation
   - Verify prediction (did brackets break?)
   - Iteratively refine mental model of head roles
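A minimal sketch of the head-masking intervention, operating on nested Python lists in place of framework tensors; a real implementation would instead hook the model's attention modules and re-run generation.

```python
def ablate_head(attn_weights, layer, head, mode="zero"):
    """Return a deep copy of attn_weights with one head ablated.

    attn_weights[layer][head] is a seq x seq attention matrix.
    mode="zero":    the head contributes nothing.
    mode="uniform": the head attends equally everywhere, which keeps
                    outputs in-distribution but uninformative.
    """
    ablated = [[[row[:] for row in head_mat] for head_mat in layer_heads]
               for layer_heads in attn_weights]
    seq_len = len(ablated[layer][head])
    fill = 0.0 if mode == "zero" else 1.0 / seq_len
    ablated[layer][head] = [[fill] * seq_len for _ in range(seq_len)]
    return ablated

# Toy model: 2 layers x 2 heads over a 2-token sequence.
weights = [[[[1.0, 0.0], [0.3, 0.7]] for _ in range(2)] for _ in range(2)]
masked = ablate_head(weights, layer=0, head=1, mode="uniform")
# Only head [0, 1] changes; the original weights are left untouched,
# so before/after outputs can be diffed side by side.
```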

### What Code Generation Decisions It Reveals

**Specific insights for developers:**

1. **Critical heads:** Identify which heads, if removed, break code generation entirely:
   - Example: ablating head [Layer 3, Head 7] causes all bracket matching to fail → this head is critical for syntactic correctness
   - Implication: model relies on specific architectural component for basic syntax

2. **Redundant heads:** Which heads can be removed with minimal impact?
   - Example: ablating head [Layer 25, Head 14] changes only 2% of tokens → this head is redundant
   - Implication: model is over-parameterized (could be pruned for efficiency)

3. **Layer specialization:** Early layers (1-8) handle tokenization/syntax, mid layers (9-20) handle semantics, late layers (21-32) handle coherence?
   - Hypothesis to test via layer bypass ablations
   - Example: bypassing layer 5 breaks indentation; bypassing layer 15 breaks variable scoping

4. **Bug localization:** If ablating head X fixes a bug, that head is likely causing the error:
   - Example: model generates `user.get_name()` (wrong) → ablate head [18,3] → model generates `user.name` (correct)
   - Causal diagnosis: head [18,3] is attending to incorrect API documentation context

### Extension Beyond Existing Literature

**Mechanistic interpretability literature (Wang et al., 2022 on GPT-2 circuits):**
- Focuses on individual mechanisms (e.g., indirect object identification circuit)
- Requires manual circuit discovery by ML researchers (slow, expert-driven)
- Not interactive or developer-facing

**Your extension:**
- **Developer-driven exploration:** Non-experts (software engineers) can perform ablations without ML knowledge
- **Code generation focus:** Ablations tailored to code tasks (syntactic correctness, API usage, variable scoping)
- **Real-time feedback:** Immediate re-generation with ablated model (not batch analysis)
- **Task-oriented ablation:** During bug fixing, developer can ablate to localize error source ("Which component is causing this bug?")

**Bansal et al. (2022): "Rethinking the Role of Scale for In-Context Learning"**
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts

**Your extension:**
- **Interactive ablation:** Developer controls which components to ablate
- **Code-specific metrics:** Unit tests, AST parse, lints (not just perplexity)
- **Hypothesis-driven workflow:** Developer predicts impact before seeing result

### Novel Contribution: Ablation as Debugging Tool

**Gap in literature:** Ablation studies are typically **research tools** (for ML researchers analyzing models), not **developer tools** (for software engineers using models).

**Your contribution:** Reframe ablation as **interactive debugging**:
- "Why did the model generate this bug?" → "Let me turn off components until it works correctly" → identifies faulty component
- This is analogous to debuggers for traditional code (set breakpoints, step through execution)
- But for neural networks: "ablation breakpoints" (turn off heads/layers), "step through architecture" (layer-by-layer pipeline)

**Potential impact:**
- Developers without ML training can perform causal analysis
- Faster bug diagnosis in LLM-generated code
- Insights for model developers (which components are most critical for code generation?)

### Attribution Ground Truth (Methodology)

A source token T_src is "influential" for generated token T_gen if:
1. T_src lies in top-k rollout sources (from Attention Visualization, k=8)
2. Masking the minimal set of heads H that carry attention from T_src → T_gen causes:
   - Δlog-prob ≥ τ_Δ (e.g., 0.1) on T_gen, OR
   - Flip in unit test outcome (pass → fail or vice versa)

This operational definition enables:
- Reproducible measurement of "attribution accuracy"
- Validation of attention-based hypotheses via ablation
- Inter-rater reliability (two researchers apply same criteria)
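The criterion reduces to a small predicate; reading the Δlog-prob condition as a magnitude check is an assumption here, and `tau_delta` carries the illustrative 0.1 from the text.

```python
def is_influential(in_topk_rollout, delta_log_prob, test_flipped,
                   tau_delta=0.1):
    """T_src counts as influential for T_gen when it is a top-k rollout
    source AND masking the carrying heads either shifts T_gen's
    log-prob by at least tau_delta (magnitude check, an assumption)
    or flips a unit-test outcome."""
    return in_topk_rollout and (abs(delta_log_prob) >= tau_delta
                                or test_flipped)

# A large log-prob shift alone is not enough without rollout support.
assert is_influential(True, 0.25, False)
assert not is_influential(False, 0.25, False)
```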

### Developer-Facing Research Questions

**RQ1.9: Ablation-Assisted Debugging**
Can developers without ML expertise successfully use ablation to identify causes of buggy code generation?

**Hypothesis H1.9:** Developers using ablation tool will:
- Correctly identify causal components (head/layer causing bug) in >60% of cases
- Reduce time to diagnose bug by β‰₯25% vs baseline
- Measured by: success rate of causal identification, time to diagnosis

**RQ1.10: Mental Model Formation**
Do developers form accurate mental models of layer/head specialization after using ablation tool?

**Hypothesis H1.10:** After ablation exploration, developers will:
- Correctly categorize heads as syntactic/semantic/positional with >65% accuracy
- Describe layer roles (early=syntax, mid=semantics, late=coherence) with >70% agreement
- Measured by: post-task categorization quiz, qualitative interview themes

**RQ1.11: Iteration Reduction**
Does ablation tool reduce iterations needed to achieve passing solution?

**Hypothesis H1.11 (from spec):** Ablation tool reduces iterations to passing solution by ≥20%
- Measured by: number of prompt modifications + code edits before all unit tests pass

**RQ1.12: Causal vs Descriptive Understanding**
Do developers distinguish between correlation (attention) and causation (ablation)?

**Hypothesis H1.12:** Developers will:
- Request ablation validation for >50% of attention-based hypotheses
- Report understanding that "attention ≠ causation" (>80% agreement in survey)
- Measured by: telemetry (how often developers cross-reference Attention + Ablation), survey responses

---

## 4. Pipeline Visualization

### Opaque Mechanism Addressed

**Layer-by-layer representation transformation** - the "forward pass" through 32 transformer layers where:
- Input embeddings gradually transform into output logits
- Each layer applies: self-attention → FFN → layer norm → residual connection
- Intermediate representations are high-dimensional (hidden_dim = 4096 for Code Llama 7B) and semantically opaque

**Sources of opacity:**
- No visibility into intermediate states (black box from input β†’ output)
- Unclear where "understanding" emerges (early vs late layers?)
- Unknown bottlenecks (which layers struggle most? where does model get confused?)
- Residual connections create complex information flow (not simple feedforward)

### Transformation to Interpretability

**Primary contribution:** Temporal decomposition + interpretable layer-level signals

1. **Layer-by-layer scrubbing:** Timeline UI to "scrub" through layers 0→32, showing how representations evolve:
   - Visualize as swimlane: horizontal axis = layers, vertical axis = tokens
   - Each "swim" represents one token's journey through the architecture
   - Color intensity = uncertainty (entropy) at that layer

2. **Interpretable signals (not raw activations):**
   - **Residual-norm z-scores:** How much each layer changes the representation
     ```
     z_l = (||x_l|| - μ_l) / σ_l
     ```
     - High z → layer is "working hard" (significant transformation)
     - Low z → layer passes information through with minimal change

   - **Entropy shift:** Change in output entropy from pre- to post-layer
     ```
     ΔH_l = H(logits after layer l) - H(logits before layer l)
     ```
     - Negative ΔH → layer reduces uncertainty (good)
     - Positive ΔH → layer increases uncertainty (confusion)

   - **Attention-flow saturation:** % of attention mass concentrated on top-m positions
     ```
     Saturation = ∑(top-m attention weights) / ∑(all attention weights)
     ```
     - High saturation → focused attention (model is certain about sources)
     - Low saturation → diffuse attention (model is uncertain)

   - **Router load (MoE only):** Which experts activate in mixture-of-experts layers
     - Expert IDs + gate weights
     - Imbalance metric (are all experts used equally?)

3. **Swimlane/Timeline view:**
   - Lanes: Tokenizer → Embeddings → Layer 1 → ... → Layer 32 → Logits → Sampler → Post-proc/Tests
   - Rectangle length = time per stage (latency profiling)
   - Color = uncertainty (entropy)
   - Hover = per-stage stats (residual-z, ΔH, saturation, latency)

4. **Bottleneck identification:**
   - Flag layers in top-q percentile (e.g., top 10%) of:
     - Latency (slowest layers)
     - Residual-norm spikes (largest transformations)
     - Entropy jumps (biggest increases in uncertainty)
   - Correlate bottlenecks with sampler behavior (does entropy spike → hallucination?)
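The three intrinsic signals can be sketched directly from their definitions; the inputs below are plain probability lists and scalar norms standing in for real model statistics.

```python
import math

def residual_z(norm, mean_norm, std_norm):
    """Residual-norm z-score: z_l = (||x_l|| - mu_l) / sigma_l."""
    return (norm - mean_norm) / std_norm

def entropy_shift(probs_before, probs_after):
    """Delta H_l: entropy after the layer minus entropy before it;
    negative values mean the layer reduced uncertainty."""
    def h(ps):
        return -sum(p * math.log(p) for p in ps if p > 0)
    return h(probs_after) - h(probs_before)

def saturation(attn_row, m):
    """Fraction of attention mass concentrated on the top-m positions."""
    return sum(sorted(attn_row, reverse=True)[:m]) / sum(attn_row)

# A layer that collapses a near-uniform distribution onto one token:
# strong negative entropy shift, and high saturation on that position.
dh = entropy_shift([0.25] * 4, [0.97, 0.01, 0.01, 0.01])
sat = saturation([0.7, 0.1, 0.1, 0.1], m=1)
```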

### What Code Generation Decisions It Reveals

**Specific insights for developers:**

1. **Emergence of syntax:** At which layer does model "realize" it's generating a function?
   - Likely when indentation pattern appears, `def` keyword generated
   - Measure: residual-norm spike at layer where syntactic structure emerges
   - Example: Layer 5 shows high residual-z when generating `def factorial(n):`

2. **Semantic shift:** Can we observe when model transitions from "reading prompt" (early layers) to "generating code" (late layers)?
   - Early layers: high attention to prompt tokens, low residual-norm
   - Mid layers: residual-norm increases (processing semantics)
   - Late layers: attention shifts to recent generated tokens (auto-regressive generation)

3. **Error propagation:** If model generates bug at token T, can we trace back to which layer introduced the error?
   - Look for entropy spike or residual-norm anomaly in layers before T
   - Example: Model generates wrong variable name at token 50 → entropy jumps at layer 18 → investigate what happened at layer 18

4. **Compute allocation:** Which layers consume most compute? (Implications for model optimization)
   - Latency profiling shows bottleneck layers
   - Pruning candidates: layers with low residual-norm (minimal transformation) + high latency

### Extension Beyond Existing Literature

**Bansal et al. (2022) on in-context learning at 66B scale:**
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts
- Static analysis (not real-time exploration)

**Your extension:**
- **Code-specific annotations:** Label layers with code-relevant milestones:
  - "Layer 8: syntax tree formed"
  - "Layer 20: variable scope resolved"
  - "Layer 28: stylistic formatting applied"
- **Multi-token tracking:** Show pipeline evolution across multiple generated tokens (not just one forward pass)
- **Developer-friendly abstractions:** Avoid technical jargon (hidden states, residual stream) → use "understanding evolution", "decision stages"
- **Comparative pipelines:** Show pipeline for correct vs buggy outputs side-by-side (where do they diverge?)

**Interpretability papers (general):**
- Focus on probing classifiers to test "what does layer X know?"
- Require training additional models (probes)
- Not interactive or real-time

**Your extension:**
- **No additional training:** Use intrinsic signals (residual-norm, entropy)
- **Real-time:** Compute signals during generation (< 10ms overhead)
- **Actionable:** Developer can bypass layers to test hypotheses
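
Both intrinsic signals are cheap to derive from values the forward pass already produces; a minimal sketch, assuming access to the raw next-token logits and to the hidden states entering and leaving a layer:

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in bits) of the next-token distribution,
    computed from raw logits with a numerically stable softmax."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()                 # stabilize exp
    p = np.exp(z) / np.exp(z).sum()
    p = p[p > 0]                    # convention: 0 * log 0 = 0
    return float(-(p * np.log2(p)).sum())

def residual_norm(hidden_in, hidden_out):
    """L2 norm of a layer's change to the residual stream."""
    return float(np.linalg.norm(np.asarray(hidden_out) - np.asarray(hidden_in)))
```

A uniform distribution over 4 tokens gives 2 bits of entropy; a near-deterministic one gives almost 0, so entropy spikes are directly comparable across tokens.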

### Novel Contribution: Layer-Level Taxonomy for Code Generation

**Gap in literature:** No established taxonomy of what each transformer layer does during **code generation** specifically.

- Zheng et al. (2025) survey attention heads, but not layer-level roles
- Interpretability papers focus on language tasks (next-word prediction, sentiment, Q&A)
- Code generation is different: requires syntax, semantics, formatting, executable correctness

**Your contribution:** Empirically identify layer specialization for code:
1. **Layers 1-5: Tokenization + basic syntax**
   - Residual-norm spikes when processing delimiters, keywords
   - Attention focuses on local syntax (brackets, colons)

2. **Layers 6-15: Semantic understanding**
   - Residual-norm increases during identifier resolution
   - Attention to variable declarations, type hints, docstrings
   - Entropy decreases (model becomes more certain about semantics)

3. **Layers 16-25: Reasoning/logic**
   - Residual-norm spikes during control flow generation (if/else, loops)
   - Attention to prompt logic + recent generated code
   - Entropy may increase temporarily (exploring logical alternatives)

4. **Layers 26-32: Fluency/formatting**
   - Low residual-norm (minor refinements)
   - Attention to recent tokens (auto-regressive)
   - Entropy decreases (finalizing token choices)

**If validated, this would be novel for code LLMs and could be Paper 1 contribution.**

### Developer-Facing Research Questions

**RQ1.13: Layer Decision Identification**
Can developers identify at which layer the model "decides" on code structure (e.g., loop vs conditional)?

**Hypothesis H1.13:** Developers using pipeline visualization will:
- Correctly identify decision layer within ±3 layers in >55% of cases
- Report increased understanding of model's "thinking process" (>75% agreement)
- Measured by: layer identification accuracy (ground truth = residual-norm + entropy spike analysis), survey responses

**RQ1.14: Next-Token Prediction Improvement**
Does seeing pipeline evolution improve developers' ability to predict subsequent tokens?

**Hypothesis H1.14 (from spec):** Pipeline summaries improve next-token prediction accuracy
- Developers predict next token after seeing pipeline → compare with baseline (no pipeline)
- Expected improvement: +10-15 percentage points in top-3 accuracy
- Measured by: prediction task (5 examples per participant)
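
Scoring the prediction task reduces to top-k accuracy; a minimal sketch, assuming each trial stores a participant's ranked guesses alongside the token the model actually generated (the data layout is an assumption):

```python
def top_k_accuracy(trials, k=3):
    """Fraction of trials where the true next token appears in the
    participant's top-k ranked guesses.

    trials: list of (ranked_guesses, true_token) pairs.
    """
    hits = sum(1 for guesses, true in trials if true in guesses[:k])
    return hits / len(trials)
```

Comparing this score with and without the pipeline view gives the percentage-point improvement targeted by H1.14.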

**RQ1.15: Error Localization**
Can developers use pipeline visualization to diagnose *where* in the model an error originates?

**Hypothesis H1.15:** Developers will:
- Identify error-causing layer within ±5 layers in >50% of cases
- Reduce time to diagnose error source by ≥20% vs baseline
- Measured by: layer identification accuracy, time to diagnosis

**RQ1.16: Actionable Insights for Prompting**
Can developers use layer knowledge to improve prompts?

**Hypothesis H1.16:** After seeing pipeline, developers will:
- Adjust prompts to provide more context for early layers (syntax/semantics) in >30% of cases
- Report understanding of "what the model needs" (>70% agreement)
- Measured by: prompt modification patterns in telemetry, survey responses

---

## Cross-Cutting Contributions

### 1. Unified Glass-Box Dashboard

**Gap in literature:** Prior work (Kou et al., Paltenghi et al., Zhao et al.) focuses on **single mechanisms** in isolation.

**Your dashboard integrates:**
- **Attention** (spatial attribution)
- **Token Size & Confidence** (probabilistic uncertainty + tokenization)
- **Ablation** (causal attribution)
- **Pipeline** (temporal evolution)

**Developer can triangulate across multiple lenses:**
- Example: "Low confidence + scattered attention + early-layer bottleneck → likely hallucination"
- Example: "High confidence + focused attention, but ablating head X fixes the bug → head X is overriding correct information"

**This holistic view is novel for code generation interpretability.**

### 2. Task-Based Developer Study

**Gap:** Most interpretability papers evaluate on:
- Synthetic tasks (toy models, simple examples)
- Researcher-driven analysis (no end-users)
- Post-hoc metrics (accuracy, perplexity)

**Your study evaluates with:**
- **~10 software engineers** doing realistic code tasks (bug detection, code review, prompt optimization)
- **In-the-loop**: Developers use visualizations during the tasks (not passive observation)
- **Actionable interpretability**: Measure whether visualizations improve task performance (time, accuracy, trust)

**This is HCI-grounded interpretability research**, not just ML analysis.

### 3. Code Generation Domain Specificity

**Gap:** Explainability surveys (Zhao et al.) are domain-agnostic. Code has unique properties:
- **Syntactic correctness is binary** (parsable or not) → enables AST-based metrics
- **Semantic correctness is testable** (unit tests) → enables test-based metrics
- **Developer expertise varies** (junior vs senior) → enables expertise-based analysis
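
The binary syntax check is directly implementable with Python's standard-library `ast` module; a minimal sketch:

```python
import ast

def is_syntactically_valid(code: str) -> bool:
    """Binary syntactic-correctness check: does the snippet parse?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```

This gives a ground-truth label per generated snippet at essentially no cost, which is what makes AST-based metrics practical in the study.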

**Your visualizations tailored to code:**
- **Syntax highlighting** in attention maps (keywords, identifiers, operators color-coded)
- **Tokenization awareness** for identifiers (rare in NLP interpretability)
- **Ablation targeting code-specific heads** (bracket matching, indentation, API usage)
- **Pipeline stages mapped to code generation phases** (syntax → semantics → logic → formatting)

### 4. Interventionist Interpretability

**Gap:** Most explainability tools are **passive** (show model behavior).

Your dashboard is **active**:
- **Ablation allows causal intervention** ("What if I remove this head?")
- **Confidence allows alternative exploration** ("What else could the model have generated?")
- **Pipeline allows temporal investigation** ("Where did the model's understanding emerge?")

**Developers don't just observe: they manipulate and test hypotheses.**

**This is closer to scientist-model interaction (hypothesis-driven) than user-model consumption (passive).**

---

## Literature Positioning Summary

| Your Contribution | Related Work | Gap You Address |
|-------------------|--------------|-----------------|
| **Attention Viz** | Kou et al. (2024) - attention alignment | Interactive, per-head, code-specific, hypothesis-driven |
| **Token Confidence** | Zhao et al. (2024) - prob explanations | Tokenization awareness, code thresholds, bug prediction |
| **Ablation Viz** | Wang et al. (2022) - mechanistic interpretability | Developer-facing, real-time, code metrics (tests/AST) |
| **Pipeline Viz** | Bansal et al. (2022) - layer interventions | Code-specific stages, interpretable signals, interactive |
| **Unified Dashboard** | - | First multi-mechanism glass-box for code LLMs |
| **Developer Study** | Paltenghi et al. (2022) - eye-tracking | Task-based, in-the-loop, actionable metrics |
| **Code Specificity** | - | Syntax/test metrics, tokenization, developer expertise |
| **Interventionist** | - | Ablation, alternatives, hypothesis testing |

---

## Thesis Structure Suggestions

### Chapter 1: Introduction
- **Motivation:** Developers treat LLMs as black boxes → trust issues, debugging difficulties
- **Gap:** Prior work lacks interactive, developer-facing, multi-mechanism dashboards for code
- **Contribution:** First glass-box dashboard integrating 4 interpretability lenses + developer study

### Chapter 2: Literature Review
- **Section 2.1:** Attention in LLMs (Zheng et al., Kou et al.)
- **Section 2.2:** Explainability methods (Zhao et al.)
- **Section 2.3:** Code generation LLMs (Bistarelli et al.)
- **Section 2.4:** Developer-AI interaction (Paltenghi et al.)
- **Section 2.5:** Mechanistic interpretability (Wang et al., Bansal et al.)

### Chapter 3: Methodology (RQ1 Focus)
- **Section 3.1:** Attention Visualization
- **Section 3.2:** Token Size & Confidence Visualization
- **Section 3.3:** Ablation Visualization
- **Section 3.4:** Pipeline Visualization
- **Section 3.5:** Dashboard Integration

### Chapter 4: User Study Design
- **Section 4.1:** Participants (n=18-24 software engineers)
- **Section 4.2:** Tasks (T1, T2, T3)
- **Section 4.3:** Metrics (quantitative + qualitative)
- **Section 4.4:** Protocol (within-subjects, Latin square)

### Chapter 5: Results
- **Section 5.1:** RQ1.1-RQ1.4 (Attention)
- **Section 5.2:** RQ1.5-RQ1.8 (Token Confidence)
- **Section 5.3:** RQ1.9-RQ1.12 (Ablation)
- **Section 5.4:** RQ1.13-RQ1.16 (Pipeline)
- **Section 5.5:** Cross-Cutting Themes

### Chapter 6: Discussion
- **Section 6.1:** Interpretability for Developers (not just researchers)
- **Section 6.2:** Code-Specific Insights (tokenization, syntax, tests)
- **Section 6.3:** Limitations & Future Work

### Chapter 7: Conclusion
- **Summary of Contributions**
- **Implications for Practice** (tool design for developers)
- **Implications for Research** (novel layer taxonomy, ablation as debugging)

---

## ICML Paper 1 Suggestions

**Title:** "Making Transformer Architecture Transparent for Code Generation: A Developer-Centric Study"

**Abstract Structure:**
1. **Problem:** Developers use code LLMs as black boxes → trust/debugging issues
2. **Gap:** Prior interpretability work not developer-facing or code-specific
3. **Solution:** Glass-box dashboard with 4 visualizations (Attention, Token Confidence, Ablation, Pipeline)
4. **Study:** n=18-24 software engineers on 3 code tasks
5. **Results:** (placeholder for actual results)
   - Attention viz improves source identification (H1-Attn)
   - Token confidence flags predict bugs (H2-Tok, AUC ≥ 0.70)
   - Ablation reduces debugging iterations (H3-Abl, -20%)
   - Pipeline improves error localization (H4-Pipe)
6. **Contribution:** First empirical evidence that multi-mechanism interpretability tools improve developer performance on code tasks
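
The H2-Tok result can be scored without any ML library: AUC equals the probability that a randomly chosen buggy token receives a higher flag score than a randomly chosen correct one (the Mann-Whitney identity). A minimal sketch, assuming scalar flag scores and binary bug labels:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the fraction of (buggy, correct) pairs where the buggy token has
    the higher flag score; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Perfect separation gives 1.0 and chance gives 0.5, so the H2-Tok target of ≥ 0.70 sits comfortably between the two.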

**Sections:**
1. Introduction
2. Related Work
3. Dashboard Design (4 visualizations)
4. User Study
5. Results
6. Discussion
7. Conclusion

**Target:** ICML 2026 (submission ~January 2026)

---

**End of RQ1 Mapping Document**