# RQ1 Mapping: How Each Visualization Addresses Architectural Transparency
**Research Question 1:** "How can we transform opaque architectural mechanisms (multi-head attention, feed-forward networks, mixture-of-experts routing) into interpretable visual representations that reveal how LLMs make code generation decisions?"
**Document Version:** 1.0
**Date:** 2025-11-01
**Author:** Gary Boon, Northumbria University
---
## Executive Summary
This document maps each of the 4 visualizations (Attention, Token Size & Confidence, Ablation, Pipeline) to RQ1, explaining:
1. What opaque mechanism each visualization addresses
2. How it transforms that mechanism into an interpretable representation
3. What code generation decisions it reveals
4. How it extends beyond existing literature
5. Specific research sub-questions for the user study
---
## 1. Attention Visualization (QKV Explorer)
### Opaque Mechanism Addressed
**Multi-head self-attention** - the fundamental mechanism by which transformers weight input tokens when generating each output token.
**Sources of opacity:**
- 32+ heads operating in parallel (Code Llama 7B has 32 heads × 32 layers = 1,024 attention heads)
- High-dimensional attention score matrices (seq_length × seq_length per head)
- Non-interpretable weight distributions across heads
- Unclear semantic specialization of individual heads
### Transformation to Interpretability
**Primary contribution:** Spatial decomposition + interactive querying
1. **Head-level decomposition:** Display each attention head's behavior separately, allowing identification of specialized roles:
   - Syntactic heads focusing on matching brackets, indentation
   - Semantic heads attending to variable definitions, type hints
   - Positional heads capturing code structure (function boundaries, control flow)
2. **Token-to-token attribution:** Interactive heat maps showing which prompt tokens each generated code token attends to, with normalized attention weights (0-1 scale):
   - Rows = generated tokens
   - Columns = prompt + context tokens
   - Heat intensity = attention weight
   - Hover = exact weights + source spans
3. **Attention rollout:** Composition of attention across layers (Abnar & Zuidema-style rollout) to show information flow from input to output:
```
A_rollout = A_L × A_(L-1) × ... × A_1
```
This reveals which input tokens contribute to each output token through the entire network stack.
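As a concrete reference, here is a minimal rollout sketch over the per-layer attention tensors returned by a Hugging Face forward pass. The model name, the prompt, and the 0.5 residual-mixing heuristic are illustrative assumptions, not fixed choices of the dashboard.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model/prompt; any causal LM that returns attentions works.
tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

inputs = tok("def factorial(n):", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)
# out.attentions: tuple of per-layer tensors, each (batch, heads, seq, seq)

def attention_rollout(attentions):
    """Compose head-averaged attention across layers: A_L x ... x A_1."""
    rollout = None
    for layer_attn in attentions:                 # ordered layer 1 .. L
        a = layer_attn[0].mean(dim=0)             # average heads -> (seq, seq)
        a = 0.5 * a + 0.5 * torch.eye(a.size(0))  # mix in the residual path
        a = a / a.sum(dim=-1, keepdim=True)       # re-normalise each row
        rollout = a if rollout is None else a @ rollout
    return rollout                                # row i = input credit for token i

R = attention_rollout(out.attentions)
top_sources = R[-1].topk(8)                       # top-8 sources for the last token
```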
4. **Head role grid:** Layer × Head matrix with mini-sparklines showing mean attention to token classes:
   - Delimiters (brackets, colons, commas)
   - Identifiers (variable names, function names)
   - Keywords (def, class, if, for)
   - Comments (docstrings)
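A hedged sketch of how the grid values could be computed: mean attention mass each (layer, head) pays to one token class. The token-class rules here are a naive illustration; a real implementation would classify tokens with a lexer.
```python
import keyword
import torch

DELIMITERS = set("()[]{}:,.")

def token_class(text):
    t = text.strip()
    if t in DELIMITERS:
        return "delimiter"
    if keyword.iskeyword(t):
        return "keyword"
    if t.isidentifier():
        return "identifier"
    return "other"

def head_class_profile(attentions, token_texts, cls="delimiter"):
    """Return a (layers, heads) grid of mean attention mass sent to `cls` tokens."""
    mask = torch.tensor([token_class(t) == cls for t in token_texts], dtype=torch.float)
    rows = []
    for layer_attn in attentions:        # each (batch, heads, seq, seq)
        mass = layer_attn[0] @ mask      # (heads, seq): class mass per query token
        rows.append(mass.mean(dim=-1))   # average over query positions
    return torch.stack(rows)             # one sparkline value per (layer, head)
```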
### What Code Generation Decisions It Reveals
**Specific insights for developers:**
1. **Identifier resolution:** When the model generates `user.name`, which prior prompt tokens did it attend to?
   - Expected: variable declaration `user = User(...)`, type hints `user: User`, docstrings describing the user object
   - Misalignment: over-attending to recent tokens (recency bias) instead of the declaration site
2. **Syntactic correctness:** Do specific heads focus on bracket matching, indentation patterns, or control flow structure?
   - Example: Head [Layer 5, Head 3] might specialize in matching opening/closing brackets
   - Example: Head [Layer 8, Head 12] might attend to indentation levels for syntactic consistency
3. **Context utilization:** Is the model actually "reading" the prompt context, or over-attending to recent tokens?
   - Recency bias indicator: >70% attention mass on the last 5 tokens (see the metric sketch below)
   - Long-range dependency: attention to tokens >100 positions back
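A tiny illustrative metric for the recency indicator; the 0.70 / last-5 thresholds mirror the values above.
```python
def recency_mass(attn_row, k=5):
    """Fraction of one token's attention mass on the last k context positions."""
    return float(attn_row[-k:].sum() / attn_row.sum())

# flagged = recency_mass(row) > 0.70
```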
4. **Error attribution:** When buggy code is generated, can we trace it to misaligned attention?
   - Example: Model generates `user.get_name()` but should be `user.name` → attention shows the model attended to an API doc snippet instead of the variable declaration
   - Example: Model generates an incorrect variable name → attention shows the model confused two similar identifiers in context
### Extension Beyond Existing Literature
**Kou et al. (2024): "Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?"**
- Showed attention misalignment with human programmers
- Used aggregate metrics (averaged across heads/layers)
- Post-hoc analysis (no interactive exploration)
- Passive comparison (developers not in control)
**Your extension:**
- **Interactive head selection:** Developer chooses which head/layer to inspect in real-time
- **Code-specific annotations:** Highlight syntactic elements (keywords, identifiers, operators) with domain-specific color coding
- **Counterfactual queries:** "What if I remove this docstring? How does attention redistribute?"
- **Task-embedded evaluation:** Developers use the tool during actual code review tasks (bug detection, prompt optimization), not just correlation studies
**Paltenghi et al. (2022): "Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration"**
- Eye-tracking study comparing developer attention to model attention
- Focus on code exploration, not generation
- No interactive visualization for developers
**Your extension:**
- **Generative focus:** Attention during code generation, not just comprehension
- **Interactive tool:** Developers manipulate and query attention, not just observe
- **Causal validation:** Attention hypotheses validated via ablation (Section 3)
**Zheng et al. (2025): "Attention Heads of Large Language Models: A Survey"**
- Taxonomy of attention head discovery methods:
  1. Model-free (saliency, gradient-based)
  2. Modeling-required (probing classifiers)
- Primarily for ML researchers analyzing models
**Your positioning:**
- **Model-free + developer-in-the-loop:** No additional training, but leverages human domain expertise for interpretation
- **Novel category:** "Developer-driven interpretability" - non-ML-experts can explore attention patterns and form hypotheses about head roles
### Developer-Facing Research Questions
**RQ1.1: Head Role Discovery**
Can developers identify which attention heads are responsible for syntactic correctness vs semantic coherence?
**Hypothesis H1.1:** Developers using the attention visualization will correctly identify:
- Syntactic heads (bracket matching, indentation) with >70% accuracy
- Semantic heads (identifier resolution, type inference) with >60% accuracy
- Measured by: agreement with ground-truth head roles (established via ablation studies)
**RQ1.2: Error Prediction**
Does seeing attention distributions improve developers' ability to predict model errors?
**Hypothesis H1.2:** Developers with the attention visualization will:
- Predict buggy outputs 25% faster than baseline
- Increase bug detection accuracy by ≥15 percentage points
- Measured by: time to flag suspicious tokens, precision/recall of bug predictions
**RQ1.3: Attention-Expectation Alignment**
How do developers' attention expectations differ from model attention patterns?
**Hypothesis H1.3:** Developers will report misalignment in:
- >40% of generated tokens (model attends to unexpected sources)
- Especially for API usage and rare identifiers
- Measured by: developer annotations of "surprising" attention patterns + post-task interviews
**RQ1.4: Recency Bias Awareness**
Can developers identify when the model exhibits recency bias (over-attending to recent tokens)?
**Hypothesis H1.4:** With recency bias flags (>70% attention on the last 5 tokens), developers will:
- Correctly identify recency bias cases with >80% accuracy
- Adjust prompts to mitigate bias in >50% of cases
- Measured by: flag accuracy vs ground truth, prompt modification patterns
---
## 2. Token Size & Confidence Visualization
### Opaque Mechanism Addressed
**Probability distribution over vocabulary** at each decoding step + **tokenization granularity**
**Sources of opacity:**
- 32K-50K vocabulary size (Code Llama) making the full distribution uninterpretable
- Softmax scores calibrated to the model's training distribution, not developer confidence
- Tokenization artifacts:
  - `"user"` tokenized as one token vs `"username"` as two tokens `["user", "name"]`
  - Rare identifiers split into nonsensical subwords: `"pytorch"` → `["py", "tor", "ch"]`
- Hidden relationship between entropy and actual error likelihood
### Transformation to Interpretability
**Primary contribution:** Uncertainty quantification + token granularity exposure
1. **Per-token confidence scores:** Display top-k alternatives with probabilities:
```
"for" at 0.89
"while" at 0.07
"if" at 0.03
```
This shows the model's uncertainty and plausible alternatives.
2. **Entropy-based uncertainty:** Shannon entropy as a proxy for model uncertainty (an executable sketch follows the bullets below):
```
H = -∑ p_i log(p_i)
```
- High entropy = many plausible alternatives (model is guessing)
- Low entropy = one clear choice (model is confident)
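A minimal sketch computing both signals from one decoding step's logits; the function and variable names are illustrative.
```python
import torch
import torch.nn.functional as F

def token_confidence(logits, k=3):
    """Return (entropy in nats, top-k [(token_id, prob), ...]) for one step."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum().item()
    top = probs.topk(k)
    return entropy, list(zip(top.indices.tolist(), top.values.tolist()))
```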
3. **Tokenization visibility:** Show exact token boundaries (BPE/SentencePiece splits) to reveal when the model is uncertain due to subword chunking:
   - Visual: token chips with width proportional to byte length
   - Chip color/opacity reflects confidence (desaturated = low confidence)
   - Example: `get_user_data` might be tokenized as `["get", "_user", "_data"]` (3 tokens) vs `["get_user_data"]` (1 token)
4. **Hallucination risk indicators:** Flag tokens with high entropy + low maximum probability:
   - Entropy ≥ τ_H (e.g., 1.5 nats)
   - Max probability < 0.5
   - This indicates the model is "guessing" with no clear preference
5. **Risk hotspot flags:** Identifiers split into ≥3 subwords AND an entropy peak (sketched below):
   - These are statistically more likely to be bugs (to be validated in the user study)
   - Example: `process_user_data` → `["process", "_user", "_data"]` with H = 1.8 nats → FLAG
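A hedged encoding of the hotspot rule; `tok` is any Hugging Face tokenizer, and the `tau_h` / `min_splits` defaults mirror the illustrative thresholds above.
```python
def hotspot_flag(identifier, entropy_nats, tok, tau_h=1.5, min_splits=3):
    """Flag identifiers split into >= min_splits subwords at an entropy peak."""
    return len(tok.tokenize(identifier)) >= min_splits and entropy_nats >= tau_h

# hotspot_flag("process_user_data", 1.8, tok)  # True if split >= 3 ways
```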
### What Code Generation Decisions It Reveals
**Specific insights for developers:**
1. **Variable naming:** When the model generates `usr` vs `user`, was this a high-confidence choice or an arbitrary selection from similar alternatives?
   - Check top-k: if `["usr": 0.51, "user": 0.48]` → the model is uncertain
   - Check entropy: if H = 1.2 nats → borderline uncertainty
   - Developer can manually select the preferred alternative
2. **API usage:** Does the model confidently predict correct method names (e.g., `.append()`) or waver between alternatives (`.add()`, `.push()`, `.insert()`)?
   - Low confidence on API calls → likely hallucination or incorrect usage
   - High confidence on an incorrect API → the model has learned a wrong pattern (training data issue)
3. **Tokenization mismatches:** Does splitting `process_data` into `["process", "_data"]` vs `["process_", "data"]` affect model confidence?
   - Hypothesis: multi-split identifiers correlate with lower confidence
   - Mechanism: the model's vocabulary doesn't contain the full identifier, so it reconstructs it from subwords
   - Developer insight: use simpler identifiers (fewer underscores, camelCase) for better model confidence
4. **Implicit assumptions:** High confidence on incorrect code suggests the model has learned wrong patterns:
   - Example: model generates `list.append(x)` with 0.95 confidence, but `list` is actually a numpy array (should be `np.append(list, x)`)
   - This reveals the model's training data bias (more Python lists than numpy arrays in the training set)
### Extension Beyond Existing Literature
**Zhao et al. (2024): "Explainability for Large Language Models: A Survey"**
- Covers probability-based explanations but mostly:
  - Aggregate metrics (perplexity, log-likelihood)
  - Not code-specific
  - No tokenization awareness
**Your extension:**
- **Code-aware thresholds:** Calibrate "low confidence" thresholds specifically for code tokens:
  - Keywords (def, class) typically high confidence
  - Identifiers vary (common names high, rare names low)
  - Operators high confidence
  - A different threshold τ_H for each category
- **Tokenization pedagogy:** Educate developers on how BPE affects the model's "view" of code:
  - Most code LLM papers (Bistarelli et al., 2025 review) ignore tokenization effects
  - Developers are rarely aware that identifier choice affects tokenization
  - Your tool makes this visible → a potential prompt engineering insight
- **Alternative exploration:** Let developers click on low-confidence tokens to see *why* alternatives were plausible:
  - Show an attention snippet: which context tokens justified each alternative?
  - Link to the Attention visualization for deeper investigation
- **Real-time confidence:** Stream confidence scores during generation, not just post-hoc analysis:
  - Developer can interrupt generation if confidence drops below a threshold
  - Useful for interactive coding assistants
### Novel Contribution: Tokenization × Confidence Interaction
**Gap in literature:** Most code generation papers ignore tokenization effects. But:
- `variable_name` (snake_case) vs `variableName` (camelCase) are tokenized differently → different confidence profiles
- Short vs long identifier names have different entropy characteristics
- Rare API names may be split into nonsensical subwords → low confidence
**Your visualization makes this visible** - potentially novel for code LLM research.
**Hypothesis:** Multi-split identifiers (≥3 subwords) + entropy peaks predict bugs better than entropy alone.
### Developer-Facing Research Questions
**RQ1.5: Confidence-Based Bug Detection**
Can developers use token confidence to identify likely bugs faster than code inspection alone?
**Hypothesis H1.5:** Developers with the confidence visualization will:
- Identify bugs 20% faster than baseline
- Increase bug detection precision by ≥10 percentage points
- Measured by: time to identify the bug, precision/recall of bug locations
**RQ1.6: Tokenization Awareness**
Does seeing tokenization boundaries change developers' prompt engineering strategies?
**Hypothesis H1.6:** After using the token size visualization, developers will:
- Report increased awareness of tokenization (>70% agree in post-survey)
- Adjust identifier naming in prompts (>40% of participants)
- Measured by: survey responses, prompt modification patterns in telemetry
**RQ1.7: Confidence Calibration**
Do high-confidence errors undermine trust more than low-confidence errors?
**Hypothesis H1.7:** Developers will report:
- Lower trust when high-confidence predictions are wrong (≥1 point on a 7-point scale)
- Appropriate trust calibration when confidence aligns with correctness
- Measured by: Brier score (calibration metric), trust survey responses
**RQ1.8: Bug-Risk AUC**
Do entropy × token-size hotspot flags predict actual bug locations?
**Hypothesis H1.8 (from spec):** AUC ≥ 0.70 for the hotspot predictor vs actual bug locations
- Measured by: ROC curve analysis, ground truth = unit test failures + manual bug annotations
---
## 3. Ablation Visualization
### Opaque Mechanism Addressed
**Causal attribution of model components** - specifically:
- Which attention heads are critical vs redundant?
- Which layers perform feature extraction vs reasoning?
- Which feed-forward networks (FFN) contribute to code-specific decisions?
**Sources of opacity:**
- Distributed computation across 32 layers × 32 heads = 1,024 attention heads (Code Llama 7B)
- Non-linear interactions between components (head X in layer Y may depend on head Z in layer W)
- Unclear redundancy: can the model compensate if one head is removed?
- Black-box causality: correlation (attention weights) ≠ causation (actual influence)
### Transformation to Interpretability
**Primary contribution:** Interactive causal intervention + comparative analysis
1. **Selective ablation:** Developer toggles individual heads, entire layers, or FFN blocks off (see the sketch below):
   - Head masking: zero out attention weights or set them to a uniform distribution
   - Layer bypass: skip the layer entirely, pass the residual stream through unchanged
   - FFN gate clamp: disable the feed-forward network in a specific layer
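A hedged sketch of head masking via a forward pre-hook, assuming a LLaMA-style block where per-head outputs are concatenated before `o_proj`. The module path, `head_dim=128` (4096/32 for Code Llama 7B), and the layer/head indices are model-dependent assumptions.
```python
import torch

def ablate_head(model, layer_idx, head_idx, head_dim=128):
    # Zero one head's slice of the pre-projection hidden state, so its
    # contribution to the block output is dropped.
    o_proj = model.model.layers[layer_idx].self_attn.o_proj

    def zero_head(module, args):
        hidden = args[0].clone()                    # (batch, seq, heads * head_dim)
        start = head_idx * head_dim
        hidden[..., start:start + head_dim] = 0.0   # drop this head's contribution
        return (hidden,)

    return o_proj.register_forward_pre_hook(zero_head)

handle = ablate_head(model, layer_idx=18, head_idx=3)
# ... re-generate and diff the outputs ...
handle.remove()                                     # restore the unablated model
```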
2. **Before/after comparison:** Side-by-side display of the original output vs the ablated output:
   - Unified diff showing changed tokens (color-coded: added/removed/modified)
   - Line-level changes for multi-line code generation
   - Structural changes (AST diff) to show semantic impact
3. **Quantitative impact metrics** (a sketch of the two simplest follows below):
   - **Token-level change rate:** % of tokens that changed after ablation
   - **Semantic similarity:** CodeBLEU, embedding distance (cosine similarity)
   - **Syntactic correctness:** AST parse success (can the code be parsed?)
   - **Functional correctness:** Unit tests passed (does the code work?)
   - **Static analysis:** ruff/bandit warnings (code quality/security issues)
   - **Δlog-prob:** Change in the log-probability of each token
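Illustrative versions of the two simplest metrics; the others (CodeBLEU, unit tests, ruff/bandit) would wrap external tools.
```python
import ast
import difflib

def token_change_rate(orig_tokens, ablated_tokens):
    """Rough fraction of tokens changed, via sequence similarity."""
    return 1.0 - difflib.SequenceMatcher(a=orig_tokens, b=ablated_tokens).ratio()

def parses(code):
    """AST parse success: can the generated code be parsed at all?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```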
4. **Per-token delta heat:** Visualize Δlog-prob and Δentropy per token:
   - Small multiples showing the impact of ablating each of the top-k heads
   - Identify the most impactful heads (Δlog-prob ≥ τ_Δ, e.g., 0.1)
5. **Hypothesis testing workflow:**
   - Developer predicts the impact before ablation ("I think head [12,5] handles bracket matching")
   - Execute the ablation
   - Verify the prediction (did brackets break?)
   - Iteratively refine the mental model of head roles
### What Code Generation Decisions It Reveals
**Specific insights for developers:**
1. **Critical heads:** Identify which heads, if removed, break code generation entirely:
   - Example: ablating head [Layer 3, Head 7] causes all bracket matching to fail → this head is critical for syntactic correctness
   - Implication: the model relies on a specific architectural component for basic syntax
2. **Redundant heads:** Which heads can be removed with minimal impact?
   - Example: ablating head [Layer 25, Head 14] changes only 2% of tokens → this head is redundant
   - Implication: the model is over-parameterized (could be pruned for efficiency)
3. **Layer specialization:** Do early layers (1-8) handle tokenization/syntax, mid layers (9-20) semantics, and late layers (21-32) coherence?
   - Hypothesis to test via layer-bypass ablations
   - Example: bypassing layer 5 breaks indentation; bypassing layer 15 breaks variable scoping
4. **Bug localization:** If ablating head X fixes a bug, that head is likely causing the error:
   - Example: model generates `user.get_name()` (wrong) → ablate head [18,3] → model generates `user.name` (correct)
   - Causal diagnosis: head [18,3] is attending to incorrect API documentation context
### Extension Beyond Existing Literature
**Mechanistic interpretability literature (Wang et al., 2022 on GPT-2 circuits):**
- Focuses on individual mechanisms (e.g., the indirect object identification circuit)
- Requires manual circuit discovery by ML researchers (slow, expert-driven)
- Not interactive or developer-facing
**Your extension:**
- **Developer-driven exploration:** Non-experts (software engineers) can perform ablations without ML knowledge
- **Code generation focus:** Ablations tailored to code tasks (syntactic correctness, API usage, variable scoping)
- **Real-time feedback:** Immediate re-generation with the ablated model (not batch analysis)
- **Task-oriented ablation:** During bug fixing, the developer can ablate to localize the error source ("Which component is causing this bug?")
**Bansal et al. (2022): "Rethinking the Role of Scale for In-Context Learning"**
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts
**Your extension:**
- **Interactive ablation:** Developer controls which components to ablate
- **Code-specific metrics:** Unit tests, AST parse, lints (not just perplexity)
- **Hypothesis-driven workflow:** Developer predicts the impact before seeing the result
### Novel Contribution: Ablation as Debugging Tool
**Gap in literature:** Ablation studies are typically **research tools** (for ML researchers analyzing models), not **developer tools** (for software engineers using models).
**Your contribution:** Reframe ablation as **interactive debugging**:
- "Why did the model generate this bug?" → "Let me turn off components until it works correctly" → identifies the faulty component
- This is analogous to debuggers for traditional code (set breakpoints, step through execution)
- But for neural networks: "ablation breakpoints" (turn off heads/layers), "step through architecture" (layer-by-layer pipeline)
**Potential impact:**
- Developers without ML training can perform causal analysis
- Faster bug diagnosis in LLM-generated code
- Insights for model developers (which components are most critical for code generation?)
### Attribution Ground Truth (Methodology)
A source token T_src is "influential" for a generated token T_gen if:
1. T_src lies in the top-k rollout sources (from the Attention Visualization, k=8)
2. Masking the minimal set of heads H that carry attention from T_src → T_gen causes:
   - Δlog-prob ≥ τ_Δ (e.g., 0.1) on T_gen, OR
   - A flip in the unit test outcome (pass → fail or vice versa)
This operational definition enables:
- Reproducible measurement of "attribution accuracy"
- Validation of attention-based hypotheses via ablation
- Inter-rater reliability (two researchers apply the same criteria)
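A direct encoding of the definition above; the inputs are assumed to come from the rollout (top-k membership) and the ablation run (Δlog-prob, test outcome).
```python
def is_influential(in_topk_rollout, delta_logprob, test_flipped, tau_delta=0.1):
    """Operational influence criterion for (T_src, T_gen) pairs."""
    return in_topk_rollout and (delta_logprob >= tau_delta or test_flipped)
```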
### Developer-Facing Research Questions
**RQ1.9: Ablation-Assisted Debugging**
Can developers without ML expertise successfully use ablation to identify causes of buggy code generation?
**Hypothesis H1.9:** Developers using the ablation tool will:
- Correctly identify the causal component (head/layer causing the bug) in >60% of cases
- Reduce time to diagnose the bug by ≥25% vs baseline
- Measured by: success rate of causal identification, time to diagnosis
**RQ1.10: Mental Model Formation**
Do developers form accurate mental models of layer/head specialization after using the ablation tool?
**Hypothesis H1.10:** After ablation exploration, developers will:
- Correctly categorize heads as syntactic/semantic/positional with >65% accuracy
- Describe layer roles (early=syntax, mid=semantics, late=coherence) with >70% agreement
- Measured by: post-task categorization quiz, qualitative interview themes
**RQ1.11: Iteration Reduction**
Does the ablation tool reduce the iterations needed to reach a passing solution?
**Hypothesis H1.11 (from spec):** The ablation tool reduces iterations to a passing solution by ≥20%
- Measured by: number of prompt modifications + code edits before all unit tests pass
**RQ1.12: Causal vs Descriptive Understanding**
Do developers distinguish between correlation (attention) and causation (ablation)?
**Hypothesis H1.12:** Developers will:
- Request ablation validation for >50% of attention-based hypotheses
- Report understanding that "attention ≠ causation" (>80% agreement in survey)
- Measured by: telemetry (how often developers cross-reference Attention + Ablation), survey responses
---
## 4. Pipeline Visualization
### Opaque Mechanism Addressed
**Layer-by-layer representation transformation** - the "forward pass" through 32 transformer layers where:
- Input embeddings gradually transform into output logits
- Each layer applies self-attention then an FFN, each wrapped in layer norm and a residual connection
- Intermediate representations are high-dimensional (hidden_dim = 4096 for Code Llama 7B) and semantically opaque
**Sources of opacity:**
- No visibility into intermediate states (black box from input → output)
- Unclear where "understanding" emerges (early vs late layers?)
- Unknown bottlenecks (which layers struggle most? where does the model get confused?)
- Residual connections create complex information flow (not simple feedforward)
### Transformation to Interpretability
**Primary contribution:** Temporal decomposition + interpretable layer-level signals
1. **Layer-by-layer scrubbing:** Timeline UI to "scrub" through layers 0→32, showing how representations evolve:
   - Visualize as a swimlane: horizontal axis = layers, vertical axis = tokens
   - Each "swim" represents one token's journey through the architecture
   - Color intensity = uncertainty (entropy) at that layer
2. **Interpretable signals (not raw activations)** (a combined sketch follows below):
   - **Residual-norm z-scores:** How much each layer changes the representation
```
z_l = (||x_l|| - μ_l) / σ_l
```
   - High z → the layer is "working hard" (significant transformation)
   - Low z → the layer passes information through with minimal change
   - **Entropy shift:** Change in output entropy from pre- to post-layer
```
ΔH_l = H(logits after layer l) - H(logits before layer l)
```
   - Negative ΔH → the layer reduces uncertainty (good)
   - Positive ΔH → the layer increases uncertainty (confusion)
   - **Attention-flow saturation:** % of attention mass concentrated on the top-m positions
```
Saturation = ∑(top-m attention weights) / ∑(all attention weights)
```
   - High saturation → focused attention (model is certain about sources)
   - Low saturation → diffuse attention (model is uncertain)
   - **Router load (MoE only):** Which experts activate in mixture-of-experts layers
      - Expert IDs + gate weights
      - Imbalance metric (are all experts used equally?)
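A hedged sketch of the first three signals, reusing the `model`/`inputs` setup from the rollout sketch in Section 1 and a logit-lens style readout (final norm + LM head applied to intermediate states). `mu`/`sigma` are per-layer calibration statistics from a reference corpus, assumed to be precomputed.
```python
import torch
import torch.nn.functional as F

out = model(**inputs, output_hidden_states=True)
hidden = out.hidden_states                    # tuple of L+1 tensors (batch, seq, dim)

def residual_z(hidden_states, mu, sigma):
    """Per-layer z-score of the last token's residual-stream norm."""
    norms = torch.stack([h[0, -1].norm() for h in hidden_states])
    return (norms - mu) / sigma

def layer_entropy(h):
    """Entropy of the next-token distribution read out at one layer (logit lens)."""
    logits = model.lm_head(model.model.norm(h))
    p = F.softmax(logits[0, -1], dim=-1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum().item()

# ΔH_l for every layer transition
delta_H = [layer_entropy(hidden[l + 1]) - layer_entropy(hidden[l])
           for l in range(len(hidden) - 1)]

def saturation(attn_row, m=5):
    """Share of one token's attention mass on its top-m source positions."""
    return float(attn_row.topk(m).values.sum() / attn_row.sum())
```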
3. **Swimlane/Timeline view:**
   - Lanes: Tokenizer → Embeddings → Layer 1 → ... → Layer 32 → Logits → Sampler → Post-proc/Tests
   - Rectangle length = time per stage (latency profiling)
   - Color = uncertainty (entropy)
   - Hover = per-stage stats (residual-z, ΔH, saturation, latency)
4. **Bottleneck identification** (a sketch of the percentile flag follows below):
   - Flag layers in the top-q percentile (e.g., top 10%) of:
      - Latency (slowest layers)
      - Residual-norm spikes (largest transformations)
      - Entropy jumps (biggest increases in uncertainty)
   - Correlate bottlenecks with sampler behavior (does an entropy spike → hallucination?)
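An illustrative top-q percentile flag over any per-layer signal (latency, residual-norm z, or ΔH); the function name and q=90 default are assumptions.
```python
import numpy as np

def bottleneck_layers(signal, q=90):
    """Indices of layers at or above the q-th percentile of `signal`."""
    signal = np.asarray(signal, dtype=float)
    return np.flatnonzero(signal >= np.percentile(signal, q))
```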
### What Code Generation Decisions It Reveals
**Specific insights for developers:**
1. **Emergence of syntax:** At which layer does the model "realize" it's generating a function?
   - Likely when the indentation pattern appears and the `def` keyword is generated
   - Measure: residual-norm spike at the layer where syntactic structure emerges
   - Example: Layer 5 shows high residual-z when generating `def factorial(n):`
2. **Semantic shift:** Can we observe when the model transitions from "reading the prompt" (early layers) to "generating code" (late layers)?
   - Early layers: high attention to prompt tokens, low residual-norm
   - Mid layers: residual-norm increases (processing semantics)
   - Late layers: attention shifts to recently generated tokens (auto-regressive generation)
3. **Error propagation:** If the model generates a bug at token T, can we trace back to which layer introduced the error?
   - Look for an entropy spike or residual-norm anomaly in the layers before T
   - Example: Model generates the wrong variable name at token 50 → entropy jumps at layer 18 → investigate what happened at layer 18
4. **Compute allocation:** Which layers consume the most compute? (Implications for model optimization)
   - Latency profiling shows bottleneck layers
   - Pruning candidates: layers with low residual-norm (minimal transformation) + high latency
### Extension Beyond Existing Literature
**Bansal et al. (2022) on in-context learning at 66B scale:**
- Analyzed layer contributions to ICL via interventions
- Focused on language tasks (not code)
- No interactive visualization for non-ML-experts
- Static analysis (not real-time exploration)
**Your extension:**
- **Code-specific annotations:** Label layers with code-relevant milestones:
  - "Layer 8: syntax tree formed"
  - "Layer 20: variable scope resolved"
  - "Layer 28: stylistic formatting applied"
- **Multi-token tracking:** Show pipeline evolution across multiple generated tokens (not just one forward pass)
- **Developer-friendly abstractions:** Avoid technical jargon (hidden states, residual stream) → use "understanding evolution", "decision stages"
- **Comparative pipelines:** Show the pipeline for correct vs buggy outputs side-by-side (where do they diverge?)
**Interpretability papers (general):**
- Focus on probing classifiers to test "what does layer X know?"
- Require training additional models (probes)
- Not interactive or real-time
**Your extension:**
- **No additional training:** Use intrinsic signals (residual-norm, entropy)
- **Real-time:** Compute signals during generation (<10ms overhead)
- **Actionable:** Developer can bypass layers to test hypotheses
### Novel Contribution: Layer-Level Taxonomy for Code Generation
**Gap in literature:** No established taxonomy of what each transformer layer does during **code generation** specifically.
- Zheng et al. (2025) survey attention heads, but not layer-level roles
- Interpretability papers focus on language tasks (next-word prediction, sentiment, Q&A)
- Code generation is different: it requires syntax, semantics, formatting, and executable correctness
**Your contribution:** Empirically identify layer specialization for code:
1. **Layers 1-5: Tokenization + basic syntax**
   - Residual-norm spikes when processing delimiters, keywords
   - Attention focuses on local syntax (brackets, colons)
2. **Layers 6-15: Semantic understanding**
   - Residual-norm increases during identifier resolution
   - Attention to variable declarations, type hints, docstrings
   - Entropy decreases (model becomes more certain about semantics)
3. **Layers 16-25: Reasoning/logic**
   - Residual-norm spikes during control flow generation (if/else, loops)
   - Attention to prompt logic + recently generated code
   - Entropy may increase temporarily (exploring logical alternatives)
4. **Layers 26-32: Fluency/formatting**
   - Low residual-norm (minor refinements)
   - Attention to recent tokens (auto-regressive)
   - Entropy decreases (finalizing token choices)
**If validated, this would be novel for code LLMs and could be the Paper 1 contribution.**
### Developer-Facing Research Questions
**RQ1.13: Layer Decision Identification**
Can developers identify at which layer the model "decides" on code structure (e.g., loop vs conditional)?
**Hypothesis H1.13:** Developers using the pipeline visualization will:
- Correctly identify the decision layer within ±3 layers in >55% of cases
- Report increased understanding of the model's "thinking process" (>75% agreement)
- Measured by: layer identification accuracy (ground truth = residual-norm + entropy spike analysis), survey responses
**RQ1.14: Next-Token Prediction Improvement**
Does seeing pipeline evolution improve developers' ability to predict subsequent tokens?
**Hypothesis H1.14 (from spec):** Pipeline summaries improve next-token prediction accuracy
- Developers predict the next token after seeing the pipeline → compare with baseline (no pipeline)
- Expected improvement: +10-15 percentage points in top-3 accuracy
- Measured by: prediction task (5 examples per participant)
**RQ1.15: Error Localization**
Can developers use the pipeline visualization to diagnose *where* in the model an error originates?
**Hypothesis H1.15:** Developers will:
- Identify the error-causing layer within ±5 layers in >50% of cases
- Reduce time to diagnose the error source by ≥20% vs baseline
- Measured by: layer identification accuracy, time to diagnosis
**RQ1.16: Actionable Insights for Prompting**
Can developers use layer knowledge to improve prompts?
**Hypothesis H1.16:** After seeing the pipeline, developers will:
- Adjust prompts to provide more context for early layers (syntax/semantics) in >30% of cases
- Report understanding of "what the model needs" (>70% agreement)
- Measured by: prompt modification patterns in telemetry, survey responses
---
## Cross-Cutting Contributions
### 1. Unified Glass-Box Dashboard
**Gap in literature:** Prior work (Kou et al., Paltenghi et al., Zhao et al.) focuses on **single mechanisms** in isolation.
**Your dashboard integrates:**
- **Attention** (spatial attribution)
- **Token Size & Confidence** (probabilistic uncertainty + tokenization)
- **Ablation** (causal attribution)
- **Pipeline** (temporal evolution)
**Developers can triangulate across multiple lenses:**
- Example: "Low confidence + scattered attention + early-layer bottleneck → likely hallucination"
- Example: "High confidence + focused attention, but ablating head X fixes the bug → head X is overriding correct information"
**This holistic view is novel for code generation interpretability.**
### 2. Task-Based Developer Study
**Gap:** Most interpretability papers evaluate on:
- Synthetic tasks (toy models, simple examples)
- Researcher-driven analysis (no end-users)
- Post-hoc metrics (accuracy, perplexity)
**Your study evaluates with:**
- **18-24 software engineers** doing realistic code tasks (bug detection, code review, prompt optimization)
- **In-the-loop:** Developers use the visualizations during the task (not passive observation)
- **Actionable interpretability:** Measure whether the visualizations improve task performance (time, accuracy, trust)
**This is HCI-grounded interpretability research**, not just ML analysis.
### 3. Code Generation Domain Specificity
**Gap:** Explainability surveys (Zhao et al.) are domain-agnostic. Code has unique properties:
- **Syntactic correctness is binary** (parsable or not) → enables AST-based metrics
- **Semantic correctness is testable** (unit tests) → enables test-based metrics
- **Developer expertise varies** (junior vs senior) → enables expertise-based analysis
**Your visualizations are tailored to code:**
- **Syntax highlighting** in attention maps (keywords, identifiers, operators color-coded)
- **Tokenization awareness** for identifiers (rare in NLP interpretability)
- **Ablation targeting code-specific heads** (bracket matching, indentation, API usage)
- **Pipeline stages mapped to code generation phases** (syntax → semantics → logic → formatting)
### 4. Interventionist Interpretability
**Gap:** Most explainability tools are **passive** (they show model behavior).
**Your dashboard is active:**
- **Ablation allows causal intervention** ("What if I remove this head?")
- **Confidence allows alternative exploration** ("What else could the model have generated?")
- **Pipeline allows temporal investigation** ("Where did the model's understanding emerge?")
**Developers don't just observe - they manipulate and test hypotheses.**
**This is closer to scientist-model interaction (hypothesis-driven) than user-model consumption (passive).**
---
## Literature Positioning Summary
| Your Contribution | Related Work | Gap You Address |
|-------------------|--------------|-----------------|
| **Attention Viz** | Kou et al. (2024) - attention alignment | Interactive, per-head, code-specific, hypothesis-driven |
| **Token Confidence** | Zhao et al. (2024) - prob explanations | Tokenization awareness, code thresholds, bug prediction |
| **Ablation Viz** | Wang et al. (2022) - mechanistic interpretability | Developer-facing, real-time, code metrics (tests/AST) |
| **Pipeline Viz** | Bansal et al. (2022) - layer interventions | Code-specific stages, interpretable signals, interactive |
| **Unified Dashboard** | - | First multi-mechanism glass-box for code LLMs |
| **Developer Study** | Paltenghi et al. (2022) - eye-tracking | Task-based, in-the-loop, actionable metrics |
| **Code Specificity** | - | Syntax/test metrics, tokenization, developer expertise |
| **Interventionist** | - | Ablation, alternatives, hypothesis testing |
---
## Thesis Structure Suggestions
### Chapter 1: Introduction
- **Motivation:** Developers treat LLMs as black boxes → trust issues, debugging difficulties
- **Gap:** Prior work lacks interactive, developer-facing, multi-mechanism dashboards for code
- **Contribution:** First glass-box dashboard integrating 4 interpretability lenses + developer study
### Chapter 2: Literature Review
- **Section 2.1:** Attention in LLMs (Zheng et al., Kou et al.)
- **Section 2.2:** Explainability methods (Zhao et al.)
- **Section 2.3:** Code generation LLMs (Bistarelli et al.)
- **Section 2.4:** Developer-AI interaction (Paltenghi et al.)
- **Section 2.5:** Mechanistic interpretability (Wang et al., Bansal et al.)
### Chapter 3: Methodology (RQ1 Focus)
- **Section 3.1:** Attention Visualization
- **Section 3.2:** Token Size & Confidence Visualization
- **Section 3.3:** Ablation Visualization
- **Section 3.4:** Pipeline Visualization
- **Section 3.5:** Dashboard Integration
### Chapter 4: User Study Design
- **Section 4.1:** Participants (n=18-24 software engineers)
- **Section 4.2:** Tasks (T1, T2, T3)
- **Section 4.3:** Metrics (quantitative + qualitative)
- **Section 4.4:** Protocol (within-subjects, Latin square)
### Chapter 5: Results
- **Section 5.1:** RQ1.1-RQ1.4 (Attention)
- **Section 5.2:** RQ1.5-RQ1.8 (Token Confidence)
- **Section 5.3:** RQ1.9-RQ1.12 (Ablation)
- **Section 5.4:** RQ1.13-RQ1.16 (Pipeline)
- **Section 5.5:** Cross-Cutting Themes
### Chapter 6: Discussion
- **Section 6.1:** Interpretability for Developers (not just researchers)
- **Section 6.2:** Code-Specific Insights (tokenization, syntax, tests)
- **Section 6.3:** Limitations & Future Work
### Chapter 7: Conclusion
- **Summary of Contributions**
- **Implications for Practice** (tool design for developers)
- **Implications for Research** (novel layer taxonomy, ablation as debugging)
---
## ICML Paper 1 Suggestions
**Title:** "Making Transformer Architecture Transparent for Code Generation: A Developer-Centric Study"
**Abstract Structure:**
1. **Problem:** Developers use code LLMs as black boxes → trust/debugging issues
2. **Gap:** Prior interpretability work is not developer-facing or code-specific
3. **Solution:** Glass-box dashboard with 4 visualizations (Attention, Token Confidence, Ablation, Pipeline)
4. **Study:** n=18-24 software engineers on 3 code tasks
5. **Results:** (placeholder for actual results)
   - Attention viz improves source identification (H1-Attn)
   - Token confidence flags predict bugs (H2-Tok, AUC ≥ 0.70)
   - Ablation reduces debugging iterations (H3-Abl, -20%)
   - Pipeline improves error localization (H4-Pipe)
6. **Contribution:** First empirical evidence that multi-mechanism interpretability tools improve developer performance on code tasks
**Sections:**
1. Introduction
2. Related Work
3. Dashboard Design (4 visualizations)
4. User Study
5. Results
6. Discussion
7. Conclusion
**Target:** ICML 2026 (submission ~January 2026)
---
**End of RQ1 Mapping Document**