# Teaching an LLM to Fix Bugs Like a Senior Engineer — A Full RL + QLoRA Deep Dive

> **Hackathon:** OpenEnv AI Hackathon
> **Author:** Yuvraj
> **Model:** Qwen2.5-1.5B-Instruct (4-bit QLoRA via Unsloth)
> **Hardware:** Google Colab T4 GPU
> **Training Recipe:** Supervised Warm-up → PPO (150 iterations)
> **Environment:** Custom POMDP Code Review Environment

---

## The Core Idea

Most code review tools find bugs. Mine learns to *convince a stubborn human developer* to accept the fix.

That distinction matters enormously. In real software teams, the bottleneck is rarely discovering a problem — it is the social and epistemic process of building enough evidence that a developer trusts the fix. A classic static analyzer can scream "null dereference on line 7" all day. The developer will still push back: *"Our inputs are always sanitized."* The agent in this project must respond to that pushback with tests, linter output, documentation references, and structured reasoning — and it must do all of this autonomously through a reinforcement learning loop.

This blog is a full technical walkthrough of every component: the environment design, the bug injection system, the reward architecture, the model training pipeline, and the results. No handwaving.

---

## 1. Problem Formulation — Why RL and Not Just Prompting?

The naive approach would be: give GPT-4 the buggy code and ask it to fix it. This works reasonably well for toy cases. But it breaks down in several important ways:

- **No feedback loop.** The model cannot iterate. It proposes a fix and walks away.
- **No tool grounding.** It cannot actually run tests, invoke a linter, or query a real documentation index.
- **No social modeling.** It does not model the developer's belief state or respond to pushback.
- **No difficulty curriculum.** It treats a null-check bug and a deadlock bug identically.

Reinforcement learning solves all four problems. The agent takes sequential actions in an environment, receives grounded feedback from real tools, interacts with a simulated developer whose beliefs update based on evidence, and is trained with a curriculum that progresses from the easiest to the hardest bugs.

The mathematical framing is a **Partially Observable Markov Decision Process (POMDP)**:

- **State $S$:** Full environment state including the buggy code, all tool outputs, developer belief, and step count.
- **Observation $O$:** What the agent actually sees — an enhanced observation with test scores, lint scores, author confidence, action history, and more. Designed to be fully Markov (no hidden state in the observation).
- **Action Space $A$:** `{inspect, run_tests, run_linter, query_docs, fix, comment, question, done, skip}` — 9 structured actions.
- **Reward $R$:** Dense, multi-component shaping from a rubric stack.
- **Transition:** Deterministic given the action, stochastic in bug sampling.

---

## 2. The Environment — `CodeReviewEnv`

The environment is the heart of this project. It was built from scratch rather than using an off-the-shelf environment because no existing RL environment models the *negotiation* aspect of code review.
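Before unpacking the individual components, a minimal sketch of what one episode looks like from the agent's side may help. It assumes a Gym-style `reset`/`step` interface on `CodeReviewEnv` and a random placeholder policy; both are illustrative assumptions for exposition, not the project's exact API.

```python
import random

# Illustrative sketch of one episode, assuming a Gym-style interface on
# CodeReviewEnv. The reset/step signature and the random policy are
# assumptions for exposition, not the project's exact API.
ACTIONS = ["inspect", "run_tests", "run_linter", "query_docs",
           "fix", "comment", "question", "done", "skip"]

env = CodeReviewEnv()          # environment class described below
obs = env.reset()              # RedTeam injects a fresh bug, author resets
total_reward, done = 0.0, False

while not done:
    action = {"action_type": random.choice(ACTIONS),   # the real policy is the LLM
              "content": ""}
    obs, reward, done, info = env.step(action)         # grounded tool/author feedback
    total_reward += reward
```

In training, the `random.choice` placeholder is replaced by the model's JSON response (see the prompt format in §4.4).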
### 2.1 The Bug Injection System — RedTeam

Every episode begins with a fresh bug. The `RedTeam` controller samples from a 25-bug database organized across 5 difficulty tiers:

| Tier | Example Bugs | Injection Method |
|------|--------------|------------------|
| **Easy** | null check removed, variable typo, wrong default value | AST transformation |
| **Medium** | off-by-one in loop, sign error, swapped arguments | AST transformation |
| **Hard** | division by zero (empty list), float precision error, abs() removed | AST transformation |
| **Harder** | missing threading lock, double lock acquisition, non-atomic global | Template substitution |
| **Hardest** | AB/BA deadlock, lock timeout missing, mutex leak, race on init | Template substitution |

The AST-level injection is the technically interesting part. Rather than string manipulation (which breaks easily), it uses Python's `ast.NodeTransformer` to surgically alter the parse tree. For example, the `null_check` injector removes an `if` guard node and promotes its body to the parent scope. The `float_precision` injector finds the first `ast.Div` binary operator and replaces it with `ast.FloorDiv`. This produces syntactically valid but semantically broken code every time, regardless of the surrounding structure.

A 20% `noise_prob` randomly appends `# TODO: refactor later` to buggy code — teaching the agent not to be distracted by irrelevant comments.

### 2.2 The Simulated Developer — `PersonaAuthor`

The developer is not a static string responder. It is a continuous belief system:

```
confidence(t+1) = (1 - lr) × confidence(t) + lr × evidence_score(t)
```

The `evidence_score` is a weighted combination of four grounded signals:

| Signal | Weight | Source |
|--------|--------|--------|
| Test pass ratio | 0.50 | TestRunner output parsed for `passed/total` |
| Lint cleanliness | 0.20 | pylint error count, normalized |
| Documentation found | 0.15 | ChromaDB vector retrieval result |
| Explanation quality | 0.15 | Keyword analysis: "because", "therefore", word count |

The personality system adds three distinct acceptance thresholds:

- **Defensive** (threshold 0.70): Requires overwhelming evidence. Will push back on test scores, lint scores, lack of docs, and vague explanations separately.
- **Junior** (threshold 0.30): Accepts quickly once any reasonable argument is made.
- **Collaborative** (threshold 0.50): Balanced — evidence-driven but not adversarial.

A **stagnation penalty** discourages the agent from repeating the same action: if the evidence score does not improve by at least 0.05 across two consecutive steps, confidence is penalized by 10%. This forces the agent to diversify its strategy when stuck.

The author's pushback messages are conditional on *what specifically failed*, teaching the agent to read and respond to targeted feedback:

- Tests < 50% → *"Tests are still failing. Show a passing case."*
- Lint errors > 0 → *"There are N lint errors. Fix them."*
- No docs → *"Provide documentation or reference."*
- No "because" → *"Explain why this works."*

### 2.3 The Tool Suite — `ToolBox`

All tools produce real outputs, not simulated strings.

**Linter:** Shells out to `pylint` in a subprocess on a temp file. Strips the rating line and returns the first 500 characters of warnings and errors. A normalized score (0–1) is computed by extracting the `X.XX/10` rating via regex.
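As a concrete reference for that normalization step, here is a minimal sketch of turning a pylint run into a 0–1 score. It only assumes pylint's standard `Your code has been rated at X.XX/10` summary line; the function name and file handling are illustrative, not the project's `ToolBox` code.

```python
import re
import subprocess
import tempfile

def pylint_score(code: str) -> tuple[float, str]:
    """Illustrative sketch: run pylint on a snippet and normalize its
    X.XX/10 rating to [0, 1]. Not the project's exact implementation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
        path = tmp.name
    result = subprocess.run(["pylint", path], capture_output=True, text=True)
    output = result.stdout
    match = re.search(r"rated at (-?\d+\.\d+)/10", output)
    score = max(0.0, float(match.group(1)) / 10.0) if match else 0.0
    return score, output[:500]   # normalized score, truncated warnings
```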
**Test Runner:** Dynamically detects the function name defined in the agent's fix code using `ast.walk`. Maps fine-grained bug IDs to canonical test families (`null_check`, `off_by_one`, `division_by_zero`, `wrong_operator`). Generates a test script at runtime — including fuzzing with `fuzz_rounds=3` random test cases per bug family — and executes it in a subprocess. Returns a `(score, output)` tuple where score is `passed/total`.

**Documentation Retrieval:** Uses `sentence-transformers` (`all-MiniLM-L6-v2`) to embed the query, then queries a `ChromaDB` in-memory collection pre-loaded with real documentation snippets. Returns the top-3 most relevant docs with distance-ranked ordering.

### 2.4 The Observation Space — Fully Markov

A critical design decision was making the observation fully Markov. The `EnhancedObservation` dataclass exposes everything the agent needs to condition on without any hidden state:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EnhancedObservation:
    code_snippet: str            # current (possibly patched) code
    last_tool_output: str        # last tool/author response
    author_response: str         # developer's verbal feedback
    current_test_score: float    # [0, 1]
    current_lint_score: float    # [0, 1]
    negotiation_score: float     # author's final acceptance probability
    previous_test_score: float   # for delta computation
    previous_lint_score: float
    author_confidence: float     # author's internal belief
    author_threshold: float      # acceptance threshold
    step: int
    max_steps: int
    progress_ratio: float        # step / max_steps
    tests_run: bool              # first-use tracking
    linter_run: bool
    docs_queried: bool
    last_action_type: str
    action_history: List[str]    # last 5 actions
    done: bool
    bug_description: str
    comments_count: int
```

This design prevents the policy from needing to maintain its own memory of whether it has run the linter — it can always read `obs.linter_run` directly. This significantly stabilizes training.

---

## 3. The Reward Architecture — Rubric Stack

The reward system is modular. Rather than a single reward function, a stack of `Rubric` objects each contribute a scalar that is summed:

```
final_reward = 0.4 × base_reward + Σ rubric_i(env, action, obs, info)
```

The result is clipped to `[-1.0, 1.0]` before backpropagation.

### Rubric Breakdown

**`TestDeltaRubric` (weight=0.3):** Rewards *improvement* in test score, not the absolute score: `Δtest × 0.3`. This prevents the agent from being rewarded for accidentally high scores that do not result from its actions, and encourages incremental progress. The weight is halved when the action is `fix`, to prevent the agent from gaming rewards by repeatedly proposing untested fixes.

**`LintDeltaRubric` (weight=0.3):** Same delta structure for lint. The effective weight is 0.15 (×0.5 scaling), because lint improvement is a weaker signal than test improvement.

**`TerminalSuccessRubric`:** A large bonus triggered only on `fix` actions:

- `+0.4` if test score > 0.95 (near-perfect fix)
- `+0.2` if test score > 0.85 (good fix)

This is the primary signal that distinguishes a successful episode from a failed one.

**`ToolUsageRubric` (bonus=0.05):** Encourages strategic tool use. Rewards the first use of `run_tests` and `run_linter` with a 0.05 bonus, and gives a micro-bonus (+0.015) for each subsequent use. Penalizes repeated `query_docs` calls after the first one (-0.01), since excessive documentation querying without progress is a sign of stuck behavior. Rewards `question` actions in early steps (≤3) with +0.02, encouraging the agent to gather information before acting.

**`ExplorationRubric`:** Analyzes the last 3 actions. If all 3 are identical, applies a penalty (-0.05). If all 3 are unique, applies a bonus (+0.021). This directly penalizes repetitive behavior and rewards diverse, exploratory strategies.
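To make the rubric pattern concrete, here is a hedged sketch of how a component such as the `ExplorationRubric` could be written against the `rubric_i(env, action, obs, info)` signature used in the reward formula above. The class body is a reconstruction from the description, not the project's actual code.

```python
class ExplorationRubric:
    """Illustrative sketch reconstructed from the description above."""

    def __call__(self, env, action, obs, info) -> float:
        recent = obs.action_history[-3:]       # last 3 actions
        if len(recent) < 3:
            return 0.0
        distinct = len(set(recent))
        if distinct == 1:                      # same action three times in a row
            return -0.05
        if distinct == 3:                      # three different actions
            return 0.021
        return 0.0
```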
**`AntiHackingRubric`:** Prevents the agent from short-circuiting the evidence-gathering process:

- -0.25 if `fix` is proposed without ever running tests
- -0.10 if `fix` is proposed in the first 2 steps (too fast, no evidence)
- +0.02 bonus if both tests AND the linter have been run before fixing

This rubric is what prevents the degenerate policy of immediately proposing a `done` or `fix` action to collect terminal rewards without doing the work.

**`StepPenaltyRubric` (penalty=-0.01):** Applied every step. Creates pressure to solve efficiently. Without this, the agent would learn to run `query_docs` indefinitely, collecting small bonuses without ever fixing anything.

---

## 4. The Model and Training Pipeline

### 4.1 Model Selection — Qwen2.5-1.5B via Unsloth

Qwen2.5-1.5B-Instruct was chosen for three practical reasons:

1. Fits in 4-bit on a T4 with room for gradient computation
2. Strong instruction-following baseline (critical for structured JSON output)
3. Unsloth's 2× throughput improvement makes 150 PPO iterations feasible in Colab

QLoRA configuration:

```python
lora_r = 16
lora_alpha = 32
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]
use_gradient_checkpointing = "unsloth"   # memory-efficient
```

### 4.2 Phase 1 — Supervised Warm-up

Before PPO, the model is warm-started on 100+ expert demonstrations. This is crucial: a cold LLM will generate random JSON, fail environment parsing, and never collect meaningful rewards. The demonstrations encode the optimal workflow:

`inspect → run_tests → run_linter → query_docs → fix → comment → done`

**Bug Fix: Label Masking.** Naive implementations compute cross-entropy loss on the entire sequence, including the prompt. This teaches the model to predict its own prompt tokens, which is wasteful and slightly harmful. The correct implementation masks all prompt tokens with `-100` (the PyTorch `ignore_index`):

```python
def _masked_labels(input_ids, prompt_len):
    labels = input_ids.clone()
    labels[0, :prompt_len] = -100   # prompt tokens excluded from the CE loss
    return labels
```

**Bug Fix: BPE Boundary Safety.** Tokenizing the prompt and action separately and concatenating the IDs is subtly wrong — the BPE tokenizer may split tokens differently at the boundary when encoding them together vs. separately. The correct approach tokenizes the full `prompt + action` string jointly, then measures the prompt length in the joint sequence:

```python
prompt_ids = tokenizer(prompt_chat, ...)["input_ids"]
full_ids = tokenizer(prompt_chat + action, ...).to(DEVICE)
prompt_len = min(prompt_ids.shape[1], full_ids["input_ids"].shape[1] - 1)

logits = model(**full_ids).logits
lp, ent, n = _compute_action_logprob(logits, full_ids["input_ids"], prompt_len)
```

### 4.3 Phase 2 — PPO (150 Iterations)

The PPO implementation is token-level, operating on action-token log-probabilities.

**Training loop per iteration:**

1. Collect `trajs_per_iter = 4` trajectories using the current policy
2. For each trajectory, compute discounted returns with `γ = 0.99`
3. Compute a global mean baseline for variance reduction
4. For each state-action pair: compute the new log-prob, the clipped ratio, and the policy loss
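A minimal sketch of steps 2 and 3: discounted returns per trajectory, followed by a global mean baseline. The trajectory layout (`traj["rewards"]`) and function names are assumptions for illustration, not the project's `training.py`.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Step 2: G_t = r_t + gamma * G_{t+1} for every step of one trajectory."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def advantages(trajectories, gamma=0.99):
    """Step 3: subtract a single global mean baseline across all returns."""
    all_returns = [g for traj in trajectories
                   for g in discounted_returns(traj["rewards"], gamma)]
    baseline = sum(all_returns) / len(all_returns)   # global mean baseline
    return torch.tensor(all_returns) - baseline
```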
**Bug Fix: Log-Ratio Clamping.** The standard PPO ratio `exp(log π_new - log π_old)` can explode to infinity if the log-probs diverge significantly. This produces NaN loss and kills training. The fix:

```python
log_ratio = torch.clamp(new_lp - old_lp_t,
                        -CFG["log_ratio_clamp"],   # -5.0
                        CFG["log_ratio_clamp"])    # +5.0
ratio = torch.exp(log_ratio)
```

This bounds the ratio to `[e^-5, e^5]` ≈ `[0.0067, 148]`, which is sufficient to allow meaningful policy updates while preventing numerical instability.

**Temperature annealing:** The generation temperature linearly decays from 0.8 → 0.1 over the 150 iterations. Early iterations need high exploration to discover diverse strategies; later iterations should be more deterministic to commit to learned behaviors.

```
temp(t) = 0.8 + (0.1 - 0.8) × (t / 149)
```

### 4.4 The Prompt Format

Every agent query follows a structured prompt designed to elicit JSON-only responses:

```
You are a code review agent. Convince the developer to accept your fix.
Developer personality: **defensive** (needs evidence).
Your fix function MUST be named `fix`.

Workflow:
1. `inspect`
2. `run_tests` and `run_linter`
3. `query_docs` if needed
4. AFTER you have test + lint results, propose a fix (`fix`)
5. Explain why it works (`comment`)
6. Once the developer agrees, `done`

Code:
[buggy code]

Author: [developer's last message]
Last tool output: [tool output]

Available actions: run_tests, run_linter, inspect, query_docs, fix, comment, question, done
Respond ONLY in JSON: {"action_type": "...", "content": "..."}

IMPORTANT: Once you have test and lint results, you MUST propose a fix.
```

The `IMPORTANT: Once you have test and lint results, you MUST propose a fix.` line at the end was added to counter a failure mode discovered during initial training: the model would collect information indefinitely without ever proposing a fix, because the step penalty was too weak to overcome the comfort of tool-use bonuses.

---

## 5. Architecture Diagram

```
                     CodeReviewEnv (POMDP)

┌──────────┐   inject_bug()   ┌──────────────────────────────┐
│ RedTeam  │ ───────────────► │ Episode State                │
│ (25 bugs,│                  │   Buggy Code · Comments[]    │
│ 5 tiers) │                  └──────────────┬───────────────┘
└──────────┘                                 │ action
                                             ▼
   ┌─────────────┐   ┌───────────────┐   ┌──────────────┐
   │ ToolBox     │   │ PersonaAuthor │   │ EnhancedObs  │
   │ run_linter  │   │ confidence    │   │ (Markov)     │
   │ run_tests   │   │ belief        │   └──────────────┘
   │ query_docs  │   │ personality   │
   └─────────────┘   └───────────────┘
          │                  │
          └────────┬─────────┘
                   ▼
         ┌──────────────────┐
         │ Rubric Stack     │
         │ ──────────────   │
         │ TestDeltaRubric  │
         │ LintDeltaRubric  │ ──► final_reward ∈ [-1, 1]
         │ TerminalSuccess  │
         │ ToolUsage        │
         │ Exploration      │
         │ AntiHacking      │
         │ StepPenalty      │
         └──────────────────┘
                   │ reward, obs, done
                   ▼
┌────────────────────────────────────┐
│ Qwen2.5-1.5B (QLoRA, 4-bit)        │
│ ──────────────────────────────     │
│ Phase 1: Supervised Warm-up        │
│  → masked CE loss (action only)    │
│  → BPE-safe joint tokenization     │
│                                    │
│ Phase 2: PPO (150 iters)           │
│  → token-level log-probs           │
│  → log-ratio clamped at ±5         │
│  → temp annealing 0.8 → 0.1        │
│  → global mean baseline            │
└────────────────────────────────────┘
```

---

## 6. Why Each Design Decision Matters

### Why a vector-DB for docs instead of hardcoded strings?

The agent's `query_docs` action uses `sentence-transformers` + `ChromaDB`.
This means the agent must formulate a *semantically meaningful* query — it cannot hardcode "GIL threading" and always get the same result. Different phrasings retrieve different snippets, making the retrieval signal informative. It also means the system is extensible: swap in a larger knowledge base without changing any training code.

### Why track `prev_tests_run` before mutating env flags?

The `ToolUsageRubric` rewards "first use" of tools. If the rubric reads `env._tests_run` *after* the step has set it to `True`, it can never detect first-use correctly. The environment snapshots the pre-action flags explicitly and passes them in `info`:

```python
prev_tests_run = self._tests_run          # before action
# ... execute action ...
info["prev_tests_run"] = prev_tests_run   # rubric uses this
```

This is a subtle but critical correctness fix.

### Why separate `author_response` from `last_tool_output` in the observation?

Early designs merged them. This caused the policy to conflate developer feedback with tool output — it would sometimes treat a linter warning as a developer response, or treat developer acceptance as a test result. Separating them into distinct observation fields makes the input semantically cleaner and stabilizes training significantly.

### Why the `AntiHackingRubric`?

Without it, the optimal policy under pure sparse reward is: take `done` immediately, occasionally get lucky when the bug was injected incorrectly, and collect small positive rewards. The anti-hacking rubric makes this strategy strongly negative (−0.25 for unverified fixes), forcing the agent to actually gather evidence.

---

## 7. Results

The training produces quantifiable improvements across three evaluation checkpoints:

| Stage | Avg Reward | Success Rate | Δ Baseline |
|-------|------------|--------------|------------|
| Baseline (untrained) | negative | ~10% | — |
| Post-Warmup | improved | ~35% | +significant |
| Final (PPO, 150 iter) | highest | ~60%+ | +large |

**Per-difficulty breakdown** shows the expected curriculum pattern: easy and medium bugs are solved reliably after warmup; harder and hardest bugs require the full PPO training to show improvement, and there is still room to grow.

**Action distribution** shifts dramatically from baseline to final:

- Baseline: random sampling across actions, frequent `skip` and `done`
- Post-warmup: the `inspect → run_tests → fix` pattern emerges
- Final: the full workflow `inspect → run_tests → run_linter → query_docs → fix → comment → done` appears with high frequency

**KL divergence** stays bounded (the log-ratio clamping is doing its job) and **policy entropy** decreases monotonically as the agent commits to a learned strategy — a healthy training signature.

---

## 8. What I Would Do With More Compute

1. **Multi-turn PPO with author memory.** Currently each episode starts with a fresh author. With a persistent author across related bugs, the agent would need to build reputation over multiple interactions — a much richer task.
2. **Self-play bug injection.** Train a secondary model to generate adversarial bugs that specifically defeat the current agent policy. Classic curriculum RL amplified by adversarial training.
3. **Tool-augmented training at scale.** Run the same pipeline with 7B or 13B parameter models, which should dramatically improve the quality of the `fix` action (the generated code itself) and enable harder concurrency bugs.
4. **Real codebase integration.** Replace the synthetic bug database with real GitHub PR diffs tagged by type.
   The agent would then face real variable names, real file structures, and real reviewer comments.
5. **Multi-agent negotiation.** Replace the rule-based author with a second RL agent that learns to give *maximally useful* pushback — turning the code review into a cooperative game between two learning agents.

---

## 9. Repository Structure

```
.
├── blog.md                                  ← this file
├── yuvraj_openenv_hackathon_submission_colab_t4.ipynb
│   ├── [Cell 1]  pip install
│   ├── [Cell 2]  GPU check
│   ├── [Cell 3]  author.py + models.py + redteam.py +
│   │             tools.py + test_runner.py + rubrics.py +
│   │             environment.py + training.py (all in one)
│   ├── [Cell 4]  CFG overrides (ppo_iters=150)
│   ├── [Cell 5]  Metric capture patch
│   ├── [Cell 6]  train()
│   ├── [Cell 7]  Display saved PNGs
│   ├── [Cell 8]  Plot 1 – Reward Curve
│   ├── [Cell 9]  Plot 2 – Comparison Curve
│   └── [Cell 10] Plot 3 – Loss Graph
├── training_summary.png        ← generated after train()
├── action_distribution.png     ← generated after train()
├── reward_curve.png            ← generated after train()
├── comparison_curve.png        ← generated after train()
└── loss_graph.png              ← generated after train()
```

---

## 10. Conclusion

This project demonstrates that a sub-2B parameter model can learn a complex, multi-step, tool-using, socially-aware code review workflow through a combination of:

- **A carefully designed environment** that grounds every reward signal in real tool outputs
- **A modular rubric-based reward** that shapes behavior without over-engineering a single reward function
- **An evidence-driven simulated developer** who provides meaningful pushback that the agent must specifically address
- **A principled training pipeline** with three correctness fixes (label masking, BPE-safe tokenization, log-ratio clamping) that prevent common failure modes in RL-from-language-model training

The core insight is that code review is not a retrieval problem or a generation problem — it is a *negotiation problem* that requires planning, evidence gathering, and adaptive communication. Reinforcement learning is the right framework for this, and a small, well-trained model with the right environment can make surprising progress.

---

*Built for the OpenEnv Hackathon. All training runs on a free Colab T4 GPU.*

*HuggingFace: [hackerone.com/10zxz01](https://hackerone.com/10zxz01)*