Teaching an LLM to Fix Bugs Like a Senior Engineer: A Full RL + QLoRA Deep Dive
Hackathon: OpenEnv AI Hackathon
Author: Yuvraj
Model: Qwen2.5-1.5B-Instruct (4-bit QLoRA via Unsloth)
Hardware: Google Colab T4 GPU
Training Recipe: Supervised Warm-up → PPO (150 iterations)
Environment: Custom POMDP Code Review Environment
The Core Idea
Most code review tools find bugs. Mine learns to convince a stubborn human developer to accept the fix.
That distinction matters enormously. In real software teams, the bottleneck is rarely discovering a problem: it is the social and epistemic process of building enough evidence that a developer trusts the fix. A classic static analyzer can scream "null dereference on line 7" all day. The developer will still push back: "Our inputs are always sanitized." The agent in this project must respond to that pushback with tests, linter output, documentation references, and structured reasoning, and it must do all of this autonomously through a reinforcement learning loop.
This blog is a full technical walkthrough of every component: the environment design, the bug injection system, the reward architecture, the model training pipeline, and the results. No handwaving.
1. Problem Formulation: Why RL and Not Just Prompting?
The naive approach would be: give GPT-4 the buggy code and ask it to fix it. This works reasonably well for toy cases. But it breaks down in several important ways:
- No feedback loop. The model cannot iterate. It proposes a fix and walks away.
- No tool grounding. It cannot actually run tests, invoke a linter, or query a real documentation index.
- No social modeling. It does not model the developer's belief state or respond to pushback.
- No difficulty curriculum. It treats a null-check bug and a deadlock bug identically.
Reinforcement learning solves all four problems. The agent takes sequential actions in an environment, receives grounded feedback from real tools, interacts with a simulated developer whose beliefs update based on evidence, and is trained with a curriculum that progresses from easy to hardest bugs.
The mathematical framing is a Partially Observable Markov Decision Process (POMDP):
- State $S$: Full environment state including the buggy code, all tool outputs, developer belief, and step count.
- Observation $O$: What the agent actually sees: an enhanced observation with test scores, lint scores, author confidence, action history, and more. Designed to be fully Markov: everything the policy needs is exposed in the observation, with no hidden state left behind.
- Action Space $A$: {inspect, run_tests, run_linter, query_docs, fix, comment, question, done, skip}, i.e. 9 structured actions.
- Reward $R$: Dense, multi-component shaping from a rubric stack.
- Transition: Deterministic given the action, stochastic in bug sampling.
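To make the framing concrete, here is a minimal sketch of one episode, assuming a gym-style reset/step interface on CodeReviewEnv and an illustrative `policy` callable; this is not the project's exact API.

```python
# One episode under the POMDP framing (illustrative names, not the exact API).
env = CodeReviewEnv()                  # RedTeam injects a fresh bug on reset
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = policy(obs)               # one of the 9 structured actions
    obs, reward, done, info = env.step(action)
    total_reward += reward
```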
2. The Environment: CodeReviewEnv
The environment is the heart of this project. It was built from scratch rather than using an off-the-shelf environment because no existing RL environment models the negotiation aspect of code review.
2.1 The Bug Injection System: RedTeam
Every episode begins with a fresh bug. The RedTeam controller samples from a 25-bug database organized across 5 difficulty tiers:
| Tier | Example Bugs | Injection Method |
|---|---|---|
| Easy | null check removed, variable typo, wrong default value | AST transformation |
| Medium | off-by-one in loop, sign error, swapped arguments | AST transformation |
| Hard | division by zero (empty list), float precision error, abs() removed | AST transformation |
| Harder | missing threading lock, double lock acquisition, non-atomic global | Template substitution |
| Hardest | AB/BA deadlock, lock timeout missing, mutex leak, race on init | Template substitution |
The AST-level injection is the technically interesting part. Rather than string manipulation (which breaks easily), it uses Python's ast.NodeTransformer to surgically alter the parse tree. For example, the null_check injector removes an if guard node and promotes its body to the parent scope. The float_precision injector finds the first ast.Div binary operator and replaces it with ast.FloorDiv. This produces syntactically valid but semantically broken code every time, regardless of the surrounding structure.
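Concretely, a float_precision-style injector can be written in a few lines. The following is a minimal sketch of the technique, not the notebook's exact class:

```python
import ast

class FloatPrecisionInjector(ast.NodeTransformer):
    """Replace the first true division with floor division (sketch)."""
    def __init__(self):
        self.injected = False

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not self.injected and isinstance(node.op, ast.Div):
            node.op = ast.FloorDiv()   # '/' -> '//': valid syntax, broken semantics
            self.injected = True
        return node

src = "def mean(xs):\n    return sum(xs) / len(xs)\n"
tree = FloatPrecisionInjector().visit(ast.parse(src))
print(ast.unparse(ast.fix_missing_locations(tree)))
# -> def mean(xs):
#        return sum(xs) // len(xs)
```

Because the transformation operates on the parse tree, it works no matter how the surrounding function is written, which is exactly why it beats string manipulation.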
With probability noise_prob = 0.2, the injector randomly appends `# TODO: refactor later` to the buggy code, teaching the agent not to be distracted by irrelevant comments.
2.2 The Simulated Developer: PersonaAuthor
The developer is not a static string responder. It is a continuous belief system:
```
confidence(t+1) = (1 - lr) × confidence(t) + lr × evidence_score(t)
```
The evidence_score is a weighted combination of four grounded signals:
| Signal | Weight | Source |
|---|---|---|
| Test pass ratio | 0.50 | TestRunner output parsed for passed/total |
| Lint cleanliness | 0.20 | pylint error count, normalized |
| Documentation found | 0.15 | ChromaDB vector retrieval result |
| Explanation quality | 0.15 | Keyword analysis: "because", "therefore", word count |
The personality system adds three distinct acceptance thresholds:
- Defensive (threshold 0.70): Requires overwhelming evidence. Will push back on test scores, lint scores, lack of docs, and vague explanations separately.
- Junior (threshold 0.30): Accepts quickly once any reasonable argument is made.
- Collaborative (threshold 0.50): Balanced β evidence-driven but not adversarial.
A stagnation penalty discourages the agent from repeating the same action: if the evidence score does not improve by at least 0.05 across two consecutive steps, confidence is penalized by 10%. This forces the agent to diversify its strategy when stuck.
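Putting the update rule, the evidence weights, and the stagnation penalty together, a minimal sketch looks like this. The learning rate `lr` and the single-step stagnation check are simplifications (the project tracks stagnation across two consecutive steps):

```python
EVIDENCE_WEIGHTS = {"tests": 0.50, "lint": 0.20, "docs": 0.15, "explanation": 0.15}

class AuthorBelief:
    """Sketch of the PersonaAuthor belief dynamics described above."""
    def __init__(self, threshold, lr=0.3):
        self.threshold = threshold   # 0.70 defensive / 0.50 collaborative / 0.30 junior
        self.confidence = 0.0
        self._prev_evidence = 0.0
        self.lr = lr

    def update(self, signals):
        """signals: dict of the four grounded scores, each in [0, 1]."""
        evidence = sum(w * signals[k] for k, w in EVIDENCE_WEIGHTS.items())
        self.confidence = (1 - self.lr) * self.confidence + self.lr * evidence
        if evidence - self._prev_evidence < 0.05:   # stagnation penalty: -10%
            self.confidence *= 0.9
        self._prev_evidence = evidence
        return self.confidence >= self.threshold    # developer accepts the fix?
```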
The author's pushback messages are conditional on what specifically failed, teaching the agent to read and respond to targeted feedback:
- Tests < 50% → "Tests are still failing. Show a passing case."
- Lint errors > 0 → "There are N lint errors. Fix them."
- No docs → "Provide documentation or reference."
- No "because" → "Explain why this works."
2.3 The Tool Suite: ToolBox
All tools produce real outputs, not simulated strings.
Linter: Shells out to pylint in a subprocess on a temp file. Strips the rating line and returns the first 500 characters of warnings and errors. A normalized score (0 to 1) is computed by extracting the X.XX/10 rating via regex.
Test Runner: Dynamically detects the function name defined in the agent's fix code using ast.walk. Maps fine-grained bug IDs to canonical test families (null_check, off_by_one, division_by_zero, wrong_operator). Generates a test script at runtime, including fuzzing with fuzz_rounds=3 random test cases per bug family, and executes it in a subprocess. Returns a (score, output) tuple where score is passed/total.
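The function-name detection is small but load-bearing; the idea, as a sketch:

```python
import ast
from typing import Optional

def detect_function_name(fix_code: str) -> Optional[str]:
    """Return the first function defined in the agent's proposed fix."""
    try:
        tree = ast.parse(fix_code)
    except SyntaxError:
        return None                    # unparseable fix: nothing to test
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            return node.name
    return None
```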
Documentation Retrieval: Uses sentence-transformers (all-MiniLM-L6-v2) to embed the query, then queries a ChromaDB in-memory collection pre-loaded with real documentation snippets. Returns the top-3 most relevant docs with distance-ranked ordering.
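A sketch of that retrieval path; the collection name and snippet texts here are placeholders, not the project's actual knowledge base:

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()                        # in-memory, as in the project
docs = client.create_collection("python_docs")

snippets = [
    "threading.Lock: release in a finally block, or use 'with lock:'.",
    "Acquire multiple locks in a consistent global order to avoid AB/BA deadlock.",
    "sum(xs) / len(xs) raises ZeroDivisionError on an empty list; guard len(xs) == 0.",
]
docs.add(
    ids=[f"doc-{i}" for i in range(len(snippets))],
    documents=snippets,
    embeddings=embedder.encode(snippets).tolist(),
)

hits = docs.query(
    query_embeddings=embedder.encode(["deadlock from acquiring two locks"]).tolist(),
    n_results=3,
)
print(hits["documents"][0])   # top-3 snippets, nearest first
```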
2.4 The Observation Space: Fully Markov
A critical design decision was making the observation fully Markov. The EnhancedObservation dataclass exposes everything the agent needs to condition on without any hidden state:
```python
from dataclasses import dataclass
from typing import List

@dataclass
class EnhancedObservation:
    code_snippet: str            # current (possibly patched) code
    last_tool_output: str        # last tool/author response
    author_response: str         # developer's verbal feedback
    current_test_score: float    # [0, 1]
    current_lint_score: float    # [0, 1]
    negotiation_score: float     # author's final acceptance probability
    previous_test_score: float   # for delta computation
    previous_lint_score: float
    author_confidence: float     # author's internal belief
    author_threshold: float      # acceptance threshold
    step: int
    max_steps: int
    progress_ratio: float        # step / max_steps
    tests_run: bool              # first-use tracking
    linter_run: bool
    docs_queried: bool
    last_action_type: str
    action_history: List[str]    # last 5 actions
    done: bool
    bug_description: str
    comments_count: int
```
This design prevents the policy from needing to maintain its own memory of whether it has run the linter: it can always read `obs.linter_run` directly. This significantly stabilizes training.
3. The Reward Architecture: Rubric Stack
The reward system is modular. Rather than a single reward function, a stack of Rubric objects each contribute a scalar that is summed:
```
final_reward = 0.4 × base_reward + Σ rubric_i(env, action, obs, info)
```
The result is clipped to [-1.0, 1.0] before it enters the policy update.
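A minimal sketch of that combination, assuming each Rubric is a callable with the signature in the formula:

```python
def compute_reward(base_reward, rubrics, env, action, obs, info):
    """0.4 × base reward plus the rubric stack, clipped to [-1, 1]."""
    shaped = 0.4 * base_reward + sum(r(env, action, obs, info) for r in rubrics)
    return max(-1.0, min(1.0, shaped))
```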
Rubric Breakdown
TestDeltaRubric (weight=0.3):
Rewards improvement in test score, not the absolute score: Δtest × 0.3. This prevents the agent from getting rewarded for accidentally high scores that do not result from its actions, and encourages incremental progress. The weight is halved when the action is `fix`, to prevent the agent from gaming rewards by repeatedly proposing untested fixes.
LintDeltaRubric (weight=0.3):
Same delta structure for lint, with an effective weight of 0.15 (×0.5 scaling), because lint improvement is a weaker signal than test improvement.
TerminalSuccessRubric:
A large bonus only triggered on fix actions:
- +0.4 if test score > 0.95 (near-perfect fix)
- +0.2 if test score > 0.85 (good fix)
This is the primary signal that distinguishes a successful episode from a failed one.
ToolUsageRubric (bonus=0.05):
Encourages strategic tool use. Rewards first-use of run_tests and run_linter with a 0.05 bonus, and gives a micro-bonus (+0.015) for each subsequent use. Penalizes repeated query_docs calls after the first one (-0.01), since excessive documentation querying without progress is a sign of stuck behavior. Rewards question actions in early steps (≤ 3) with +0.02, encouraging the agent to gather information before acting.
ExplorationRubric:
Analyzes the last 3 actions. If all 3 are identical, applies a penalty (-0.05). If all 3 are unique, applies a bonus (+0.021). This directly penalizes repetitive behavior and rewards diverse, exploratory strategies.
AntiHackingRubric:
Prevents the agent from short-circuiting the evidence-gathering process:
- -0.25 if `fix` is proposed without ever running tests
- -0.10 if `fix` is proposed in the first 2 steps (too fast, no evidence)
- +0.02 bonus if both tests AND linter have been run before fixing
This rubric is what prevents the degenerate policy of immediately proposing a done or fix action to collect terminal rewards without doing the work.
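A sketch of the rule set; the `info` keys mirror the `prev_*` snapshots discussed in Section 6, and the exact names are assumptions:

```python
def anti_hacking_rubric(action_type, obs, info):
    """Penalize evidence-free fixes; reward fixes backed by both tools (sketch)."""
    if action_type != "fix":
        return 0.0
    if not info["prev_tests_run"]:
        return -0.25               # fix proposed without ever running tests
    if obs.step <= 2:
        return -0.10               # fix proposed too early: no real evidence yet
    if info["prev_linter_run"]:    # tests already confirmed above
        return 0.02                # both tests AND linter consulted before fixing
    return 0.0
```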
StepPenaltyRubric (penalty=-0.01):
Applied every step. Creates pressure to solve efficiently. Without this, the agent would learn to run query_docs indefinitely, collecting small bonuses without ever fixing anything.
4. The Model and Training Pipeline
4.1 Model Selection: Qwen2.5-1.5B via Unsloth
Qwen2.5-1.5B-Instruct was chosen for three practical reasons:
- Fits in 4-bit on a T4 with room for gradient computation
- Strong instruction-following baseline (critical for structured JSON output)
- Unsloth's 2× throughput improvement makes 150 PPO iterations feasible in Colab
QLoRA configuration:
```
lora_r = 16
lora_alpha = 32
target_modules = [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
use_gradient_checkpointing = "unsloth"  # memory-efficient
```
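Wiring that configuration up with Unsloth looks roughly like the following sketch; the exact model identifier and `max_seq_length` are assumptions, not values from the notebook:

```python
from unsloth import FastLanguageModel

# Load the 4-bit base model, then attach LoRA adapters (sketch; identifiers assumed).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,          # assumption: not stated in the write-up
    load_in_4bit=True,            # QLoRA: quantized base weights
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```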
4.2 Phase 1: Supervised Warm-up
Before PPO, the model is warm-started on 100+ expert demonstrations. This is crucial: a cold LLM will generate random JSON, fail environment parsing, and never collect meaningful rewards. The demonstrations encode the optimal workflow:
`inspect → run_tests → run_linter → query_docs → fix → comment → done`
Bug Fix: Label Masking. Naive implementations compute cross-entropy loss on the entire sequence including the prompt. This teaches the model to predict its own prompt tokens, which is wasteful and slightly harmful. The correct implementation masks all prompt tokens with -100 (the PyTorch ignore_index):
```python
def _masked_labels(input_ids, prompt_len):
    labels = input_ids.clone()
    labels[0, :prompt_len] = -100
    return labels
```
Bug Fix: BPE Boundary Safety. Tokenizing the prompt and action separately and concatenating the IDs is subtly wrong: the BPE tokenizer may split tokens differently at the boundary when encoding together vs. separately. The correct approach tokenizes the full prompt + action string jointly, then measures the prompt length in the joint sequence:
```python
prompt_ids = tokenizer(prompt_chat, ...)["input_ids"]
full_ids = tokenizer(prompt_chat + action, ...).to(DEVICE)
prompt_len = min(prompt_ids.shape[1], full_ids["input_ids"].shape[1] - 1)
logits = model(**full_ids).logits
lp, ent, n = _compute_action_logprob(logits, full_ids["input_ids"], prompt_len)
```
4.3 Phase 2: PPO (150 Iterations)
The PPO implementation is token-level, operating on action token log-probabilities.
Training loop per iteration:
- Collect trajs_per_iter = 4 trajectories using the current policy
- For each trajectory, compute discounted returns with γ = 0.99
- Compute a global mean baseline for variance reduction
- For each state-action pair: compute the new log-prob, the clipped ratio, and the policy loss
Bug Fix: Log-Ratio Clamping. The standard PPO ratio exp(log π_new - log π_old) can explode to infinity if the log-probs diverge significantly. This produces NaN loss and kills training. The fix:
```python
log_ratio = torch.clamp(new_lp - old_lp_t,
                        -CFG["log_ratio_clamp"],   # -5.0
                        CFG["log_ratio_clamp"])    # +5.0
ratio = torch.exp(log_ratio)
```
This bounds the ratio to [e^-5, e^5] ≈ [0.0067, 148], which is sufficient to allow meaningful policy updates while preventing numerical instability.
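The clamped ratio then feeds the standard clipped surrogate; the clip range `eps` is not stated above, so 0.2 here is an assumption:

```python
eps = 0.2                                        # assumed PPO clip range
surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
policy_loss = -torch.min(surr1, surr2).mean()    # maximize the clipped objective
```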
Temperature annealing: The generation temperature linearly decays from 0.8 → 0.1 over the 150 iterations. Early iterations need high exploration to discover diverse strategies; later iterations should be more deterministic to commit to learned behaviors.
```
temp(t) = 0.8 + (0.1 - 0.8) × (t / 149)
```
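Or, as a small helper:

```python
def generation_temperature(t, iters=150, t0=0.8, t1=0.1):
    """Linear decay from t0 to t1 over the PPO run (the formula above)."""
    return t0 + (t1 - t0) * (t / (iters - 1))
```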
4.4 The Prompt Format
Every agent query follows a structured prompt designed to elicit JSON-only responses:
```
You are a code review agent. Convince the developer to accept your fix.
Developer personality: **defensive** (needs evidence).
Your fix function MUST be named `fix`.

Workflow:
1. `inspect`
2. `run_tests` and `run_linter`
3. `query_docs` if needed
4. AFTER you have test + lint results, propose a fix (`fix`)
5. Explain why it works (`comment`)
6. Once the developer agrees, `done`

Code:
[buggy code]

Author: [developer's last message]

Last tool output:
[tool output]

Available actions: run_tests, run_linter, inspect, query_docs, fix, comment, question, done

Respond ONLY in JSON: {"action_type": "...", "content": "..."}
```
An `IMPORTANT: Once you have test and lint results, you MUST propose a fix.` line at the end of the prompt was added to counter a failure mode discovered during initial training: the model would collect information indefinitely without ever proposing a fix, because the step penalty was too weak to overcome the comfort of tool-use bonuses.
5. Architecture Diagram
```
+----------------------------------------------------------------------+
|                        CodeReviewEnv (POMDP)                         |
|                                                                      |
|  +-----------+  inject_bug()   +---------------------------------+   |
|  |  RedTeam  | --------------> |          Episode State          |   |
|  | (25 bugs, |                 |  [Buggy Code]    [Comments[]]   |   |
|  |  5 tiers) |                 +---------------------------------+   |
|  +-----------+                                                       |
|                                                                      |
|       action                                                         |
|         |                                                            |
|         v                                                            |
|  +--------------+   +----------------+   +----------------+          |
|  |   ToolBox    |   |  PersonaAuthor |   |  EnhancedObs   |          |
|  |  run_linter  |   |   confidence   |   |    (Markov)    |          |
|  |  run_tests   |   |   belief       |   +----------------+          |
|  |  query_docs  |   |   personality  |                               |
|  +--------------+   +----------------+                               |
|         |                    |                                       |
|         +---------+----------+                                       |
|                   v                                                  |
|        +--------------------+                                        |
|        |    Rubric Stack    |                                        |
|        |  TestDeltaRubric   |                                        |
|        |  LintDeltaRubric   |                                        |
|        |  TerminalSuccess   | --> final_reward in [-1, 1]            |
|        |  ToolUsage         |                                        |
|        |  Exploration       |                                        |
|        |  AntiHacking       |                                        |
|        |  StepPenalty       |                                        |
|        +--------------------+                                        |
+----------------------------------------------------------------------+
                   | reward, obs, done
                   v
  +------------------------------------+
  |   Qwen2.5-1.5B (QLoRA, 4-bit)      |
  |                                    |
  |  Phase 1: Supervised Warm-up       |
  |   - masked CE loss (action only)   |
  |   - BPE-safe joint tokenization    |
  |                                    |
  |  Phase 2: PPO (150 iters)          |
  |   - token-level log-probs          |
  |   - log-ratio clamped at +/- 5     |
  |   - temp annealing 0.8 -> 0.1      |
  |   - global mean baseline           |
  +------------------------------------+
```
6. Why Each Design Decision Matters
Why a vector-DB for docs instead of hardcoded strings?
The agent's query_docs action uses sentence-transformers + ChromaDB. This means the agent must formulate a semantically meaningful query: it cannot hardcode "GIL threading" and always get the same result. Different phrasings retrieve different snippets, making the retrieval signal informative. It also means the system is extensible: swap in a larger knowledge base without changing any training code.
Why track prev_tests_run before mutating env flags?
The ToolUsageRubric rewards "first use" of tools. If the rubric reads env._tests_run after the step has set it to True, it can never detect first-use correctly. The environment snapshots the pre-action flags explicitly and passes them in info:
```python
prev_tests_run = self._tests_run          # before action
# ... execute action ...
info["prev_tests_run"] = prev_tests_run   # rubric uses this
```
This is a subtle but critical correctness fix.
Why separate author_response from last_tool_output in observation?
Early designs merged them. This caused the policy to conflate developer feedback with tool output: it would sometimes treat a linter warning as a developer response, or treat developer acceptance as a test result. Separating them into distinct observation fields makes the input semantically cleaner and stabilizes training significantly.
Why the AntiHackingRubric?
Without it, the optimal policy under pure sparse reward is: take `done` immediately, occasionally get lucky when the bug was injected incorrectly, and collect small positive rewards. The anti-hacking rubric makes this strategy strongly negative (-0.25 for unverified fixes), forcing the agent to actually gather evidence.
7. Results
The training produces quantifiable improvements across three evaluation checkpoints:
| Stage | Avg Reward | Success Rate | Δ vs. Baseline |
|---|---|---|---|
| Baseline (untrained) | negative | ~10% | – |
| Post-warmup | improved | ~35% | +significant |
| Final (PPO, 150 iter) | highest | ~60%+ | +large |
Per-difficulty breakdown shows the expected curriculum pattern: easy and medium bugs are solved reliably after warmup; harder and hardest bugs require the full PPO training to show improvement, and there is still room to grow.
Action distribution shifts dramatically from baseline to final:
- Baseline: random sampling across actions, frequent `skip` and `done`
- Post-warmup: an `inspect → run_tests → fix` pattern emerges
- Final: the full workflow `inspect → run_tests → run_linter → query_docs → fix → comment → done` appears with high frequency
KL divergence stays bounded (the log-ratio clamping is doing its job) and policy entropy decreases monotonically as the agent commits to a learned strategy: a healthy training signature.
8. What I Would Do With More Compute
Multi-turn PPO with author memory. Currently each episode starts with a fresh author. With a persistent author across related bugs, the agent would need to build reputation over multiple interactions β a much richer task.
Self-play bug injection. Train a secondary model to generate adversarial bugs that specifically defeat the current agent policy. Classic curriculum RL amplified by adversarial training.
Tool-augmented training at scale. Run the same pipeline with 7B or 13B parameter models, which should dramatically improve the quality of the `fix` action (the generated code itself) and enable harder concurrency bugs.

Real codebase integration. Replace the synthetic bug database with real GitHub PR diffs tagged by type. The agent would then face real variable names, real file structures, and real reviewer comments.
Multi-agent negotiation. Replace the rule-based author with a second RL agent that learns to give maximally useful pushback β turning the code review into a cooperative game between two learning agents.
9. Repository Structure
```
.
├── blog.md                          ← this file
├── yuvraj_openenv_hackathon_submission_colab_t4.ipynb
│   ├── [Cell 1]  pip install
│   ├── [Cell 2]  GPU check
│   ├── [Cell 3]  author.py + models.py + redteam.py +
│   │             tools.py + test_runner.py + rubrics.py +
│   │             environment.py + training.py (all in one)
│   ├── [Cell 4]  CFG overrides (ppo_iters=150)
│   ├── [Cell 5]  Metric capture patch
│   ├── [Cell 6]  train()
│   ├── [Cell 7]  Display saved PNGs
│   ├── [Cell 8]  Plot 1: Reward Curve
│   ├── [Cell 9]  Plot 2: Comparison Curve
│   └── [Cell 10] Plot 3: Loss Graph
├── training_summary.png             ← generated after train()
├── action_distribution.png          ← generated after train()
├── reward_curve.png                 ← generated after train()
├── comparison_curve.png             ← generated after train()
└── loss_graph.png                   ← generated after train()
```
10. Conclusion
This project demonstrates that a sub-2B parameter model can learn a complex, multi-step, tool-using, socially-aware code review workflow through a combination of:
- Carefully designed environment that grounds every reward signal in real tool outputs
- Modular rubric-based reward that shapes behavior without over-engineering a single reward function
- Evidence-driven simulated developer who provides meaningful pushback that the agent must specifically address
- Principled training pipeline with three correctness fixes (label masking, BPE-safe tokenization, log-ratio clamping) that prevent common failure modes in RL-from-language-model training
The core insight is that code review is not a retrieval problem or a generation problem β it is a negotiation problem that requires planning, evidence gathering, and adaptive communication. Reinforcement learning is the right framework for this, and a small, well-trained model with the right environment can make surprising progress.
Built for the OpenEnv Hackathon. All training runs on a free Colab T4 GPU.
HuggingFace: hackerone.com/10zxz01