Teaching an LLM to Fix Bugs Like a Senior Engineer: A Full RL + QLoRA Deep Dive
Hackathon: OpenEnv AI Hackathon
Author: Yuvraj
Model: Qwen2.5-1.5B-Instruct (4-bit QLoRA via Unsloth)
Hardware: Google Colab T4 GPU
Training Recipe: Supervised Warm-up → PPO (150 iterations)
Environment: Custom POMDP Code Review Environment
The Core Idea
Most code review tools find bugs. Mine learns to convince a stubborn human developer to accept the fix.
That distinction matters enormously. In real software teams, the bottleneck is rarely discovering a problem: it is the social and epistemic process of building enough evidence that a developer trusts the fix. A classic static analyzer can scream "null dereference on line 7" all day. The developer will still push back: "Our inputs are always sanitized." The agent in this project must respond to that pushback with tests, linter output, documentation references, and structured reasoning, and it must do all of this autonomously through a reinforcement learning loop.
This blog is a full technical walkthrough of every component: the environment design, the bug injection system, the reward architecture, the model training pipeline, and the results. No handwaving.
1. Problem Formulation: Why RL and Not Just Prompting?
The naive approach would be: give GPT-4 the buggy code and ask it to fix it. This works reasonably well for toy cases. But it breaks down in several important ways:
- No feedback loop. The model cannot iterate. It proposes a fix and walks away.
- No tool grounding. It cannot actually run tests, invoke a linter, or query a real documentation index.
- No social modeling. It does not model the developer's belief state or respond to pushback.
- No difficulty curriculum. It treats a null-check bug and a deadlock bug identically.
Reinforcement learning solves all four problems. The agent takes sequential actions in an environment, receives grounded feedback from real tools, interacts with a simulated developer whose beliefs update based on evidence, and is trained with a curriculum that progresses from easy to hardest bugs.
The mathematical framing is a Partially Observable Markov Decision Process (POMDP):
- State $S$: Full environment state including the buggy code, all tool outputs, developer belief, and step count.
- Observation $O$: What the agent actually sees: an enhanced observation with test scores, lint scores, author confidence, action history, and more. Designed to be fully Markov: everything the policy needs is exposed in the observation, with no hidden state left behind.
- Action Space $A$: {inspect, run_tests, run_linter, query_docs, fix, comment, question, done, skip}, i.e. 9 structured actions.
- Reward $R$: Dense, multi-component shaping from a rubric stack.
- Transition: Deterministic given the action, stochastic in bug sampling.
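To make the framing concrete, here is a minimal sketch of one episode, assuming a gym-style reset/step interface on CodeReviewEnv and an illustrative `policy` callable; this is not the project's exact API.

```python
# One episode under the POMDP framing (illustrative names, not the exact API).
env = CodeReviewEnv()                  # RedTeam injects a fresh bug on reset
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = policy(obs)               # one of the 9 structured actions
    obs, reward, done, info = env.step(action)
    total_reward += reward
```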
2. The Environment: CodeReviewEnv
The environment is the heart of this project. It was built from scratch rather than using an off-the-shelf environment because no existing RL environment models the negotiation aspect of code review.
2.1 The Bug Injection System: RedTeam
Every episode begins with a fresh bug. The RedTeam controller samples from a 25-bug database organized across 5 difficulty tiers:
| Tier | Example Bugs | Injection Method |
|---|---|---|
| Easy | null check removed, variable typo, wrong default value | AST transformation |
| Medium | off-by-one in loop, sign error, swapped arguments | AST transformation |
| Hard | division by zero (empty list), float precision error, abs() removed | AST transformation |
| Harder | missing threading lock, double lock acquisition, non-atomic global | Template substitution |
| Hardest | AB/BA deadlock, lock timeout missing, mutex leak, race on init | Template substitution |
The AST-level injection is the technically interesting part. Rather than string manipulation (which breaks easily), it uses Python's ast.NodeTransformer to surgically alter the parse tree. For example, the null_check injector removes an if guard node and promotes its body to the parent scope. The float_precision injector finds the first ast.Div binary operator and replaces it with ast.FloorDiv. This produces syntactically valid but semantically broken code every time, regardless of the surrounding structure.
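Concretely, a float_precision-style injector can be written in a few lines. The following is a minimal sketch of the technique, not the notebook's exact class:

```python
import ast

class FloatPrecisionInjector(ast.NodeTransformer):
    """Replace the first true division with floor division (sketch)."""
    def __init__(self):
        self.injected = False

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not self.injected and isinstance(node.op, ast.Div):
            node.op = ast.FloorDiv()   # '/' -> '//': valid syntax, broken semantics
            self.injected = True
        return node

src = "def mean(xs):\n    return sum(xs) / len(xs)\n"
tree = FloatPrecisionInjector().visit(ast.parse(src))
print(ast.unparse(ast.fix_missing_locations(tree)))
# -> def mean(xs):
#        return sum(xs) // len(xs)
```

Because the transformation operates on the parse tree, it works no matter how the surrounding function is written, which is exactly why it beats string manipulation.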
With probability noise_prob = 0.2, the injector randomly appends `# TODO: refactor later` to the buggy code, teaching the agent not to be distracted by irrelevant comments.
2.2 The Simulated Developer: PersonaAuthor
The developer is not a static string responder. It is a continuous belief system:
```
confidence(t+1) = (1 - lr) × confidence(t) + lr × evidence_score(t)
```
The evidence_score is a weighted combination of four grounded signals:
| Signal | Weight | Source |
|---|---|---|
| Test pass ratio | 0.50 | TestRunner output parsed for passed/total |
| Lint cleanliness | 0.20 | pylint error count, normalized |
| Documentation found | 0.15 | ChromaDB vector retrieval result |
| Explanation quality | 0.15 | Keyword analysis: "because", "therefore", word count |
The personality system adds three distinct acceptance thresholds:
- Defensive (threshold 0.70): Requires overwhelming evidence. Will push back on test scores, lint scores, lack of docs, and vague explanations separately.
- Junior (threshold 0.30): Accepts quickly once any reasonable argument is made.
- Collaborative (threshold 0.50): Balanced β evidence-driven but not adversarial.
A stagnation penalty discourages the agent from repeating the same action: if the evidence score does not improve by at least 0.05 across two consecutive steps, confidence is penalized by 10%. This forces the agent to diversify its strategy when stuck.
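Putting the update rule, the evidence weights, and the stagnation penalty together, a minimal sketch looks like this. The learning rate `lr` and the single-step stagnation check are simplifications (the project tracks stagnation across two consecutive steps):

```python
EVIDENCE_WEIGHTS = {"tests": 0.50, "lint": 0.20, "docs": 0.15, "explanation": 0.15}

class AuthorBelief:
    """Sketch of the PersonaAuthor belief dynamics described above."""
    def __init__(self, threshold, lr=0.3):
        self.threshold = threshold   # 0.70 defensive / 0.50 collaborative / 0.30 junior
        self.confidence = 0.0
        self._prev_evidence = 0.0
        self.lr = lr

    def update(self, signals):
        """signals: dict of the four grounded scores, each in [0, 1]."""
        evidence = sum(w * signals[k] for k, w in EVIDENCE_WEIGHTS.items())
        self.confidence = (1 - self.lr) * self.confidence + self.lr * evidence
        if evidence - self._prev_evidence < 0.05:   # stagnation penalty: -10%
            self.confidence *= 0.9
        self._prev_evidence = evidence
        return self.confidence >= self.threshold    # developer accepts the fix?
```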
The author's pushback messages are conditional on what specifically failed, teaching the agent to read and respond to targeted feedback:
- Tests < 50% → "Tests are still failing. Show a passing case."
- Lint errors > 0 → "There are N lint errors. Fix them."
- No docs → "Provide documentation or reference."
- No "because" → "Explain why this works."
2.3 The Tool Suite: ToolBox
All tools produce real outputs, not simulated strings.
Linter: Shells out to pylint in a subprocess on a temp file. Strips the rating line and returns the first 500 characters of warnings and errors. A normalized score (0 to 1) is computed by extracting the X.XX/10 rating via regex.
Test Runner: Dynamically detects the function name defined in the agent's fix code using ast.walk. Maps fine-grained bug IDs to canonical test families (null_check, off_by_one, division_by_zero, wrong_operator). Generates a test script at runtime, including fuzzing with fuzz_rounds=3 random test cases per bug family, and executes it in a subprocess. Returns a (score, output) tuple where score is passed/total.
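The function-name detection is small but load-bearing; the idea, as a sketch:

```python
import ast
from typing import Optional

def detect_function_name(fix_code: str) -> Optional[str]:
    """Return the first function defined in the agent's proposed fix."""
    try:
        tree = ast.parse(fix_code)
    except SyntaxError:
        return None                    # unparseable fix: nothing to test
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            return node.name
    return None
```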
Documentation Retrieval: Uses sentence-transformers (all-MiniLM-L6-v2) to embed the query, then queries a ChromaDB in-memory collection pre-loaded with real documentation snippets. Returns the top-3 most relevant docs with distance-ranked ordering.
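A sketch of that retrieval path; the collection name and snippet texts here are placeholders, not the project's actual knowledge base:

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()                        # in-memory, as in the project
docs = client.create_collection("python_docs")

snippets = [
    "threading.Lock: release in a finally block, or use 'with lock:'.",
    "Acquire multiple locks in a consistent global order to avoid AB/BA deadlock.",
    "sum(xs) / len(xs) raises ZeroDivisionError on an empty list; guard len(xs) == 0.",
]
docs.add(
    ids=[f"doc-{i}" for i in range(len(snippets))],
    documents=snippets,
    embeddings=embedder.encode(snippets).tolist(),
)

hits = docs.query(
    query_embeddings=embedder.encode(["deadlock from acquiring two locks"]).tolist(),
    n_results=3,
)
print(hits["documents"][0])   # top-3 snippets, nearest first
```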
2.4 The Observation Space: Fully Markov
A critical design decision was making the observation fully Markov. The EnhancedObservation dataclass exposes everything the agent needs to condition on without any hidden state:
```python
from dataclasses import dataclass
from typing import List

@dataclass
class EnhancedObservation:
    code_snippet: str            # current (possibly patched) code
    last_tool_output: str        # last tool/author response
    author_response: str         # developer's verbal feedback
    current_test_score: float    # [0, 1]
    current_lint_score: float    # [0, 1]
    negotiation_score: float     # author's final acceptance probability
    previous_test_score: float   # for delta computation
    previous_lint_score: float
    author_confidence: float     # author's internal belief
    author_threshold: float      # acceptance threshold
    step: int
    max_steps: int
    progress_ratio: float        # step / max_steps
    tests_run: bool              # first-use tracking
    linter_run: bool
    docs_queried: bool
    last_action_type: str
    action_history: List[str]    # last 5 actions
    done: bool
    bug_description: str
    comments_count: int
```
This design prevents the policy from needing to maintain its own memory of whether it has run the linter: it can always read `obs.linter_run` directly. This significantly stabilizes training.
3. The Reward Architecture: Rubric Stack
The reward system is modular. Rather than a single reward function, a stack of Rubric objects each contribute a scalar that is summed:
```
final_reward = 0.4 × base_reward + Σ rubric_i(env, action, obs, info)
```
The result is clipped to [-1.0, 1.0] before it enters the policy update.
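A minimal sketch of that combination, assuming each Rubric is a callable with the signature in the formula:

```python
def compute_reward(base_reward, rubrics, env, action, obs, info):
    """0.4 × base reward plus the rubric stack, clipped to [-1, 1]."""
    shaped = 0.4 * base_reward + sum(r(env, action, obs, info) for r in rubrics)
    return max(-1.0, min(1.0, shaped))
```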
Rubric Breakdown
TestDeltaRubric (weight=0.3):
Rewards improvement in test score, not the absolute score: Δtest × 0.3. This prevents the agent from getting rewarded for accidentally high scores that do not result from its actions, and encourages incremental progress. The weight is halved when the action is `fix`, to prevent the agent from gaming rewards by repeatedly proposing untested fixes.
LintDeltaRubric (weight=0.3):
Same delta structure for lint, with an effective weight of 0.15 (×0.5 scaling), because lint improvement is a weaker signal than test improvement.
TerminalSuccessRubric:
A large bonus only triggered on fix actions:
- +0.4 if test score > 0.95 (near-perfect fix)
- +0.2 if test score > 0.85 (good fix)
This is the primary signal that distinguishes a successful episode from a failed one.
ToolUsageRubric (bonus=0.05):
Encourages strategic tool use. Rewards first-use of run_tests and run_linter with a 0.05 bonus, and gives a micro-bonus (+0.015) for each subsequent use. Penalizes repeated query_docs calls after the first one (-0.01), since excessive documentation querying without progress is a sign of stuck behavior. Rewards question actions in early steps (≤ 3) with +0.02, encouraging the agent to gather information before acting.
ExplorationRubric:
Analyzes the last 3 actions. If all 3 are identical, applies a penalty (-0.05). If all 3 are unique, applies a bonus (+0.021). This directly penalizes repetitive behavior and rewards diverse, exploratory strategies.
AntiHackingRubric:
Prevents the agent from short-circuiting the evidence-gathering process:
- -0.25 if `fix` is proposed without ever running tests
- -0.10 if `fix` is proposed in the first 2 steps (too fast, no evidence)
- +0.02 bonus if both tests AND linter have been run before fixing
This rubric is what prevents the degenerate policy of immediately proposing a done or fix action to collect terminal rewards without doing the work.
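A sketch of the rule set; the `info` keys mirror the `prev_*` snapshots discussed in Section 6, and the exact names are assumptions:

```python
def anti_hacking_rubric(action_type, obs, info):
    """Penalize evidence-free fixes; reward fixes backed by both tools (sketch)."""
    if action_type != "fix":
        return 0.0
    if not info["prev_tests_run"]:
        return -0.25               # fix proposed without ever running tests
    if obs.step <= 2:
        return -0.10               # fix proposed too early: no real evidence yet
    if info["prev_linter_run"]:    # tests already confirmed above
        return 0.02                # both tests AND linter consulted before fixing
    return 0.0
```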
StepPenaltyRubric (penalty=-0.01):
Applied every step. Creates pressure to solve efficiently. Without this, the agent would learn to run query_docs indefinitely, collecting small bonuses without ever fixing anything.
4. The Model and Training Pipeline
4.1 Model Selection: Qwen2.5-1.5B via Unsloth
Qwen2.5-1.5B-Instruct was chosen for three practical reasons:
- Fits in 4-bit on a T4 with room for gradient computation
- Strong instruction-following baseline (critical for structured JSON output)
- Unsloth's 2× throughput improvement makes 150 PPO iterations feasible in Colab
QLoRA configuration:
```
lora_r = 16
lora_alpha = 32
target_modules = [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
use_gradient_checkpointing = "unsloth"  # memory-efficient
```
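Wiring that configuration up with Unsloth looks roughly like the following sketch; the exact model identifier and `max_seq_length` are assumptions, not values from the notebook:

```python
from unsloth import FastLanguageModel

# Load the 4-bit base model, then attach LoRA adapters (sketch; identifiers assumed).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,          # assumption: not stated in the write-up
    load_in_4bit=True,            # QLoRA: quantized base weights
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```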
4.2 Phase 1: Supervised Warm-up
Before PPO, the model is warm-started on 100+ expert demonstrations. This is crucial: a cold LLM will generate random JSON, fail environment parsing, and never collect meaningful rewards. The demonstrations encode the optimal workflow:
`inspect → run_tests → run_linter → query_docs → fix → comment → done`
Bug Fix: Label Masking. Naive implementations compute cross-entropy loss on the entire sequence including the prompt. This teaches the model to predict its own prompt tokens, which is wasteful and slightly harmful. The correct implementation masks all prompt tokens with -100 (the PyTorch ignore_index):
```python
def _masked_labels(input_ids, prompt_len):
    labels = input_ids.clone()
    labels[0, :prompt_len] = -100
    return labels
```
Bug Fix: BPE Boundary Safety. Tokenizing the prompt and action separately and concatenating the IDs is subtly wrong: the BPE tokenizer may split tokens differently at the boundary when encoding together vs. separately. The correct approach tokenizes the full prompt + action string jointly, then measures the prompt length in the joint sequence:
```python
prompt_ids = tokenizer(prompt_chat, ...)["input_ids"]
full_ids = tokenizer(prompt_chat + action, ...).to(DEVICE)
prompt_len = min(prompt_ids.shape[1], full_ids["input_ids"].shape[1] - 1)
logits = model(**full_ids).logits
lp, ent, n = _compute_action_logprob(logits, full_ids["input_ids"], prompt_len)
```
4.3 Phase 2: PPO (150 Iterations)
The PPO implementation is token-level, operating on action token log-probabilities.
Training loop per iteration:
- Collect trajs_per_iter = 4 trajectories using the current policy
- For each trajectory, compute discounted returns with γ = 0.99
- Compute a global mean baseline for variance reduction
- For each state-action pair: compute the new log-prob, the clipped ratio, and the policy loss
Bug Fix: Log-Ratio Clamping. The standard PPO ratio exp(log π_new - log π_old) can explode to infinity if the log-probs diverge significantly. This produces NaN loss and kills training. The fix:
```python
log_ratio = torch.clamp(new_lp - old_lp_t,
                        -CFG["log_ratio_clamp"],   # -5.0
                        CFG["log_ratio_clamp"])    # +5.0
ratio = torch.exp(log_ratio)
```
This bounds the ratio to [e^-5, e^5] ≈ [0.0067, 148], which is sufficient to allow meaningful policy updates while preventing numerical instability.
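The clamped ratio then feeds the standard clipped surrogate; the clip range `eps` is not stated above, so 0.2 here is an assumption:

```python
eps = 0.2                                        # assumed PPO clip range
surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
policy_loss = -torch.min(surr1, surr2).mean()    # maximize the clipped objective
```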
Temperature annealing: The generation temperature linearly decays from 0.8 → 0.1 over the 150 iterations. Early iterations need high exploration to discover diverse strategies; later iterations should be more deterministic to commit to learned behaviors.
```
temp(t) = 0.8 + (0.1 - 0.8) × (t / 149)
```
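Or, as a small helper:

```python
def generation_temperature(t, iters=150, t0=0.8, t1=0.1):
    """Linear decay from t0 to t1 over the PPO run (the formula above)."""
    return t0 + (t1 - t0) * (t / (iters - 1))
```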
4.4 The Prompt Format
Every agent query follows a structured prompt designed to elicit JSON-only responses:
```
You are a code review agent. Convince the developer to accept your fix.
Developer personality: **defensive** (needs evidence).
Your fix function MUST be named `fix`.

Workflow:
1. `inspect`
2. `run_tests` and `run_linter`
3. `query_docs` if needed
4. AFTER you have test + lint results, propose a fix (`fix`)
5. Explain why it works (`comment`)
6. Once the developer agrees, `done`

Code:
[buggy code]

Author: [developer's last message]

Last tool output:
[tool output]

Available actions: run_tests, run_linter, inspect, query_docs, fix, comment, question, done

Respond ONLY in JSON: {"action_type": "...", "content": "..."}
```
An `IMPORTANT: Once you have test and lint results, you MUST propose a fix.` line at the end of the prompt was added to counter a failure mode discovered during initial training: the model would collect information indefinitely without ever proposing a fix, because the step penalty was too weak to overcome the comfort of tool-use bonuses.
5. Architecture Diagram
```
+----------------------------------------------------------------------+
|                        CodeReviewEnv (POMDP)                         |
|                                                                      |
|  +-----------+  inject_bug()   +---------------------------------+   |
|  |  RedTeam  | --------------> |          Episode State          |   |
|  | (25 bugs, |                 |  [Buggy Code]    [Comments[]]   |   |
|  |  5 tiers) |                 +---------------------------------+   |
|  +-----------+                                                       |
|                                                                      |
|       action                                                         |
|         |                                                            |
|         v                                                            |
|  +--------------+   +----------------+   +----------------+          |
|  |   ToolBox    |   |  PersonaAuthor |   |  EnhancedObs   |          |
|  |  run_linter  |   |   confidence   |   |    (Markov)    |          |
|  |  run_tests   |   |   belief       |   +----------------+          |
|  |  query_docs  |   |   personality  |                               |
|  +--------------+   +----------------+                               |
|         |                    |                                       |
|         +---------+----------+                                       |
|                   v                                                  |
|        +--------------------+                                        |
|        |    Rubric Stack    |                                        |
|        |  TestDeltaRubric   |                                        |
|        |  LintDeltaRubric   |                                        |
|        |  TerminalSuccess   | --> final_reward in [-1, 1]            |
|        |  ToolUsage         |                                        |
|        |  Exploration       |                                        |
|        |  AntiHacking       |                                        |
|        |  StepPenalty       |                                        |
|        +--------------------+                                        |
+----------------------------------------------------------------------+
                   | reward, obs, done
                   v
  +------------------------------------+
  |   Qwen2.5-1.5B (QLoRA, 4-bit)      |
  |                                    |
  |  Phase 1: Supervised Warm-up       |
  |   - masked CE loss (action only)   |
  |   - BPE-safe joint tokenization    |
  |                                    |
  |  Phase 2: PPO (150 iters)          |
  |   - token-level log-probs          |
  |   - log-ratio clamped at +/- 5     |
  |   - temp annealing 0.8 -> 0.1      |
  |   - global mean baseline           |
  +------------------------------------+
```
6. Why Each Design Decision Matters
Why a vector-DB for docs instead of hardcoded strings?
The agent's query_docs action uses sentence-transformers + ChromaDB. This means the agent must formulate a semantically meaningful query: it cannot hardcode "GIL threading" and always get the same result. Different phrasings retrieve different snippets, making the retrieval signal informative. It also means the system is extensible: swap in a larger knowledge base without changing any training code.
Why track prev_tests_run before mutating env flags?
The ToolUsageRubric rewards "first use" of tools. If the rubric reads env._tests_run after the step has set it to True, it can never detect first-use correctly. The environment snapshots the pre-action flags explicitly and passes them in info:
```python
prev_tests_run = self._tests_run          # before action
# ... execute action ...
info["prev_tests_run"] = prev_tests_run   # rubric uses this
```
This is a subtle but critical correctness fix.
Why separate author_response from last_tool_output in observation?
Early designs merged them. This caused the policy to conflate developer feedback with tool output: it would sometimes treat a linter warning as a developer response, or treat developer acceptance as a test result. Separating them into distinct observation fields makes the input semantically cleaner and stabilizes training significantly.
Why the AntiHackingRubric?
Without it, the optimal policy under pure sparse reward is: take `done` immediately, occasionally get lucky when the bug was injected incorrectly, and collect small positive rewards. The anti-hacking rubric makes this strategy strongly negative (-0.25 for unverified fixes), forcing the agent to actually gather evidence.
7. Results
The training produces quantifiable improvements across three evaluation checkpoints:
| Stage | Avg Reward | Success Rate | Δ vs. Baseline |
|---|---|---|---|
| Baseline (untrained) | negative | ~10% | – |
| Post-warmup | improved | ~35% | +significant |
| Final (PPO, 150 iter) | highest | ~60%+ | +large |
Per-difficulty breakdown shows the expected curriculum pattern: easy and medium bugs are solved reliably after warmup; harder and hardest bugs require the full PPO training to show improvement, and there is still room to grow.
Action distribution shifts dramatically from baseline to final:
- Baseline: random sampling across actions, frequent `skip` and `done`
- Post-warmup: an `inspect → run_tests → fix` pattern emerges
- Final: the full workflow `inspect → run_tests → run_linter → query_docs → fix → comment → done` appears with high frequency
KL divergence stays bounded (the log-ratio clamping is doing its job) and policy entropy decreases monotonically as the agent commits to a learned strategy: a healthy training signature.
8. What I Would Do With More Compute
Multi-turn PPO with author memory. Currently each episode starts with a fresh author. With a persistent author across related bugs, the agent would need to build reputation over multiple interactions β a much richer task.
Self-play bug injection. Train a secondary model to generate adversarial bugs that specifically defeat the current agent policy. Classic curriculum RL amplified by adversarial training.
Tool-augmented training at scale. Run the same pipeline with 7B or 13B parameter models, which should dramatically improve the quality of the `fix` action (the generated code itself) and enable harder concurrency bugs.

Real codebase integration. Replace the synthetic bug database with real GitHub PR diffs tagged by type. The agent would then face real variable names, real file structures, and real reviewer comments.
Multi-agent negotiation. Replace the rule-based author with a second RL agent that learns to give maximally useful pushback β turning the code review into a cooperative game between two learning agents.
9. Repository Structure
```
.
├── blog.md                          ← this file
├── yuvraj_openenv_hackathon_submission_colab_t4.ipynb
│   ├── [Cell 1]  pip install
│   ├── [Cell 2]  GPU check
│   ├── [Cell 3]  author.py + models.py + redteam.py +
│   │             tools.py + test_runner.py + rubrics.py +
│   │             environment.py + training.py (all in one)
│   ├── [Cell 4]  CFG overrides (ppo_iters=150)
│   ├── [Cell 5]  Metric capture patch
│   ├── [Cell 6]  train()
│   ├── [Cell 7]  Display saved PNGs
│   ├── [Cell 8]  Plot 1: Reward Curve
│   ├── [Cell 9]  Plot 2: Comparison Curve
│   └── [Cell 10] Plot 3: Loss Graph
├── training_summary.png             ← generated after train()
├── action_distribution.png          ← generated after train()
├── reward_curve.png                 ← generated after train()
├── comparison_curve.png             ← generated after train()
└── loss_graph.png                   ← generated after train()
```
10. Conclusion
This project demonstrates that a sub-2B parameter model can learn a complex, multi-step, tool-using, socially-aware code review workflow through a combination of:
- Carefully designed environment that grounds every reward signal in real tool outputs
- Modular rubric-based reward that shapes behavior without over-engineering a single reward function
- Evidence-driven simulated developer who provides meaningful pushback that the agent must specifically address
- Principled training pipeline with three correctness fixes (label masking, BPE-safe tokenization, log-ratio clamping) that prevent common failure modes in RL-from-language-model training
The core insight is that code review is not a retrieval problem or a generation problem β it is a negotiation problem that requires planning, evidence gathering, and adaptive communication. Reinforcement learning is the right framework for this, and a small, well-trained model with the right environment can make surprising progress.
Built for the OpenEnv Hackathon. All training runs on a free Colab T4 GPU.
HuggingFace: hackerone.com/10zxz01