---
title: verifiable-rl-coder
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.56.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: GRPO-trained Qwen-1.5B coder with sandboxed test execution
tags:
  - code-generation
  - reinforcement-learning
  - grpo
  - verifiable-rewards
  - lora
  - qwen
  - educational
  - reproducible-research
---

# verifiable-rl-coder

> **Live, side-by-side comparison of a base coding LLM, an SFT fine-tune, and
> a GRPO-trained model — with the full sandboxed test-execution pipeline
> running in the browser.**

This Space is the interactive front-end to a complete open implementation of
the **verifiable-reward RL post-training** technique behind DeepSeek-R1, the
OpenAI o-series, and Kimi-K1.5 — applied to a small open coding model
(Qwen-2.5-Coder-1.5B). Everything is open: weights, training code, evaluation
harness, and the multi-week debugging log of what actually broke and how it
got fixed.

## Try it

1. Pick **Compare (side-by-side)** in the sidebar.
2. Choose **Base + SFT** (and **GRPO** once that's available).
3. Use a pre-filled example or write your own coding task + assert tests.
4. Click **Generate + run tests** — watch each model produce a solution and
   run it against your tests in a sandboxed Python subprocess.
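
The test run in step 4 can be sketched roughly as follows. This is a minimal illustration, not the Space's actual runner: the function name `run_tests` and the file name `candidate.py` are hypothetical, and a plain subprocess with a timeout and a scratch directory is not a real sandbox (the repo's runner adds seccomp and resource limits on top).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(solution: str, tests: str, timeout_s: float = 5.0) -> bool:
    """Return True if the solution plus its assert tests exit cleanly in time."""
    with tempfile.TemporaryDirectory() as workdir:
        # Write candidate code and assert tests into one throwaway script.
        script = Path(workdir) / "candidate.py"  # hypothetical file name
        script.write_text(solution + "\n\n" + tests + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # isolated scratch directory
                capture_output=True,  # keep model output off our stdout
                timeout=timeout_s,    # child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return False
        # A failed assert raises AssertionError, giving a non-zero exit code.
        return proc.returncode == 0
```

Calling `run_tests("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` returns `True`; a solution that fails an assert or loops forever returns `False`.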

A few prompts that cleanly differentiate the models:

- **Roman numeral conversion** — base often forgets the subtractive-notation
  pairs (IV, IX, XL, XC, CD, CM); SFT learned them
- **Closest-to-zero with tie-breaking** — both fail, but in qualitatively
  different ways (base writes structurally invalid code; SFT writes the
  correct algorithm with one inverted comparison)
- **Array rotation with k > len** — both miss the modulo; this is exactly
  the kind of edge-case GRPO is designed to catch via test-execution feedback

Full prompt gallery + reproducible recipes:
[DEMO_EXAMPLES.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/DEMO_EXAMPLES.md).

## What this demonstrates

| Concept | How the Space shows it |
|---|---|
| **Verifiable rewards** | Every generated solution is parsed, executed, and scored by real test runs — visible to the user, not abstracted |
| **The SFT → GRPO progression** | Three models in one UI; you see what each stage of post-training adds |
| **Reward hacking is real** | Some prompts produce code that "looks right" but fails edge cases — the live sandbox catches it on the spot |
| **Small models can be improved** | LoRA-rank-16 SFT on 319 prompts gives +1.1 pts HumanEval+ pass@1; GRPO targets the remaining shared blind spots |

## Technical approach

```
                ┌──────────────────────────────────┐
                │      Streamlit UI (this Space)   │
                │  prompt + tests → patches + runs │
                └──────────────┬───────────────────┘
                               │
                ┌──────────────▼───────────────────┐
                │           Proposer               │
                │  Qwen-1.5B / +LoRA-SFT / +GRPO   │
                └──────────────┬───────────────────┘
                               │
                ┌──────────────▼───────────────────┐
                │           Verifier               │
                │  subprocess pytest in sandbox    │
                │  (5s timeout, isolated workdir)  │
                └──────────────┬───────────────────┘
                               │ pass/fail + composite reward
                               ▼
   Offline (training, not in this Space):
   GRPO rollout buffer → reward → group-relative advantage → LoRA update
```
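
The "group-relative advantage" step in the offline path can be sketched as below. This is a simplified illustration of the usual GRPO normalization (reward minus group mean, divided by group standard deviation), not the trainer's code; the function name and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt's rollout group to zero mean, unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some trainers use sample std
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: two pass the tests, two fail.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every rollout in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.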

**Composite reward**: `R = 1.0·correctness + 0.05·lint + 0.05·runtime + 0.01·length`.
Correctness comes from real test execution and is binary (0 for fail, 1 for
pass). It outweighs each auxiliary signal by at least 20× (20× for lint and
runtime, 100× for length). Assuming each auxiliary term is normalized to
`[0, 1]`, the auxiliaries sum to at most 0.11, so short-but-wrong code or a
lint-clean stub can never outscore a passing solution. Full breakdown:
[REWARD_DESIGN.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/REWARD_DESIGN.md).
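
A minimal sketch of the formula, using the weights stated above. The function name and the assumption that every component is already normalized to `[0, 1]` are illustrative; see REWARD_DESIGN.md for how each term is actually computed.

```python
# Weights taken from the formula above.
WEIGHTS = {"correctness": 1.0, "lint": 0.05, "runtime": 0.05, "length": 0.01}


def composite_reward(correctness: float, lint: float, runtime: float, length: float) -> float:
    """Weighted sum of reward components, each assumed to lie in [0, 1]."""
    return (
        WEIGHTS["correctness"] * correctness
        + WEIGHTS["lint"] * lint
        + WEIGHTS["runtime"] * runtime
        + WEIGHTS["length"] * length
    )


# Best possible score for a wrong solution vs. worst possible for a right one.
best_wrong = composite_reward(0.0, 1.0, 1.0, 1.0)
worst_right = composite_reward(1.0, 0.0, 0.0, 0.0)
assert best_wrong < worst_right
```

Under these assumptions a wrong solution tops out at about 0.11, while any passing solution scores at least 1.0.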

**Training data**: 319 MBPP-train prompts, contamination-filtered against the
MBPP+ test set; 2,580 rejection-sampled solutions were kept after sandboxed
test execution.
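
Rejection sampling of this kind can be sketched as below. The callables `generate` and `passes_tests`, the record layout, and the default of 16 samples per prompt are all hypothetical placeholders, not the repo's actual interface or settings.

```python
from typing import Callable


def rejection_sample(
    prompts: list[str],
    generate: Callable[[str], str],            # hypothetical: prompt -> candidate solution
    passes_tests: Callable[[str, str], bool],  # hypothetical: (prompt, solution) -> verdict
    samples_per_prompt: int = 16,              # illustrative value only
) -> list[dict]:
    """Sample several solutions per prompt and keep only those that pass the tests."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            solution = generate(prompt)
            if passes_tests(prompt, solution):
                kept.append({"prompt": prompt, "completion": solution})
    return kept
```

The kept records then serve as supervised fine-tuning pairs, so the SFT model only ever imitates solutions that were verified by execution.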

**KL configuration**: DeepSeek-R1 style (KL added to the loss, not the reward),
`kl_loss_coef = 0.04`, `kl_loss_type = low_var_kl`. The coefficient is larger
than R1's 0.001 default, a deliberately tighter constraint on drift from the
reference policy given the small training set.
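
For readers new to the term, a low-variance KL penalty is commonly the "k3" estimator sketched below. This is a per-token illustration in plain Python under that assumption; the function names and the clipping bound are illustrative, and the trainer's own implementation should be treated as the reference.

```python
import math


def low_var_kl(logprob: float, ref_logprob: float, clip: float = 10.0) -> float:
    """k3 estimate of KL(policy || reference) for one sampled token."""
    log_ratio = ref_logprob - logprob
    # exp(x) - x - 1 is non-negative and zero only when both models agree.
    kl = math.exp(log_ratio) - log_ratio - 1.0
    return min(kl, clip)  # assumption: clip large values for numerical stability


def loss_with_kl(policy_loss: float, token_kls: list[float], kl_loss_coef: float = 0.04) -> float:
    """Add the mean per-token KL penalty to the policy loss (KL in loss, not in reward)."""
    return policy_loss + kl_loss_coef * sum(token_kls) / len(token_kls)
```

With `kl_loss_coef = 0.04`, a mean per-token KL of 0.5 adds 0.02 to the loss.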

## Results (current snapshot)

Evaluated with [evalplus](https://github.com/evalplus/evalplus) at
temperature 0.2, n=5 samples per task.

| Model | HumanEval+ pass@1 | HumanEval+ pass@5 |
|---|---|---|
| Base Qwen-2.5-Coder-1.5B | 0.6268 | 0.7073 |
| LoRA SFT (this work) | **0.6378** | 0.6951 |
| GRPO (training in progress) | TBD | TBD |
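
For reference, pass@k from n samples per task is usually computed with the unbiased estimator sketched below, where `c` is the number of samples that pass. This is a standalone illustration, not evalplus's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct, given c correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws with failures only
    return 1.0 - comb(n - c, k) / comb(n, k)


# One task with n=5 samples, 2 of which pass.
assert abs(pass_at_k(5, 2, 1) - 0.4) < 1e-12
assert pass_at_k(5, 2, 5) == 1.0
```

The benchmark score is the mean of this value over all tasks. With n=5, pass@5 for a task is 1 if any sample passes and 0 otherwise.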

The SFT delta is small: +1.1 pts on pass@1, below the ~3.8 pt noise floor for
n=164 tasks, while pass@5 dips by 1.2 pts (0.7073 to 0.6951). This is
documented in the model card. The qualitative analysis in DEMO_EXAMPLES.md
shows where the gain comes from: targeted improvements on problems requiring
specific structured-knowledge patterns (Roman numeral subtractive notation,
edge-case-aware list operations), with non-destructive behavior on the ~70%
of problems where base was already correct.
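
One way to arrive at a noise floor of this size is the binomial standard error of a pass rate over 164 tasks. The sketch below is an assumption about how such a figure can be derived, not a statement of how the model card computes it.

```python
from math import sqrt


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent tasks."""
    return sqrt(p * (1.0 - p) / n)


se = binomial_standard_error(0.6268, 164)  # base pass@1, HumanEval+ task count
delta = 0.6378 - 0.6268                    # SFT minus base pass@1
print(f"standard error ~ {100 * se:.1f} pts, SFT delta = {100 * delta:.1f} pts")
# prints: standard error ~ 3.8 pts, SFT delta = 1.1 pts
```

Since the delta is well under one standard error, the pass@1 numbers alone cannot separate SFT from base.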

## Open artifacts

- **SFT model**: [dmaheshwar22/qwen-1.5b-coder-sft-v1](https://huggingface.co/dmaheshwar22/qwen-1.5b-coder-sft-v1)
  (LoRA adapter, 17 MB, full model card with training data + hyperparameters)
- **GRPO model**: [dmaheshwar22/qwen-1.5b-coder-grpo-v1](https://huggingface.co/dmaheshwar22/qwen-1.5b-coder-grpo-v1)
  (will go live once 200-step training completes)
- **Source code**: [github.com/Devesh-Maheshwari/verifiable-rl-coder](https://github.com/Devesh-Maheshwari/verifiable-rl-coder)
- **Training infrastructure**: HTCondor submit scripts, verl + vLLM training
  config, sandbox runner with seccomp / resource limits — all in the repo
- **Debugging log**: [grpo-chtc-debugging-log.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/grpo-chtc-debugging-log.md)
  documents the 12 distinct failure modes hit while getting verl + CHTC + vLLM
  + Hydra to actually run a training step. Most ML projects hide this; we
  publish it because it's the most useful artifact for someone trying to
  reproduce the work.

## Why ZeroGPU would meaningfully improve this Space

Running on CPU basic, generating a single 512-token response takes 30–60
seconds. The side-by-side compare mode triggers two such generations
sequentially, so a recruiter or researcher exploring the demo waits 60–120
seconds (~90 on average) per click. That latency throws away the demo's
actual value: you can't *feel* the model differences when each comparison
takes a minute or more.

ZeroGPU would change this from "leave a tab open and check back" to
"interactive exploration." A T4 / A10 with vLLM does 1.5B inference at
~50 tokens/sec, so a 100–200 token solution lands in 2–4 seconds and even a
full 512-token response takes about 10. The user can run the full
DEMO_EXAMPLES.md gallery in 5 minutes instead of 45.

This particularly matters for the **comparison-driven** nature of this work.
The whole pitch is "see how SFT and GRPO change behavior on the same prompt"
— that observation is qualitative and requires multiple side-by-side runs.
Slow inference makes it impractical at any scale.

## Limitations (honest)

- **1.5B parameters** — competent on isolated functions, weak on multi-file
  repositories or large-context reasoning. Don't expect SWE-bench wins.
- **319-prompt training set** — small; gains are bounded; we surface this
  explicitly in REWARD_DESIGN.md rather than oversell.
- **MBPP-shape distribution** — model is best on problems matching its
  training distribution (algorithmic Python functions with assert tests).
  Less reliable for systems code, async, or competitive-programming-heavy
  problems.
- **Inherits Qwen-2.5-Coder base properties** — including any biases or
  safety properties of the upstream model.
- **CPU inference is slow** — see "Why ZeroGPU" above.

## Educational value

This Space + the connected GitHub repo + the model card together form a
**complete reference implementation** of small-scale verifiable-reward RL
post-training. Specifically useful for:

- Researchers / students who want to **read the full pipeline** end-to-end
  without paywalls or proprietary internals
- Engineers studying **how reward hacking is prevented** mechanically
  (weight ratios in composite reward, KL configuration, length monitoring)
- Anyone investigating **why small fine-tunes plateau** and what GRPO is
  designed to fix beyond imitation learning

The connected docs (REWARD_DESIGN.md, DEMO_EXAMPLES.md, the debugging log,
ablation tables in EVAL_RESULTS.md) are written specifically to be readable
without prior frontier-RL context. The composite reward formula, the
DeepSeek-R1 KL configuration choice, and the sampling-temperature observation
on the base/SFT comparison are all explained from first principles.

## Acknowledgements

Built on:
- [Qwen-2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) — base model
- [verl](https://github.com/volcengine/verl) — production GRPO trainer
- [vLLM](https://github.com/vllm-project/vllm) — fast rollout engine
- [TRL](https://github.com/huggingface/trl) + [PEFT](https://github.com/huggingface/peft) — SFT + LoRA
- [evalplus](https://github.com/evalplus/evalplus) — robust HumanEval+/MBPP+ evaluation
- [DeepSeek-R1](https://arxiv.org/abs/2501.12948) — methodology reference

Trained on the [UW-Madison CHTC](https://chtc.cs.wisc.edu/) cluster.

---

**Citation**

```bibtex
@misc{maheshwari2026verifiable,
  title  = {verifiable-rl-coder: GRPO post-training of small coding LLMs with sandboxed test-execution rewards},
  author = {Maheshwari, Devesh},
  year   = {2026},
  url    = {https://github.com/Devesh-Maheshwari/verifiable-rl-coder}
}
```