---
title: verifiable-rl-coder
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.56.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: GRPO-trained Qwen-1.5B coder with sandboxed test execution
tags:
  - code-generation
  - reinforcement-learning
  - grpo
  - verifiable-rewards
  - lora
  - qwen
  - educational
  - reproducible-research
---

# verifiable-rl-coder

> **Live, side-by-side comparison of a base coding LLM, an SFT fine-tune, and
> a GRPO-trained model, with the full sandboxed test-execution pipeline
> running in the browser.**

This Space is the interactive front-end to a complete open implementation of the **verifiable-reward RL post-training** technique behind DeepSeek-R1, the OpenAI o-series, and Kimi-K1.5, applied to a small open coding model (Qwen-2.5-Coder-1.5B). Everything is open: weights, training code, evaluation harness, and the multi-week debugging log of what actually broke and how it got fixed.

## Try it

1. Pick **Compare (side-by-side)** in the sidebar.
2. Choose **Base + SFT** (and **GRPO** once that's available).
3. Use a pre-filled example or write your own coding task + assert tests.
4. Click **Generate + run tests** and watch each model produce a solution and run it against your tests in a sandboxed Python subprocess.

A few prompts that cleanly differentiate the models:

- **Roman numeral conversion**: base often forgets the subtractive-notation pairs (IV, IX, XL, XC, CD, CM); SFT learned them
- **Closest-to-zero with tie-breaking**: both fail, but in qualitatively different ways (base writes structurally invalid code; SFT writes the correct algorithm with one inverted comparison)
- **Array rotation with k > len**: both miss the modulo; this is exactly the kind of edge case GRPO is designed to catch via test-execution feedback

Full prompt gallery + reproducible recipes: [DEMO_EXAMPLES.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/DEMO_EXAMPLES.md).

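The "Generate + run tests" step can be pictured with a minimal sketch like the one below. It is an illustration under stated assumptions, not the Space's actual runner: `run_candidate` and its parameters are hypothetical names, and isolation here is only a temporary directory, a separate interpreter process, and a timeout. The example reuses the array-rotation prompt, where omitting the modulo fails once `k > len`.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(solution: str, tests: str, timeout_s: float = 5.0) -> bool:
    """Return True if the assert-style `tests` pass against `solution`.

    Hypothetical helper for illustration only; a real sandbox would add
    stronger isolation and resource limits.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failed run
    # A failing assert raises AssertionError, so the exit code is non-zero.
    return proc.returncode == 0


TESTS = (
    "assert rotate([1, 2, 3], 1) == [3, 1, 2]\n"
    "assert rotate([1, 2, 3], 5) == [2, 3, 1]\n"
)

WITH_MODULO = (
    "def rotate(nums, k):\n"
    "    if not nums:\n"
    "        return nums\n"
    "    k %= len(nums)\n"
    "    return nums[-k:] + nums[:-k]\n"
)

WITHOUT_MODULO = (
    "def rotate(nums, k):\n"
    "    return nums[-k:] + nums[:-k]\n"
)

print(run_candidate(WITH_MODULO, TESTS))
print(run_candidate(WITHOUT_MODULO, TESTS))
```

The second candidate passes the `k = 1` case and only fails on `k = 5`, which is why an edge-case test has to be executed rather than the code merely inspected.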
## What this demonstrates

| Concept | How the Space shows it |
|---|---|
| **Verifiable rewards** | Every generated solution is parsed, executed, and scored by real test runs, visible to the user rather than abstracted away |
| **The SFT → GRPO progression** | Three models in one UI; you see what each stage of post-training adds |
| **Reward hacking is real** | Some prompts produce code that "looks right" but fails edge cases; the live sandbox catches it on the spot |
| **Small models can be improved** | LoRA-rank-16 SFT on 319 prompts gives +1.1 pts HumanEval+ pass@1; GRPO targets the remaining shared blind spots |

## Technical approach

```
┌──────────────────────────────────┐
│  Streamlit UI (this Space)       │
│  prompt + tests → patches + runs │
└──────────────┬───────────────────┘
               │
┌──────────────▼───────────────────┐
│  Proposer                        │
│  Qwen-1.5B / +LoRA-SFT / +GRPO   │
└──────────────┬───────────────────┘
               │
┌──────────────▼───────────────────┐
│  Verifier                        │
│  subprocess pytest in sandbox    │
│  (5s timeout, isolated workdir)  │
└──────────────┬───────────────────┘
               │ pass/fail + composite reward
               ▼

Offline (training, not in this Space):
  GRPO rollout buffer → reward → group-relative advantage → LoRA update
```

**Composite reward**: `R = 1.0·correctness + 0.05·lint + 0.05·runtime + 0.01·length`. Correctness uses real test execution (binary pass/fail, scored `0` or `1`). Correctness outweighs each auxiliary signal by at least 20× (20× for lint and runtime, 100× for length), which mechanically prevents reward hacking via short-but-wrong code or lint-clean stubs.
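
To make the weighting concrete, here is a minimal sketch of the composite reward. It assumes, purely for illustration, that every component score is already normalized to the range 0 to 1; the function and argument names are hypothetical and are not taken from the repo.

```python
# Weights copied from the formula above.
WEIGHTS = {"correctness": 1.0, "lint": 0.05, "runtime": 0.05, "length": 0.01}


def composite_reward(correctness: float, lint: float,
                     runtime: float, length: float) -> float:
    """Weighted sum of component scores (each assumed to lie in [0, 1])."""
    scores = {
        "correctness": correctness,
        "lint": lint,
        "runtime": runtime,
        "length": length,
    }
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# A wrong solution with perfect auxiliary scores tops out near 0.11 ...
wrong_but_polished = composite_reward(0.0, 1.0, 1.0, 1.0)
# ... while a passing solution with zero auxiliary scores still earns 1.0.
right_but_sloppy = composite_reward(1.0, 0.0, 0.0, 0.0)
print(wrong_but_polished < right_but_sloppy)
```

Under that normalization assumption, no combination of auxiliary scores can lift a failing solution above a passing one, which is the property the weight ratio is meant to guarantee.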
Full breakdown: [REWARD_DESIGN.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/REWARD_DESIGN.md).

**Training data**: 319 MBPP-train prompts, contamination-filtered against the MBPP+ test set; 2,580 rejection-sampled solutions kept after sandboxed test execution.

**KL configuration**: DeepSeek-R1 style (KL added to loss, not reward), `kl_loss_coef = 0.04`, `kl_loss_type = low_var_kl`. Tighter than R1's 0.001 default as a defensive choice given the small training set.

## Results (current snapshot)

Evaluated with [evalplus](https://github.com/evalplus/evalplus) at temperature 0.2, n=5 samples per task.

| Model | HumanEval+ pass@1 | HumanEval+ pass@5 |
|---|---|---|
| Base Qwen-2.5-Coder-1.5B | 0.6268 | 0.7073 |
| LoRA SFT (this work) | **0.6378** | 0.6951 |
| GRPO (training in progress) | TBD | TBD |

The SFT delta (+1.1 pts pass@1) is statistically modest: it sits inside the ~3.8 pt noise floor for n=164 tasks, and this is documented in the model card. The qualitative analysis in DEMO_EXAMPLES.md shows where the gain comes from: targeted improvements on problems requiring specific structured-knowledge patterns (Roman numeral subtractive notation, edge-case-aware list operations), with non-destructive behavior on the ~70% of problems where base was already correct.

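The ~3.8 pt figure is consistent with a simple binomial standard error over the 164 HumanEval+ tasks. The sketch below is a back-of-the-envelope check, assuming each task is an independent pass/fail trial (an approximation, since pass@1 here is averaged over 5 samples per task); `binomial_standard_error` is an illustrative name, not something from the evaluation harness.

```python
import math


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a pass rate `p` estimated from `n` independent tasks."""
    return math.sqrt(p * (1.0 - p) / n)


N_TASKS = 164                      # HumanEval+ problem count
base_pass1, sft_pass1 = 0.6268, 0.6378

delta = sft_pass1 - base_pass1
se = binomial_standard_error(base_pass1, N_TASKS)

print(f"delta = {100 * delta:.1f} pts, standard error = {100 * se:.1f} pts")
# delta = 1.1 pts, standard error = 3.8 pts
```

Since the observed delta is well under one standard error, the pass@1 numbers alone cannot separate the two models.
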
## Open artifacts

- **SFT model**: [dmaheshwar22/qwen-1.5b-coder-sft-v1](https://huggingface.co/dmaheshwar22/qwen-1.5b-coder-sft-v1) (LoRA adapter, 17 MB, full model card with training data + hyperparameters)
- **GRPO model**: [dmaheshwar22/qwen-1.5b-coder-grpo-v1](https://huggingface.co/dmaheshwar22/qwen-1.5b-coder-grpo-v1) (will go live once 200-step training completes)
- **Source code**: [github.com/Devesh-Maheshwari/verifiable-rl-coder](https://github.com/Devesh-Maheshwari/verifiable-rl-coder)
- **Training infrastructure**: HTCondor submit scripts, verl + vLLM training config, sandbox runner with seccomp / resource limits; all in the repo
- **Debugging log**: [grpo-chtc-debugging-log.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/grpo-chtc-debugging-log.md) documents the 12 distinct failure modes hit while getting verl + CHTC + vLLM + Hydra to actually run a training step. Most ML projects hide this; we publish it because it's the most useful artifact for someone trying to reproduce the work.

## Why ZeroGPU would meaningfully improve this Space

Running on CPU basic, generating a single 512-token response takes 30–60 seconds. The side-by-side compare mode triggers two such generations sequentially, so a recruiter or researcher exploring the demo waits ~90 seconds per click. That latency throws away the demo's actual value: you can't *feel* the model differences when each comparison takes a minute or more.

ZeroGPU would change this from "leave a tab open and check back" to "interactive exploration." A T4 / A10 with vLLM does 1.5B inference at ~50 tokens/sec, so generations of 100 to 200 tokens land in 2–4 seconds (a full 512-token response would take about 10). The user can run the full DEMO_EXAMPLES.md gallery in 5 minutes instead of 45.

This particularly matters for the **comparison-driven** nature of this work. The whole pitch is "see how SFT and GRPO change behavior on the same prompt"; that observation is qualitative and requires multiple side-by-side runs.
Slow inference makes it impractical at any scale.

## Limitations (honest)

- **1.5B parameters**: competent on isolated functions, weak on multi-file repositories or large-context reasoning. Don't expect SWE-bench wins.
- **319-prompt training set**: small; gains are bounded; we surface this explicitly in REWARD_DESIGN.md rather than oversell.
- **MBPP-shape distribution**: model is best on problems matching its training distribution (algorithmic Python functions with assert tests). Less reliable for systems code, async, or competitive-programming-heavy problems.
- **Inherits Qwen-2.5-Coder base properties**: including any biases or safety properties of the upstream model.
- **CPU inference is slow**: see "Why ZeroGPU" above.

## Educational value

This Space + the connected GitHub repo + the model card together form a **complete reference implementation** of small-scale verifiable-reward RL post-training. Specifically useful for:

- Researchers / students who want to **read the full pipeline** end-to-end without paywalls or proprietary internals
- Engineers studying **how reward hacking is prevented** mechanically (weight ratios in composite reward, KL configuration, length monitoring)
- Anyone investigating **why small fine-tunes plateau** and what GRPO is designed to fix beyond imitation learning

The connected docs (REWARD_DESIGN.md, DEMO_EXAMPLES.md, the debugging log, ablation tables in EVAL_RESULTS.md) are written specifically to be readable without prior frontier-RL context. The composite reward formula, the DeepSeek-R1 KL configuration choice, and the sampling-temperature observation on the base/SFT comparison are all explained from first principles.

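As one first-principles illustration, the "group-relative advantage" step of GRPO can be sketched as below. This is the textbook form (each reward minus the group mean, divided by the group standard deviation), not code taken from verl or from this repo; the function name, the `eps` value, and the use of the population standard deviation are assumptions made for the example.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt's sampled completions within the group.

    `rewards` holds one scalar reward per completion sampled for the same
    prompt. `eps` (an assumed value) avoids division by zero when every
    completion in the group received the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt: two pass the tests, two fail.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# All four completions fail: every advantage is zero.
uniform_group = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
print(mixed_group, uniform_group)
```

In this formulation a group whose completions all pass or all fail yields zero advantages, so only prompts with mixed outcomes push the policy in either direction.
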
## Acknowledgements

Built on:

- [Qwen-2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct): base model
- [verl](https://github.com/volcengine/verl): production GRPO trainer
- [vLLM](https://github.com/vllm-project/vllm): fast rollout engine
- [TRL](https://github.com/huggingface/trl) + [PEFT](https://github.com/huggingface/peft): SFT + LoRA
- [evalplus](https://github.com/evalplus/evalplus): robust HumanEval+/MBPP+ evaluation
- [DeepSeek-R1](https://arxiv.org/abs/2501.12948): methodology reference

Trained on the [UW-Madison CHTC](https://chtc.cs.wisc.edu/) cluster.

---

**Citation**

```bibtex
@misc{maheshwari2026verifiable,
  title  = {verifiable-rl-coder: GRPO post-training of small coding LLMs with sandboxed test-execution rewards},
  author = {Maheshwari, Devesh},
  year   = {2026},
  url    = {https://github.com/Devesh-Maheshwari/verifiable-rl-coder}
}
```