---
title: verifiable-rl-coder
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.56.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: GRPO-trained Qwen-1.5B coder with sandboxed test execution
tags:
  - code-generation
  - reinforcement-learning
  - grpo
  - verifiable-rewards
  - lora
  - qwen
  - educational
  - reproducible-research
---

# verifiable-rl-coder

> **Live, side-by-side comparison of a base coding LLM, an SFT fine-tune, and
> a GRPO-trained model, with the full sandboxed test-execution pipeline
> running in the browser.**

This Space is the interactive front-end to a complete open implementation of
the **verifiable-reward RL post-training** technique behind DeepSeek-R1, the
OpenAI o-series, and Kimi-K1.5, applied to a small open coding model
(Qwen-2.5-Coder-1.5B). Everything is open: weights, training code, evaluation
harness, and the multi-week debugging log of what actually broke and how it
got fixed.

## Try it

1. Pick **Compare (side-by-side)** in the sidebar.
2. Choose **Base + SFT** (and **GRPO** once that's available).
3. Use a pre-filled example or write your own coding task + assert tests.
4. Click **Generate + run tests** and watch each model produce a solution and
   run it against your tests in a sandboxed Python subprocess.

A few prompts that cleanly differentiate the models:

- **Roman numeral conversion**: base often forgets the subtractive-notation
  pairs (IV, IX, XL, XC, CD, CM); SFT has learned them
- **Closest-to-zero with tie-breaking**: both fail, but in qualitatively
  different ways (base writes structurally invalid code; SFT writes the
  correct algorithm with one inverted comparison)
- **Array rotation with k > len**: both miss the modulo; this is exactly
  the kind of edge case GRPO is designed to catch via test-execution feedback

Full prompt gallery + reproducible recipes:
[DEMO_EXAMPLES.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/DEMO_EXAMPLES.md).
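
To make the array-rotation case concrete, here is a minimal sketch of what a task plus assert tests could look like. The function name `rotate_right` and the specific test cases are illustrative assumptions, not copied from the gallery.

```python
def rotate_right(items, k):
    """Rotate a list to the right by k positions (illustrative reference solution)."""
    if not items:
        return []
    k %= len(items)  # the step that is easy to miss when k > len(items)
    return items[-k:] + items[:-k] if k else list(items)


# Assert-style tests in the shape the Space accepts (hypothetical cases).
assert rotate_right([1, 2, 3, 4, 5], 2) == [4, 5, 1, 2, 3]
assert rotate_right([1, 2, 3], 3) == [1, 2, 3]  # k == len: full cycle
assert rotate_right([1, 2, 3], 7) == [3, 1, 2]  # k > len: needs k % len
assert rotate_right([], 4) == []
```

Dropping the `k %= len(items)` line makes the third assert fail, because slicing with `-7` on a three-element list returns the list unchanged.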

## What this demonstrates

| Concept | How the Space shows it |
|---|---|
| **Verifiable rewards** | Every generated solution is parsed, executed, and scored by real test runs, visible to the user rather than abstracted away |
| **The SFT → GRPO progression** | Three models in one UI; you see what each stage of post-training adds |
| **Reward hacking is real** | Some prompts produce code that "looks right" but fails edge cases; the live sandbox catches it on the spot |
| **Small models can be improved** | LoRA-rank-16 SFT on 319 prompts gives +1.1 pts HumanEval+; GRPO targets the remaining shared blind spots |

## Technical approach

```
┌──────────────────────────────────┐
│  Streamlit UI (this Space)       │
│  prompt + tests → patches + runs │
└────────────────┬─────────────────┘
                 │
┌────────────────▼─────────────────┐
│  Proposer                        │
│  Qwen-1.5B / +LoRA-SFT / +GRPO   │
└────────────────┬─────────────────┘
                 │
┌────────────────▼─────────────────┐
│  Verifier                        │
│  subprocess pytest in sandbox    │
│  (5s timeout, isolated workdir)  │
└────────────────┬─────────────────┘
                 │ pass/fail + composite reward
                 ▼
Offline (training, not in this Space):
  GRPO rollout buffer → reward → group-relative advantage → LoRA update
```
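
The Verifier box can be approximated with the standard library alone. The sketch below is a simplified stand-in, not the repository's runner: it runs plain assert tests with the Python interpreter rather than pytest, and it applies only a throwaway working directory and a wall-clock timeout (no seccomp or resource limits). The name `run_in_sandbox` and the returned dictionary shape are assumptions for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_sandbox(solution: str, tests: str, timeout_s: float = 5.0) -> dict:
    """Run candidate code plus assert tests in a child Python process."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,           # isolated, temporary working directory
                capture_output=True,
                text=True,
                timeout=timeout_s,     # child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "reason": "timeout"}
    if proc.returncode == 0:
        return {"passed": True, "reason": "ok"}
    # Keep only the last stderr line, which is usually the exception summary.
    lines = proc.stderr.strip().splitlines()
    return {"passed": False, "reason": lines[-1] if lines else "nonzero exit"}
```

A call such as `run_in_sandbox("def add(a, b):\n    return a + b", "assert add(1, 2) == 3")` reports a pass, while a solution that loops forever is cut off at the timeout. A process boundary alone is not a security boundary, which is why the real runner adds seccomp and resource limits.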

**Composite reward**: `R = 1.0·correctness + 0.05·lint + 0.05·runtime + 0.01·length`.
Correctness uses real test execution (binary pass/fail in `[0, 1]`).
Correctness outweighs each auxiliary signal by at least 20× (100× for
length), which mechanically prevents reward hacking via short-but-wrong code
or lint-clean stubs. Full breakdown:
[REWARD_DESIGN.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/REWARD_DESIGN.md).
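
The formula can be written out directly. This sketch assumes every component has already been normalized to `[0, 1]`, which the README states only for correctness; the function name and the range check are illustrative additions.

```python
def composite_reward(correctness: float, lint: float,
                     runtime: float, length: float) -> float:
    """R = 1.0*correctness + 0.05*lint + 0.05*runtime + 0.01*length.

    Assumption: each component is already normalized to [0, 1].
    """
    components = {"correctness": correctness, "lint": lint,
                  "runtime": runtime, "length": length}
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 1.0 * correctness + 0.05 * lint + 0.05 * runtime + 0.01 * length


# A wrong solution with perfect auxiliary scores still trails a correct
# solution with zero auxiliary scores.
best_wrong = composite_reward(0.0, 1.0, 1.0, 1.0)
worst_right = composite_reward(1.0, 0.0, 0.0, 0.0)
assert best_wrong < worst_right
```

Under that normalization assumption the auxiliary terms sum to at most 0.11, so no combination of lint, runtime, and length scores can lift a failing solution above a passing one.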

**Training data**: 319 MBPP-train prompts, contamination-filtered against
the MBPP+ test set; 2,580 rejection-sampled solutions kept after sandboxed
test execution.
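
As a rough illustration of those two data steps, the sketch below filters prompts by normalized exact match and keeps only the candidate solutions a verifier accepts. The matching rule, the function names, and the `passes_tests` callable are all assumptions; the repository may use a different contamination criterion, such as n-gram overlap.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse non-alphanumerics so near-identical prompts match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def filter_contaminated(train_prompts, eval_prompts):
    """Drop training prompts whose normalized text appears in the eval set."""
    seen = {normalize(p) for p in eval_prompts}
    return [p for p in train_prompts if normalize(p) not in seen]


def rejection_sample(candidates, passes_tests):
    """Keep only the candidate solutions accepted by the verifier callable."""
    return [c for c in candidates if passes_tests(c)]
```

In practice `passes_tests` would wrap a sandboxed test run, so that a solution enters the SFT set only after its tests actually pass.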

**KL configuration**: DeepSeek-R1 style (KL added to the loss, not the
reward), `kl_loss_coef = 0.04`, `kl_loss_type = low_var_kl`. The coefficient
is tighter than R1's 0.001 default, a defensive choice given the small
training set.
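
The offline update in the diagram and the KL settings above can be sketched in plain Python. Two assumptions are made here: advantages are normalized with the population standard deviation of the rollout group (implementations differ on this detail), and `low_var_kl` denotes the per-token "k3" estimator `exp(d) - d - 1` with `d = ref_logprob - logprob`. The function names are illustrative and are not verl's API.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def low_var_kl(logprob, ref_logprob):
    """Per-token k3 estimate of KL(policy || reference); never negative."""
    d = ref_logprob - logprob
    return math.exp(d) - d - 1.0


def kl_loss_term(logprobs, ref_logprobs, kl_loss_coef=0.04):
    """Mean per-token KL estimate, scaled and added to the policy loss."""
    kls = [low_var_kl(lp, ref) for lp, ref in zip(logprobs, ref_logprobs)]
    return kl_loss_coef * sum(kls) / len(kls)
```

When every rollout in a group earns the same reward, all advantages are zero and that group contributes no policy gradient. This is one reason prompts that the model always solves or always fails add little to GRPO training.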

## Results (current snapshot)

Evaluated with [evalplus](https://github.com/evalplus/evalplus) at
temperature 0.2, n=5 samples per task.

| Model | HumanEval+ pass@1 | HumanEval+ pass@5 |
|---|---|---|
| Base Qwen-2.5-Coder-1.5B | 0.6268 | 0.7073 |
| LoRA SFT (this work) | **0.6378** | 0.6951 |
| GRPO (training in progress) | TBD | TBD |

The SFT pass@1 delta (+1.1 pts) is statistically modest: it sits inside the
~3.8 pt noise floor for n=164 tasks, and pass@5 dips by 1.2 pts. This is
documented honestly in the model card. The qualitative analysis in
DEMO_EXAMPLES.md shows where the gain comes from: targeted improvements on
problems requiring specific structured-knowledge patterns (Roman numeral
subtractive notation, edge-case-aware list operations), with non-destructive
behavior on the ~70% of problems where base was already correct.
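
For readers who want to check the numbers, the sketch below shows the standard unbiased pass@k estimator introduced with HumanEval (assumed here to be the form evalplus reports) and one plausible reading of the ~3.8 pt noise floor as a single binomial standard error over 164 tasks. That reading reproduces the figure but is an assumption, not something the model card is quoted as saying.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: 1 - C(n - c, k) / C(n, k).

    n = samples drawn, c = samples that passed, k = budget being scored.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def binomial_standard_error(p: float, n_tasks: int) -> float:
    """Standard error of a pass rate measured over n_tasks independent tasks."""
    return math.sqrt(p * (1.0 - p) / n_tasks)


se = binomial_standard_error(0.6268, 164)  # roughly 0.038, about 3.8 points
delta = 0.6378 - 0.6268                    # about 0.011, the +1.1 pt SFT gain
print(f"standard error: {se:.4f}, SFT delta: {delta:.4f}")
```

Because the delta is well under one standard error, the base and SFT pass@1 scores cannot be separated on this benchmark alone, which is why the qualitative comparison carries most of the argument.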

## Open artifacts

- **SFT model**: [dmaheshwar22/qwen-1.5b-coder-sft-v1](https://huggingface.co/dmaheshwar22/qwen-1.5b-coder-sft-v1)
  (LoRA adapter, 17 MB, full model card with training data + hyperparameters)
- **GRPO model**: [dmaheshwar22/qwen-1.5b-coder-grpo-v1](https://huggingface.co/dmaheshwar22/qwen-1.5b-coder-grpo-v1)
  (will go live once 200-step training completes)
- **Source code**: [github.com/Devesh-Maheshwari/verifiable-rl-coder](https://github.com/Devesh-Maheshwari/verifiable-rl-coder)
- **Training infrastructure**: HTCondor submit scripts, verl + vLLM training
  config, sandbox runner with seccomp / resource limits, all in the repo
- **Debugging log**: [grpo-chtc-debugging-log.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/grpo-chtc-debugging-log.md)
  documents the 12 distinct failure modes hit while getting verl + CHTC +
  vLLM + Hydra to actually run a training step. Most ML projects hide this;
  we publish it because it's the most useful artifact for someone trying to
  reproduce the work.

## Why ZeroGPU would meaningfully improve this Space

Running on CPU basic, generating a single 512-token response takes 30–60
seconds. The side-by-side compare mode triggers two such generations
sequentially, so a recruiter or researcher exploring the demo waits
~90 seconds per click. That latency throws away the demo's actual value:
you can't *feel* the model differences when each comparison takes minutes.

ZeroGPU would change this from "leave a tab open and check back" to
"interactive exploration." A T4 / A10 with vLLM does 1.5B inference at
~50 tokens/sec, so a 100–200 token solution lands in 2–4 seconds (a full
512-token response takes about 10 seconds). The user can run the full
DEMO_EXAMPLES.md gallery in 5 minutes instead of 45.

This particularly matters for the **comparison-driven** nature of this work.
The whole pitch is "see how SFT and GRPO change behavior on the same prompt",
and that observation is qualitative and requires multiple side-by-side runs.
Slow inference makes it impractical at any scale.

## Limitations (honest)

- **1.5B parameters**: competent on isolated functions, weak on multi-file
  repositories or large-context reasoning. Don't expect SWE-bench wins.
- **319-prompt training set**: small, so gains are bounded; we surface this
  explicitly in REWARD_DESIGN.md rather than oversell.
- **MBPP-shape distribution**: the model is best on problems matching its
  training distribution (algorithmic Python functions with assert tests).
  It is less reliable for systems code, async, or
  competitive-programming-heavy problems.
- **Inherits Qwen-2.5-Coder base properties**, including any biases or
  safety properties of the upstream model.
- **CPU inference is slow**: see "Why ZeroGPU" above.

## Educational value

This Space + the connected GitHub repo + the model card together form a
**complete reference implementation** of small-scale verifiable-reward RL
post-training. Specifically useful for:

- Researchers / students who want to **read the full pipeline** end-to-end
  without paywalls or proprietary internals
- Engineers studying **how reward hacking is prevented** mechanically
  (weight ratios in composite reward, KL configuration, length monitoring)
- Anyone investigating **why small fine-tunes plateau** and what GRPO is
  designed to fix beyond imitation learning

The connected docs (REWARD_DESIGN.md, DEMO_EXAMPLES.md, the debugging log,
ablation tables in EVAL_RESULTS.md) are written specifically to be readable
without prior frontier-RL context. The composite reward formula, the
DeepSeek-R1 KL configuration choice, and the sampling-temperature observation
on base/SFT comparison are all explained from first principles.

## Acknowledgements

Built on:

- [Qwen-2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct): base model
- [verl](https://github.com/volcengine/verl): production GRPO trainer
- [vLLM](https://github.com/vllm-project/vllm): fast rollout engine
- [TRL](https://github.com/huggingface/trl) + [PEFT](https://github.com/huggingface/peft): SFT + LoRA
- [evalplus](https://github.com/evalplus/evalplus): robust HumanEval+/MBPP+ evaluation
- [DeepSeek-R1](https://arxiv.org/abs/2501.12948): methodology reference

Trained on the [UW-Madison CHTC](https://chtc.cs.wisc.edu/) cluster.

---

**Citation**

```bibtex
@misc{maheshwari2026verifiable,
  title  = {verifiable-rl-coder: GRPO post-training of small coding LLMs with sandboxed test-execution rewards},
  author = {Maheshwari, Devesh},
  year   = {2026},
  url    = {https://github.com/Devesh-Maheshwari/verifiable-rl-coder}
}
```