title: verifiable-rl-coder
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.56.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: GRPO-trained Qwen-1.5B coder with sandboxed test execution
tags:
  - code-generation
  - reinforcement-learning
  - grpo
  - verifiable-rewards
  - lora
  - qwen
  - educational
  - reproducible-research

verifiable-rl-coder

Live, side-by-side comparison of a base coding LLM, an SFT fine-tune, and a GRPO-trained model, with the full sandboxed test-execution pipeline running in the browser.

This Space is the interactive front-end to a complete open implementation of the verifiable-reward RL post-training technique behind DeepSeek-R1, the OpenAI o-series, and Kimi-K1.5, applied to a small open coding model (Qwen-2.5-Coder-1.5B). Everything is open: weights, training code, evaluation harness, and the multi-week debugging log of what actually broke and how it got fixed.

Try it

  1. Pick Compare (side-by-side) in the sidebar.
  2. Choose Base + SFT (and GRPO once that's available).
  3. Use a pre-filled example or write your own coding task + assert tests.
  4. Click Generate + run tests, then watch each model produce a solution and run it against your tests in a sandboxed Python subprocess.

A few prompts that cleanly differentiate the models:

  • Roman numeral conversion: base often forgets the subtractive-notation pairs (IV, IX, XL, XC, CD, CM); SFT has learned them
  • Closest-to-zero with tie-breaking: both fail, but in qualitatively different ways (base writes structurally invalid code; SFT writes the correct algorithm with one inverted comparison)
  • Array rotation with k > len: both miss the modulo; this is exactly the kind of edge case GRPO is designed to catch via test-execution feedback

Full prompt gallery + reproducible recipes: DEMO_EXAMPLES.md.
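
To make the task-plus-tests format concrete, here is a small sketch of what an array-rotation prompt and its assert tests could look like. The function name `rotate_right` and the specific test cases are hypothetical, chosen for illustration; they are not taken from DEMO_EXAMPLES.md.

```python
def rotate_right(items, k):
    """Rotate a list to the right by k positions (hypothetical example task)."""
    if not items:
        return []
    k %= len(items)  # the modulo step that handles k > len(items)
    return items[-k:] + items[:-k] if k else list(items)


# Assert-style tests, one edge case per line.
assert rotate_right([1, 2, 3, 4, 5], 2) == [4, 5, 1, 2, 3]
assert rotate_right([1, 2, 3], 4) == [3, 1, 2]  # k > len behaves like k = 1
assert rotate_right([1, 2, 3], 3) == [1, 2, 3]  # full rotation is a no-op
assert rotate_right([], 7) == []                # empty input
```

A solution that omits the `k %= len(items)` line still passes the first assert, which is why the `k > len` test is the one that separates the models.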

What this demonstrates

| Concept | How the Space shows it |
| --- | --- |
| Verifiable rewards | Every generated solution is parsed, executed, and scored by real test runs, visible to the user rather than abstracted away |
| The SFT → GRPO progression | Three models in one UI; you see what each stage of post-training adds |
| Reward hacking is real | Some prompts produce code that "looks right" but fails edge cases; the live sandbox catches it on the spot |
| Small models can be improved | LoRA-rank-16 SFT on 319 prompts gives +1.1 pts HumanEval+ pass@1; GRPO targets the remaining shared blind spots |

Technical approach

                ┌──────────────────────────────────┐
                │      Streamlit UI (this Space)   │
                │  prompt + tests → patches + runs │
                └──────────────┬───────────────────┘
                               │
                ┌──────────────▼───────────────────┐
                │           Proposer               │
                │  Qwen-1.5B / +LoRA-SFT / +GRPO   │
                └──────────────┬───────────────────┘
                               │
                ┌──────────────▼───────────────────┐
                │           Verifier               │
                │  subprocess pytest in sandbox    │
                │  (5s timeout, isolated workdir)  │
                └──────────────┬───────────────────┘
                               │ pass/fail + composite reward
                               ▼
   Offline (training, not in this Space):
   GRPO rollout buffer → reward → group-relative advantage → LoRA update
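
The Verifier box can be approximated in a few lines. This is a minimal sketch, not the Space's actual code: the function name `run_tests` is hypothetical, it runs the candidate with the plain Python interpreter rather than pytest to stay dependency-free, and a temporary directory plus a timeout is only process isolation, not a security boundary.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(solution: str, tests: str, timeout_s: float = 5.0) -> bool:
    """Return True if the solution passes every assert within the timeout."""
    with tempfile.TemporaryDirectory() as workdir:  # isolated scratch directory
        script = Path(workdir) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # infinite loops and very slow code count as failures
        # A failed assert raises AssertionError, which gives a non-zero exit code.
        return result.returncode == 0


candidate = "def add(a, b):\n    return a + b"
passed = run_tests(candidate, "assert add(2, 3) == 5")
```

Running generated code in a child process means a crash, a hang, or a stray `sys.exit` in the candidate cannot take down the UI process.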

Composite reward: R = 1.0·correctness + 0.05·lint + 0.05·runtime + 0.01·length. Correctness uses real test execution (binary pass/fail in [0, 1]). The weight ratio of at least 20× between correctness and each auxiliary signal (100× for length) mechanically prevents reward hacking via short-but-wrong code or lint-clean stubs. Full breakdown: REWARD_DESIGN.md.
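
The formula above translates directly into code. This is a sketch under one stated assumption: every component score, including the auxiliary ones, is normalized to [0, 1]. The names `WEIGHTS` and `composite_reward` are hypothetical; REWARD_DESIGN.md is the authoritative description.

```python
# Weights copied from the formula above.
WEIGHTS = {"correctness": 1.0, "lint": 0.05, "runtime": 0.05, "length": 0.01}


def composite_reward(correctness, lint, runtime, length):
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    scores = {"correctness": correctness, "lint": lint,
              "runtime": runtime, "length": length}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Best case for a wrong solution: perfect auxiliary scores, zero correctness.
best_wrong = composite_reward(0.0, 1.0, 1.0, 1.0)
# Worst case for a correct solution: zero on every auxiliary signal.
worst_correct = composite_reward(1.0, 0.0, 0.0, 0.0)
assert best_wrong < worst_correct
```

Under that normalization the auxiliary terms sum to at most 0.11, so no combination of lint, runtime, and length scores can lift a failing solution above a passing one.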

Training data: 319 MBPP-train prompts, contamination-filtered against the MBPP+ test set; 2,580 rejection-sampled solutions kept after sandboxed test execution.

KL configuration: DeepSeek-R1 style (KL added to the loss, not the reward), kl_loss_coef = 0.04, kl_loss_type = low_var_kl. This is tighter than R1's 0.001 default, a defensive choice given the small training set.
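
The two training-side quantities named in this section, the group-relative advantage and the KL penalty, can be sketched in plain Python. Several things here are assumptions rather than facts about this project's trainer: that `low_var_kl` denotes the k3 estimator `exp(d) - d - 1` with `d = ref_logp - logp`, that the result is clipped at 10, and that advantages use the population standard deviation. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; trainers may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def low_var_kl(logp, ref_logp, clip=10.0):
    """Per-token k3 estimate of KL(policy || reference); always >= 0."""
    delta = ref_logp - logp
    delta = max(min(delta, 20.0), -20.0)  # guard math.exp against overflow
    kl = math.exp(delta) - delta - 1.0
    return min(kl, clip)  # assumption: clip value of 10


KL_LOSS_COEF = 0.04  # value quoted above

# Four completions for one prompt: two pass their tests, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# "KL added to the loss": the penalty is a separate loss term, not part of the reward.
kl_penalty = KL_LOSS_COEF * low_var_kl(logp=-1.2, ref_logp=-1.0)
```

Because advantages are centered within each group, a prompt where every completion gets the same reward contributes no policy gradient, which is one reason prompts that all models already solve add little during GRPO.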

Results (current snapshot)

Evaluated with evalplus at temperature 0.2, n=5 samples per task.

| Model | HumanEval+ pass@1 | HumanEval+ pass@5 |
| --- | --- | --- |
| Base Qwen-2.5-Coder-1.5B | 0.6268 | 0.7073 |
| LoRA SFT (this work) | 0.6378 | 0.6951 |
| GRPO (training in progress) | TBD | TBD |
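
For readers unfamiliar with the metric, pass@k is usually computed per task with the unbiased estimator 1 - C(n-c, k)/C(n, k), where n is the number of samples and c the number that passed, and then averaged over tasks. The sketch below assumes that is the estimator behind the numbers above; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


# One task, 5 samples, 3 of which passed.
p1 = pass_at_k(n=5, c=3, k=1)  # fraction of samples that passed
p5 = pass_at_k(n=5, c=3, k=5)  # 1.0 as soon as any sample passed
```

With n = 5, pass@5 only asks whether at least one of the five samples passed, so it rewards diversity across samples, while pass@1 rewards consistency.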

The SFT delta on pass@1 (+1.1 pts) is statistically modest: it sits inside the ~3.8 pt noise floor for n=164, and pass@5 is slightly lower than base. This is documented honestly in the model card. The qualitative analysis in DEMO_EXAMPLES.md shows where the gain comes from: targeted improvements on problems requiring specific structured-knowledge patterns (Roman numeral subtractive notation, edge-case-aware list operations), with non-destructive behavior on the ~70% of problems where base was already correct.
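
One way to arrive at a noise floor of about 3.8 pts is the binomial standard error of a pass rate measured on 164 tasks. This is an assumption about how the figure was derived, and it ignores the averaging over 5 samples per task; the sketch only shows that the arithmetic is consistent with the table.

```python
from math import sqrt


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return sqrt(p * (1.0 - p) / n)


N_TASKS = 164                 # n quoted above
base, sft = 0.6268, 0.6378    # pass@1 values from the results table

se_pts = 100 * binomial_standard_error(base, N_TASKS)
delta_pts = 100 * (sft - base)

print(f"standard error: {se_pts:.1f} pts")    # about 3.8
print(f"SFT delta:      {delta_pts:.1f} pts") # about 1.1
```

The delta is well under one standard error, which is why the pass@1 improvement alone cannot be called significant.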

Open artifacts

Why ZeroGPU would meaningfully improve this Space

Running on CPU basic, generating a single 512-token response takes 30–60 seconds. The side-by-side compare mode triggers two such generations sequentially, so a recruiter or researcher exploring the demo waits ~90 seconds per click. That latency throws away the demo's actual value: you can't feel the model differences when each comparison takes minutes.

ZeroGPU would change this from "leave a tab open and check back" to "interactive exploration." A T4 / A10 with vLLM does 1.5B inference at ~50 tokens/sec, so generations land in 2–4 seconds. The user can run the full DEMO_EXAMPLES.md gallery in 5 minutes instead of 45.

This particularly matters for the comparison-driven nature of this work. The whole pitch is "see how SFT and GRPO change behavior on the same prompt"; that observation is qualitative and requires multiple side-by-side runs. Slow inference makes it impractical at any scale.

Limitations (honest)

  • 1.5B parameters: competent on isolated functions, weak on multi-file repositories or large-context reasoning. Don't expect SWE-bench wins.
  • 319-prompt training set: small, so gains are bounded; we surface this explicitly in REWARD_DESIGN.md rather than oversell.
  • MBPP-shape distribution: the model is best on problems matching its training distribution (algorithmic Python functions with assert tests). It is less reliable for systems code, async, or competitive-programming-heavy problems.
  • Inherits Qwen-2.5-Coder base properties, including any biases or safety properties of the upstream model.
  • CPU inference is slow: see "Why ZeroGPU" above.

Educational value

This Space + the connected GitHub repo + the model card together form a complete reference implementation of small-scale verifiable-reward RL post-training. Specifically useful for:

  • Researchers / students who want to read the full pipeline end-to-end without paywalls or proprietary internals
  • Engineers studying how reward hacking is prevented mechanically (weight ratios in composite reward, KL configuration, length monitoring)
  • Anyone investigating why small fine-tunes plateau and what GRPO is designed to fix beyond imitation learning

The connected docs (REWARD_DESIGN.md, DEMO_EXAMPLES.md, the debugging log, ablation tables in EVAL_RESULTS.md) are written specifically to be readable without prior frontier-RL context. The composite reward formula, the DeepSeek-R1 KL configuration choice, and the sampling-temperature observation on base/SFT comparison are all explained from first principles.

Acknowledgements

Built on:

Trained on the UW-Madison CHTC cluster.


Citation

@misc{maheshwari2026verifiable,
  title  = {verifiable-rl-coder: GRPO post-training of small coding LLMs with sandboxed test-execution rewards},
  author = {Maheshwari, Devesh},
  year   = {2026},
  url    = {https://github.com/Devesh-Maheshwari/verifiable-rl-coder}
}