
	---
	title: verifiable-rl-coder
	emoji: 🤖
	colorFrom: blue
	colorTo: green
	sdk: streamlit
	sdk_version: 1.56.0
	app_file: app.py
	pinned: false
	license: apache-2.0
	short_description: GRPO-trained Qwen-1.5B coder with sandboxed test execution
	tags:
	- code-generation
	- reinforcement-learning
	- grpo
	- verifiable-rewards
	- lora
	- qwen
	- educational
	- reproducible-research
	---

	# verifiable-rl-coder

	> **Live, side-by-side comparison of a base coding LLM, an SFT fine-tune, and
	> a GRPO-trained model — with the full sandboxed test-execution pipeline
	> running in the browser.**

	This Space is the interactive front-end to a complete open implementation of
	the verifiable-reward RL post-training technique behind DeepSeek-R1, the
	OpenAI o-series, and Kimi-K1.5 — applied to a small open coding model
	(Qwen-2.5-Coder-1.5B). Everything is open: weights, training code, evaluation
	harness, and the multi-week debugging log of what actually broke and how it
	got fixed.

	## Try it

	1. Pick **Compare (side-by-side)** in the sidebar.
	2. Choose **Base + SFT** (and GRPO once that's available).
	3. Use a pre-filled example or write your own coding task + assert tests.
	4. Click **Generate + run tests**, then watch each model produce a solution
	   and run it against your tests in a sandboxed Python subprocess.
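
Step 4 can be sketched roughly as follows. This is a minimal illustration, not the Space's actual runner: `run_with_tests` and its return format are hypothetical names, and a timeout plus a throwaway working directory is weaker isolation than the seccomp and resource limits the repo describes.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_tests(solution: str, tests: str, timeout_s: float = 5.0) -> dict:
    """Run candidate code plus assert-style tests in a separate Python process.

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: Python isolated mode
                cwd=workdir,            # throwaway working directory
                capture_output=True,
                text=True,
                timeout=timeout_s,      # child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "reason": "timeout", "stderr": ""}
    passed = proc.returncode == 0  # a failed assert exits with a non-zero code
    return {
        "passed": passed,
        "reason": "ok" if passed else "error",
        "stderr": proc.stderr,
    }
```

A failing assert shows up as a non-zero exit code with the traceback in `stderr`, which is what lets the UI display why a candidate failed.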

	A few prompts that cleanly differentiate the models:

	- **Roman numeral conversion:** base often forgets the subtractive-notation
	  pairs (IV, IX, XL, XC, CD, CM); SFT learned them.
	- **Closest-to-zero with tie-breaking:** both fail, but in qualitatively
	  different ways (base writes structurally invalid code; SFT writes the
	  correct algorithm with one inverted comparison).
	- **Array rotation with k > len:** both miss the modulo; this is exactly
	  the kind of edge case GRPO is designed to catch via test-execution feedback.
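
To make the rotation case concrete, here is a hypothetical task in the same shape (function plus assert tests). The exact prompt in the gallery may differ; the name `rotate_right` and the rotation direction are assumptions.

```python
def rotate_right(nums: list, k: int) -> list:
    """Return a copy of nums rotated right by k positions."""
    if not nums:
        return []
    k %= len(nums)  # the modulo step that matters once k > len(nums)
    return nums[-k:] + nums[:-k] if k else list(nums)


assert rotate_right([1, 2, 3, 4, 5], 2) == [4, 5, 1, 2, 3]
assert rotate_right([1, 2, 3], 4) == [3, 1, 2]  # k > len: same as k = 1
assert rotate_right([1, 2, 3], 3) == [1, 2, 3]  # full rotation is a no-op
assert rotate_right([], 7) == []
```

Without the `k %= len(nums)` line the second assert fails, which is the kind of signal a test-execution reward can pick up.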

	Full prompt gallery + reproducible recipes:
	[DEMO_EXAMPLES.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/DEMO_EXAMPLES.md).

	## What this demonstrates

	| Concept | How the Space shows it |
	|---|---|
	| Verifiable rewards | Every generated solution is parsed, executed, and scored by real test runs, all visible to the user rather than abstracted away |
	| The SFT → GRPO progression | Three models in one UI; you see what each stage of post-training adds |
	| Reward hacking is real | Some prompts produce code that "looks right" but fails edge cases; the live sandbox catches it on the spot |
	| Small models can be improved | LoRA-rank-16 SFT on 319 prompts gives +1.1 pts HumanEval+ pass@1; GRPO targets the remaining shared blind spots |

	## Technical approach

	```
	┌──────────────────────────────────┐
	│  Streamlit UI (this Space)       │
	│  prompt + tests → patches + runs │
	└──────────────┬───────────────────┘
	               │
	┌──────────────▼───────────────────┐
	│  Proposer                        │
	│  Qwen-1.5B / +LoRA-SFT / +GRPO   │
	└──────────────┬───────────────────┘
	               │
	┌──────────────▼───────────────────┐
	│  Verifier                        │
	│  subprocess pytest in sandbox    │
	│  (5s timeout, isolated workdir)  │
	└──────────────┬───────────────────┘
	               │ pass/fail + composite reward
	               ▼
	Offline (training, not in this Space):
	  GRPO rollout buffer → reward → group-relative advantage → LoRA update
	```
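
The "group-relative advantage" step in the offline line can be sketched as below: each sampled solution's reward is normalized against the other samples for the same prompt. This is a minimal sketch of the standard GRPO formulation, not the verl implementation; the function name, the `eps` value, and the use of the population standard deviation are assumptions (implementations differ on the last point).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize rewards within one prompt's group of sampled solutions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    # eps keeps the division finite when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: two pass the tests, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every sample in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.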

	**Composite reward:** `R = 1.0·correctness + 0.05·lint + 0.05·runtime + 0.01·length`.
	Correctness comes from real test execution and is binary (pass/fail, a value
	in `{0, 1}`). It outweighs lint and runtime by 20× each and length by 100×,
	so even a perfect score on all three auxiliary signals adds at most 0.11
	(assuming each is scaled to `[0, 1]`). Short-but-wrong code or a lint-clean
	stub therefore cannot outscore a passing solution. Full breakdown:
	[REWARD_DESIGN.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/REWARD_DESIGN.md).
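
A small sketch of the formula above, under the assumption that every auxiliary signal is already scaled to `[0, 1]`. The function and dictionary names are illustrative; how lint, runtime, and length are actually scored is described in REWARD_DESIGN.md, not here.

```python
WEIGHTS = {"correctness": 1.0, "lint": 0.05, "runtime": 0.05, "length": 0.01}


def composite_reward(correctness: float, lint: float,
                     runtime: float, length: float) -> float:
    """Weighted sum of the four signals; each input is assumed to be in [0, 1]."""
    return (
        WEIGHTS["correctness"] * correctness
        + WEIGHTS["lint"] * lint
        + WEIGHTS["runtime"] * runtime
        + WEIGHTS["length"] * length
    )


# A wrong solution with perfect auxiliary scores...
best_wrong = composite_reward(0.0, 1.0, 1.0, 1.0)
# ...still scores far below a correct solution with the worst auxiliary scores.
worst_right = composite_reward(1.0, 0.0, 0.0, 0.0)
assert best_wrong < worst_right
```

The gap between the two cases is the design point: the auxiliary terms can only break ties among solutions with the same correctness.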

	**Training data:** 319 MBPP-train prompts, contamination-filtered against
	the MBPP+ test set; 2,580 rejection-sampled solutions were kept after
	sandboxed test execution.

	**KL configuration:** DeepSeek-R1 style (KL added to the loss, not the
	reward), with `kl_loss_coef = 0.04` and `kl_loss_type = low_var_kl`. The
	coefficient is 40× R1's 0.001 default, a deliberately stronger pull toward
	the reference policy given the small training set.
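
For readers unfamiliar with `low_var_kl`: to my understanding it corresponds to the low-variance "k3" estimator of KL divergence, computed per token from the policy and reference log-probabilities. The sketch below is plain Python for illustration, not verl code; the function names and the clip value of 10 are assumptions.

```python
import math


def low_var_kl(logprob: float, ref_logprob: float, clip: float = 10.0) -> float:
    """k3 estimate of KL(policy || reference) for one token sampled from the policy."""
    log_ratio = ref_logprob - logprob
    k3 = math.exp(log_ratio) - log_ratio - 1.0  # non-negative for any log_ratio
    return max(-clip, min(clip, k3))            # clip value is an assumption


def loss_with_kl(policy_loss: float, logprobs: list, ref_logprobs: list,
                 kl_loss_coef: float = 0.04) -> float:
    """Add the mean per-token KL estimate to the loss (not to the reward)."""
    kls = [low_var_kl(lp, ref) for lp, ref in zip(logprobs, ref_logprobs)]
    return policy_loss + kl_loss_coef * sum(kls) / len(kls)
```

The estimate is zero when the policy and reference agree on a token and grows as they diverge, so a larger `kl_loss_coef` keeps the fine-tuned model closer to its starting point.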

	## Results (current snapshot)

	Evaluated with [evalplus](https://github.com/evalplus/evalplus) at
	temperature 0.2, n=5 samples per task.

	| Model | HumanEval+ pass@1 | HumanEval+ pass@5 |
	|---|---|---|
	| Base Qwen-2.5-Coder-1.5B | 0.6268 | 0.7073 |
	| LoRA SFT (this work) | 0.6378 | 0.6951 |
	| GRPO (training in progress) | TBD | TBD |
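
For reference, pass@k numbers like these are commonly computed per task with the unbiased estimator from the Codex paper and then averaged over tasks. The sketch below assumes that estimator is what produced the table (I have not checked the exact evalplus code path); `n` is the number of samples per task and `c` the number that pass.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes, given c passed."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws with failing samples
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n=5 samples per task, as in the evaluation setup above:
one_sample = pass_at_k(5, 2, 1)   # fraction of passing samples
five_samples = pass_at_k(5, 2, 5)  # 1.0 as soon as any sample passes
```

With `n = 5`, pass@1 is the per-task fraction of passing samples, while pass@5 only asks whether any of the five samples passed.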

	The SFT delta is within noise: +1.1 pts on pass@1 and -1.2 pts on pass@5,
	against a ~3.8 pt noise floor (roughly one binomial standard error) for
	n=164 tasks. This is documented honestly in the model card. The qualitative
	analysis in DEMO_EXAMPLES.md shows where the pass@1 gain comes from: targeted
	improvements on problems requiring specific structured-knowledge patterns
	(Roman numeral subtractive notation, edge-case-aware list operations), with
	non-destructive behavior on the ~70% of problems where base was already
	correct.
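
The ~3.8 pt figure is consistent with a simple binomial standard error over the 164 HumanEval+ tasks. The check below is a back-of-the-envelope sketch under that assumption: it treats each task as one independent pass/fail trial and ignores the averaging over 5 samples per task, so it is approximate.

```python
from math import sqrt


def binomial_se(p: float, n: int) -> float:
    """Standard error of a pass rate p estimated from n independent tasks."""
    return sqrt(p * (1.0 - p) / n)


base_pass1, sft_pass1, n_tasks = 0.6268, 0.6378, 164
se = binomial_se(base_pass1, n_tasks)  # roughly 0.038, i.e. about 3.8 pts
delta = sft_pass1 - base_pass1         # 0.011, i.e. +1.1 pts

# The observed gain is well under one standard error.
assert delta < se
```

A difference this far inside one standard error cannot be distinguished from sampling noise on a 164-task benchmark, which is why the qualitative examples carry most of the argument.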

	## Open artifacts

	- SFT model: [dmaheshwar22/qwen-1.5b-coder-sft-v1](https://huggingface.co/dmaheshwar22/qwen-1.5b-coder-sft-v1)
	(LoRA adapter, 17 MB, full model card with training data + hyperparameters)
	- GRPO model: [dmaheshwar22/qwen-1.5b-coder-grpo-v1](https://huggingface.co/dmaheshwar22/qwen-1.5b-coder-grpo-v1)
	(will go live once 200-step training completes)
	- Source code: [github.com/Devesh-Maheshwari/verifiable-rl-coder](https://github.com/Devesh-Maheshwari/verifiable-rl-coder)
	- Training infrastructure: HTCondor submit scripts, verl + vLLM training
	config, sandbox runner with seccomp / resource limits — all in the repo
	- Debugging log: [grpo-chtc-debugging-log.md](https://github.com/Devesh-Maheshwari/verifiable-rl-coder/blob/main/docs/grpo-chtc-debugging-log.md)
	documents the 12 distinct failure modes hit while getting verl + CHTC + vLLM
	+ Hydra to actually run a training step. Most ML projects hide this; we
	publish it because it's the most useful artifact for someone trying to
	reproduce the work.

	## Why ZeroGPU would meaningfully improve this Space

	Running on CPU basic, generating a single 512-token response takes 30–60
	seconds. The side-by-side compare mode triggers two such generations
	sequentially — so a recruiter or researcher exploring the demo waits
	~90 seconds per click. That latency throws away the demo's actual value:
	you can't feel the model differences when each comparison takes minutes.

	ZeroGPU would change this from "leave a tab open and check back" to
	"interactive exploration." A T4 / A10 with vLLM does 1.5B inference at
	~50 tokens/sec, so a typical 100–200 token solution lands in 2–4 seconds and
	even a full 512-token response takes about 10 seconds. The user can run the
	full DEMO_EXAMPLES.md gallery in 5 minutes instead of 45.

	This particularly matters for the comparison-driven nature of this work.
	The whole pitch is "see how SFT and GRPO change behavior on the same prompt"
	— that observation is qualitative and requires multiple side-by-side runs.
	Slow inference makes it impractical at any scale.

	## Limitations (honest)

	- 1.5B parameters — competent on isolated functions, weak on multi-file
	repositories or large-context reasoning. Don't expect SWE-bench wins.
	- 319-prompt training set — small; gains are bounded; we surface this
	explicitly in REWARD_DESIGN.md rather than oversell.
	- MBPP-shape distribution — model is best on problems matching its
	training distribution (algorithmic Python functions with assert tests).
	Less reliable for systems code, async, or competitive-programming-heavy
	problems.
	- Inherits Qwen-2.5-Coder base properties — including any biases or
	safety properties of the upstream model.
	- CPU inference is slow — see "Why ZeroGPU" above.

	## Educational value

	This Space + the connected GitHub repo + the model card together form a
	complete reference implementation of small-scale verifiable-reward RL
	post-training. Specifically useful for:

	- Researchers / students who want to read the full pipeline end-to-end
	without paywalls or proprietary internals
	- Engineers studying how reward hacking is prevented mechanically
	(weight ratios in composite reward, KL configuration, length monitoring)
	- Anyone investigating why small fine-tunes plateau and what GRPO is
	designed to fix beyond imitation learning

	The connected docs (REWARD_DESIGN.md, DEMO_EXAMPLES.md, the debugging log,
	ablation tables in EVAL_RESULTS.md) are written specifically to be readable
	without prior frontier-RL context. The composite reward formula, the
	DeepSeek-R1 KL configuration choice, the sampling-temperature observation
	on base/SFT comparison — all explained from first principles.

	## Acknowledgements

	Built on:
	- [Qwen-2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) — base model
	- [verl](https://github.com/volcengine/verl) — production GRPO trainer
	- [vLLM](https://github.com/vllm-project/vllm) — fast rollout engine
	- [TRL](https://github.com/huggingface/trl) + [PEFT](https://github.com/huggingface/peft) — SFT + LoRA
	- [evalplus](https://github.com/evalplus/evalplus) — robust HumanEval+/MBPP+ evaluation
	- [DeepSeek-R1](https://arxiv.org/abs/2501.12948) — methodology reference

	Trained on the [UW-Madison CHTC](https://chtc.cs.wisc.edu/) cluster.

	---

	## Citation

	```bibtex
	@misc{maheshwari2026verifiable,
	  title  = {verifiable-rl-coder: GRPO post-training of small coding LLMs with sandboxed test-execution rewards},
	  author = {Maheshwari, Devesh},
	  year   = {2026},
	  url    = {https://github.com/Devesh-Maheshwari/verifiable-rl-coder}
	}
	```