DSL Debug 7B — RL-Only Step 30

Qwen2.5-7B-Instruct trained with GRPO (Group Relative Policy Optimization) on the DSL Debug benchmark, going directly to RL with no intermediate SFT stage.

Training

  • Method: GRPO with multi-turn tool use (verl 0.7 + sglang 0.5.6)
  • Base model: Qwen2.5-7B-Instruct (no SFT warmup)
  • Steps: 30 (batch size 512, 8 rollouts per prompt)
  • LR: 1e-5 cosine
  • Reward: Binary (1.0 if submitted code matches expected output, 0.0 otherwise)
  • Hardware: 2x A100-SXM4-80GB
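The binary reward and group-relative credit assignment described above can be sketched as follows. This is a minimal illustration, not the training code: the exact-match rule (here, whitespace-stripped string equality) and the std-normalization details are assumptions.

```python
import numpy as np

def reward(submitted: str, expected: str) -> float:
    """Binary reward: 1.0 if the submitted code's output matches the
    expected output, else 0.0. Exact match after stripping whitespace
    is an assumption, not taken from the benchmark harness."""
    return 1.0 if submitted.strip() == expected.strip() else 0.0

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and std of its own group (the 8 rollouts for the same prompt),
    so no learned value function is needed."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < eps:  # all rollouts tied (all pass or all fail): no signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# e.g. 2 of 8 rollouts pass verification for one prompt
advs = grpo_advantages([reward("42", "42"), reward("41", "42"),
                        0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
```

Note the degenerate case: a prompt where every rollout fails (or every rollout passes) contributes zero advantage, which is why binary-reward GRPO learns fastest from prompts of intermediate difficulty.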

Results (held-out test, one-shot)

Split                  Base Model   This Model
Standard (481)            50.5%        78.8%
Nonlocal (200)            12.0%        54.0%
Intent-Mismatch (177)      0.6%        14.7%

Alignment Tax (general capabilities)

Benchmark    Base    This Model
MMLU         74.6%      74.7%
GSM8K        84.9%      84.4%
HumanEval    65.9%      59.1%

Usage

from huggingface_hub import snapshot_download
snapshot_download("andrewlngdn/dsl-debug-7b-rl-only-step30",
    local_dir="/workspace/models/rl_only_step30")
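Once downloaded, the checkpoint loads like any standard Hugging Face causal LM. A sketch assuming the usual transformers chat-template API; the prompt content is a placeholder, since the benchmark's real prompts (and tool schema) are not shown here:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/workspace/models/rl_only_step30"  # local_dir from snapshot_download
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype="bfloat16", device_map="auto"
)

# Placeholder prompt; real benchmark prompts include the DSL program and tools.
messages = [{"role": "user", "content": "Debug this DSL program: ..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```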

See the collection for all models including the stronger SFT→RL variant.

Format: Safetensors · 8B params · BF16
