---
license: mit
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- debugging
- tool-use
- multi-turn
- reinforcement-learning
- grpo
datasets:
- custom
language:
- en
pipeline_tag: text-generation
---
# DSL Debug 7B — RL-Only Step 30

Qwen2.5-7B-Instruct trained with GRPO (Group Relative Policy Optimization) directly from the base model, with no SFT warmup.

Blog post: Multi-Turn RL for Code Debugging

Code + environment: github.com/AndrewLngdn/dsl-debug
## Training
- Method: GRPO with multi-turn tool use (verl 0.7 + sglang 0.5.6)
- Base model: Qwen2.5-7B-Instruct (no SFT warmup)
- Steps: 30 (batch size 512, 8 rollouts per prompt)
- LR: 1e-5 with cosine decay
- Reward: Binary (1.0 if submitted code matches expected output, 0.0 otherwise)
- Hardware: 2x A100-SXM4-80GB
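The binary reward above is what makes GRPO workable without a value model: each group of 8 rollouts for the same prompt is normalized against its own mean and standard deviation, so a rollout only gets a positive advantage if it outperforms its siblings. A minimal sketch of that group-relative normalization (illustrative only, not the verl training code; the function name and epsilon are assumptions):

```python
# Sketch of GRPO's group-relative advantage with a binary reward.
# Illustrative only; not taken from the training code in this repo.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's mean/std.

    With the binary reward used here (1.0 if the submitted code matches
    the expected output, 0.0 otherwise), a group where every rollout
    succeeds, or every rollout fails, yields zero advantage for all of
    them: there is no within-group signal to learn from.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# 8 rollouts per prompt, as in training; here 3 of 8 solved the bug,
# so the 3 successes get positive advantage and the 5 failures negative.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
```

This group-relative scheme is also why prompts that the base model solves 0/8 times (like the Intent-Mismatch split at 0.6%) are the hardest to improve: with no successful rollout in the group, every advantage is zero and no gradient flows.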
## Results (held-out test, one-shot)
| Split (# examples) | Base Model | This Model |
|---|---|---|
| Standard (481) | 50.5% | 78.8% |
| Nonlocal (200) | 12.0% | 54.0% |
| Intent-Mismatch (177) | 0.6% | 14.7% |
## Alignment Tax

| Benchmark | Base Model | This Model |
|---|---|---|
| MMLU (5-shot) | 74.6% | 74.7% |
| GSM8K (8-shot) | 84.9% | 84.4% |
| HumanEval (0-shot) | 65.9% | 59.1% |
## Usage

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "andrewlngdn/dsl-debug-7b-rl-only-step30",
    local_dir="/workspace/models/rl_only_step30",
)
```
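For inference, the model expects Qwen2.5-Instruct's ChatML chat format; in practice you would load the downloaded checkpoint with `transformers` and call `tokenizer.apply_chat_template`, but the raw prompt layout can be sketched by hand (the system prompt below is an illustrative assumption, not from this repo):

```python
# Hand-rolled sketch of the ChatML format used by Qwen2.5-Instruct models.
# In real use, prefer tokenizer.apply_chat_template; this just shows the
# layout without needing to download the checkpoint.
def build_chatml_prompt(user_msg, system_msg="You are a helpful assistant."):
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("Debug this function: ...")
```

The trailing `<|im_start|>assistant\n` is the generation prompt: the model continues from there with its reply.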
## Related Models
| Model | Repo |
|---|---|
| SFT then RL step 35 (best) | andrewlngdn/dsl-debug-7b-sft-rl |
| SFT step 100 | andrewlngdn/dsl-debug-7b-sft-step100 |