DSL Debug 7B — RL-Only Step 30

Qwen2.5-7B-Instruct trained with GRPO (Group Relative Policy Optimization) on the DSL Debug benchmark, going directly to RL with no intermediate SFT stage.

Training

  • Method: GRPO with multi-turn tool use (verl 0.7 + sglang 0.5.6)
  • Base model: Qwen2.5-7B-Instruct (no SFT warmup)
  • Steps: 30 (batch size 512, 8 rollouts per prompt)
  • LR: 1e-5 cosine
  • Reward: Binary (1.0 if submitted code matches expected output, 0.0 otherwise)
  • Hardware: 2x A100-SXM4-80GB
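The binary reward and group-relative credit assignment described above can be sketched as follows. This is a minimal illustration, not the training code: the exact-match rule (here, whitespace-stripped string equality) and the std-normalization details are assumptions.

```python
import numpy as np

def reward(submitted: str, expected: str) -> float:
    """Binary reward: 1.0 if the submitted code's output matches the
    expected output, else 0.0. Exact match after stripping whitespace
    is an assumption, not taken from the benchmark harness."""
    return 1.0 if submitted.strip() == expected.strip() else 0.0

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and std of its own group (the 8 rollouts for the same prompt),
    so no learned value function is needed."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < eps:  # all rollouts tied (all pass or all fail): no signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# e.g. 2 of 8 rollouts pass verification for one prompt
advs = grpo_advantages([reward("42", "42"), reward("41", "42"),
                        0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
```

Note the degenerate case: a prompt where every rollout fails (or every rollout passes) contributes zero advantage, which is why binary-reward GRPO learns fastest from prompts of intermediate difficulty.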

Results (held-out test, one-shot)

Split                  Base Model   This Model
Standard (481)            50.5%        78.8%
Nonlocal (200)            12.0%        54.0%
Intent-Mismatch (177)      0.6%        14.7%

Alignment Tax (general capabilities)

Benchmark    Base    This Model
MMLU         74.6%      74.7%
GSM8K        84.9%      84.4%
HumanEval    65.9%      59.1%

Usage

from huggingface_hub import snapshot_download
snapshot_download("andrewlngdn/dsl-debug-7b-rl-only-step30",
    local_dir="/workspace/models/rl_only_step30")
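Once downloaded, the checkpoint loads like any standard Hugging Face causal LM. A sketch assuming the usual transformers chat-template API; the prompt content is a placeholder, since the benchmark's real prompts (and tool schema) are not shown here:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/workspace/models/rl_only_step30"  # local_dir from snapshot_download
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype="bfloat16", device_map="auto"
)

# Placeholder prompt; real benchmark prompts include the DSL program and tools.
messages = [{"role": "user", "content": "Debug this DSL program: ..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```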

See the collection for all models including the stronger SFT→RL variant.

Format: Safetensors · 8B params · BF16
