Qwen3-LABD-GRPO Series (Self-Correcting Coding Agents)

This model card covers the series of models trained for the Loop-Driven Agentic Behavior Distillation (LABD) graduation project. These models are specifically fine-tuned to function as autonomous coding agents capable of iterative self-correction using execution feedback.

Model Summary

Qwen3-4B-LABD-GRPO is part of a scaling sweep (0.6B to 8B parameters) designed to bridge the "Reasoning Cliff" in Small Language Models (SLMs). Standard models often fail to recover after an initial incorrect code generation; this model has been trained to treat execution errors as signals for repair.

Key Capabilities

  • Closed-Loop Reasoning: Structures output using <think>, <execute>, and <feedback> tags.
  • Autonomous Repair: Analyzes Tracebacks and logical assertion failures to generate revised code.
  • Scaling Efficiency: Leverages pre-learned agentic structures to improve recovery rates.
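
A minimal sketch of how a caller might split a tagged response into its sections. Only the tag names come from this card; the assumption that each section is closed with a matching `</tag>`, and the helper name `parse_turn`, are illustrative and not part of the released code.

```python
import re
from typing import Dict, List

# Tag names come from the model card; the exact output grammar is an assumption.
TAGS = ("think", "execute", "feedback")

def parse_turn(text: str) -> Dict[str, List[str]]:
    """Collect the contents of each known tag, in order of appearance."""
    sections: Dict[str, List[str]] = {}
    for tag in TAGS:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        sections[tag] = [match.strip() for match in pattern.findall(text)]
    return sections

sample = (
    "<think>Sum the list with the built-in.</think>\n"
    "<execute>print(sum([1, 2, 3]))</execute>\n"
    "<feedback>6</feedback>"
)
parsed = parse_turn(sample)
print(parsed["execute"])
```

Returning a list per tag keeps multi-iteration responses intact, since a single response may contain several `<execute>` and `<feedback>` pairs.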

Training Procedure

The series was trained with a two-stage post-training recipe:

Stage 1: Loop-Driven Agentic Behavior Distillation (LABD)

We initialized the model with the structure of self-correction. Using Failure-Induced Trajectory Generation, we distilled trajectories where a weak student model failed, and a strong teacher repaired the code. This taught the model how to behave in a loop (Plan → Execute → Observe → Recover) rather than just what the final answer should be.
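
The Plan → Execute → Observe → Recover loop can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: `repair_loop`, `generate`, and `execute` are hypothetical names, the prompt format is invented for the example, and the default budget of three iterations is only a guess based on the "Iter-3" metrics reported below. A stub stands in for the model.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: `generate` maps a prompt to code, `execute` runs the
# code and returns (passed, observation). Neither name comes from the project.
Generate = Callable[[str], str]
Execute = Callable[[str], Tuple[bool, str]]

def repair_loop(task: str, generate: Generate, execute: Execute,
                max_iters: int = 3) -> Tuple[bool, List[Tuple[str, str]]]:
    """Retry until the code passes or the iteration budget is spent."""
    trajectory: List[Tuple[str, str]] = []
    prompt = task
    for _ in range(max_iters):
        code = generate(prompt)                # Plan: propose a solution
        passed, observation = execute(code)    # Execute, then Observe
        trajectory.append((code, observation))
        if passed:
            return True, trajectory
        # Recover: feed the failure back into the next prompt
        prompt = f"{task}\n\nPrevious attempt:\n{code}\n\nFeedback:\n{observation}"
    return False, trajectory

# Stub policy: wrong on the first call, corrected once feedback is in the prompt.
def stub_generate(prompt: str) -> str:
    if "Feedback" in prompt:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"

def stub_execute(code: str) -> Tuple[bool, str]:
    namespace: dict = {}
    exec(code, namespace)  # acceptable for a trusted stub; isolate real model output
    ok = namespace["add"](2, 3) == 5
    return ok, ("ok" if ok else "AssertionError: add(2, 3) != 5")

solved, steps = repair_loop("Write add(a, b).", stub_generate, stub_execute)
print(solved, len(steps))
```

The recorded `trajectory` of (code, observation) pairs has the same shape as a failure-then-repair trace, which is the kind of data Stage 1 distills from.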

Stage 2: Group Relative Policy Optimization (GRPO)

To ground the behavioral structure in functional correctness, we applied GRPO. Unlike PPO-based RLHF, which relies on a learned value model, GRPO computes each output's advantage by normalizing its reward against the other outputs sampled for the same prompt.

  • Verifiable Rewards: The model received a reward of +3.0 for passing unit tests, a penalty of -1.0 for malformed code, and a penalty of -2.0 for hallucinated feedback.
  • Optimization: Training was performed using LoRA on a single GPU (NVIDIA L4 or L40S).
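
To make the reward scheme and the group normalization concrete, here is a small sketch. The three reward magnitudes are taken from the list above; everything else is an assumption for illustration: that the terms add up when more than one applies, the function names `score` and `group_advantages`, and the use of the population standard deviation with a small `eps`. The actual training code may differ.

```python
from statistics import mean, pstdev
from typing import List

# Reward magnitudes are taken from the model card; how they combine is assumed.
R_PASS, R_MALFORMED, R_HALLUCINATED = 3.0, -1.0, -2.0

def score(passed_tests: bool, malformed: bool, hallucinated_feedback: bool) -> float:
    """Sum the terms that apply to one sampled completion (additive by assumption)."""
    reward = 0.0
    if passed_tests:
        reward += R_PASS
    if malformed:
        reward += R_MALFORMED
    if hallucinated_feedback:
        reward += R_HALLUCINATED
    return reward

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r - mean) / (std + eps) over one prompt's samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions sampled for the same prompt.
group = [
    score(True, False, False),   # passes the unit tests
    score(False, True, False),   # malformed code
    score(False, False, True),   # hallucinated feedback
    score(True, False, False),   # passes the unit tests
]
advantages = group_advantages(group)
print(group)
print([round(a, 3) for a in advantages])
```

Because the baseline is the group mean, a passing completion is only reinforced relative to its siblings: if every sample in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.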

Intended Use

  • Agentic Workflows: Best suited for environments where the model can interact with a Python interpreter.
  • Research: Ideal for studying self-correction, reinforcement learning, and the scaling laws of agentic behavior.
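
For the interpreter side of an agentic workflow, one possible way to obtain execution feedback is to run each candidate in a fresh Python process and capture its output or traceback. This is a sketch, not the project's harness: `run_python` and its timeout default are hypothetical, and a subprocess is not a security sandbox, so untrusted model output should additionally be isolated (for example in a container).

```python
import subprocess
import sys
from typing import Tuple

def run_python(code: str, timeout_s: float = 5.0) -> Tuple[bool, str]:
    """Run code in a fresh interpreter; return (succeeded, stdout or stderr text)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"TimeoutExpired: no result after {timeout_s}s"
    if result.returncode == 0:
        return True, result.stdout
    return False, result.stderr  # traceback text the model can read as feedback

ok, out = run_python("print(1 + 1)")
bad, err = run_python("assert 1 + 1 == 3, 'math is broken'")
print(ok, out.strip())
print(bad, err.strip().splitlines()[-1])
```

The returned text is what would be handed back to the model as feedback, covering both tracebacks and failed assertions.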

Limitations and Bias

  • Capacity Threshold: Models below 4B parameters may show the correct "behavior" (trying to fix code) but may lack the raw algorithmic knowledge to succeed in the final repair.
  • Python-Centric: Optimization was focused on Python; performance in other languages is not guaranteed.

Performance: Qwen3-4B

The 4B model marks the "Phase Transition" where agentic loops become a net positive over single-pass base models.

  • MBPP Iter-3: 72.40%
  • HumanEval Iter-3: 82.32% (+20.3 percentage points absolute over base Qwen3-4B)
  • Observation: At 4B parameters and above, the model has sufficient representational capacity to fully exploit the LABD training.

Citation

@article{eldegwy2026labd,
  title={Loop-Driven Agentic Behavior Distillation for Self-Correcting Code Generation},
  author={Moaz Eldegwy},
  year={2026},
  journal={Graduation Project: Self-Correction Agent in Coding}
}

Model size: 4B params (Safetensors, tensor type BF16)