Libraries

How to use moazeldegwy/Qwen3-4B-LABD-GRPO with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="moazeldegwy/Qwen3-4B-LABD-GRPO")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("moazeldegwy/Qwen3-4B-LABD-GRPO")
model = AutoModelForCausalLM.from_pretrained("moazeldegwy/Qwen3-4B-LABD-GRPO")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use moazeldegwy/Qwen3-4B-LABD-GRPO with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "moazeldegwy/Qwen3-4B-LABD-GRPO"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moazeldegwy/Qwen3-4B-LABD-GRPO",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
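
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical, and the response is read in the usual OpenAI chat-completions shape.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="moazeldegwy/Qwen3-4B-LABD-GRPO"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Building the request does not contact the server; chat() does.
request = build_chat_request("What is the capital of France?")
```

Calling `chat("What is the capital of France?")` requires the server to be running. For the SGLang server described later, pass `base_url="http://localhost:30000"`.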

Use Docker

docker model run hf.co/moazeldegwy/Qwen3-4B-LABD-GRPO

SGLang

How to use moazeldegwy/Qwen3-4B-LABD-GRPO with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "moazeldegwy/Qwen3-4B-LABD-GRPO" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moazeldegwy/Qwen3-4B-LABD-GRPO",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "moazeldegwy/Qwen3-4B-LABD-GRPO" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moazeldegwy/Qwen3-4B-LABD-GRPO",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use moazeldegwy/Qwen3-4B-LABD-GRPO with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for moazeldegwy/Qwen3-4B-LABD-GRPO to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for moazeldegwy/Qwen3-4B-LABD-GRPO to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for moazeldegwy/Qwen3-4B-LABD-GRPO to start chatting

Load model with FastModel

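# Install Unsloth first (run this in a shell, not in Python): pip install unsloth
from unsloth import FastModel

# Load the model and its tokenizer; max_seq_length caps the context length
model, tokenizer = FastModel.from_pretrained(
    model_name="moazeldegwy/Qwen3-4B-LABD-GRPO",
    max_seq_length=2048,
)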

Docker Model Runner

How to use moazeldegwy/Qwen3-4B-LABD-GRPO with Docker Model Runner:

```
docker model run hf.co/moazeldegwy/Qwen3-4B-LABD-GRPO
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Qwen3-LABD-GRPO Series (Self-Correcting Coding Agents)

This model card covers the series of models trained for the Loop-Driven Agentic Behavior Distillation (LABD) graduation project. These models are specifically fine-tuned to function as autonomous coding agents capable of iterative self-correction using execution feedback.

Model Summary

Qwen3-4B-LABD-GRPO is part of a scaling sweep (0.6B to 8B) designed to bridge the "Reasoning Cliff" in Small Language Models (SLMs). While standard models often fail to recover after an incorrect first attempt at code generation, this model is trained to treat execution errors as signals for repair.

Key Capabilities

Closed-Loop Reasoning: Structures output using <think>, <execute>, and <feedback> tags.
Autonomous Repair: Analyzes tracebacks and assertion failures to generate revised code.
Scaling Efficiency: Leverages pre-learned agentic structures to improve recovery rates.
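
As an illustration of the tagged output structure, the sketch below splits a response into its `<think>`, `<execute>`, and `<feedback>` segments. It assumes each tag is closed by a matching end tag (for example `</think>`), which this card does not state explicitly; `parse_segments` and the sample string are hypothetical.

```python
import re

# Tag names come from the capability list above. Matching closing tags
# are an assumption about the output format.
TAG_PATTERN = re.compile(r"<(think|execute|feedback)>(.*?)</\1>", re.DOTALL)

def parse_segments(text):
    """Return (tag, body) pairs in the order they appear in the text."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]

# Hand-written sample, not real model output.
sample = (
    "<think>Sum the list with the built-in.</think>\n"
    "<execute>print(sum([1, 2, 3]))</execute>\n"
    "<feedback>6</feedback>"
)
segments = parse_segments(sample)
```

A driver script could pass each `execute` body to an interpreter and compare the real output with what the model wrote in `feedback`.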

Training Procedure

The series was trained with a two-stage post-training recipe:

Stage 1: Loop-Driven Agentic Behavior Distillation (LABD)

We first taught the model the structure of self-correction. Using Failure-Induced Trajectory Generation, we distilled trajectories in which a weak student model failed and a strong teacher repaired the code. This taught the model how to behave in a loop (Plan → Execute → Observe → Recover) rather than just what the final answer should be.
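
The Plan → Execute → Observe → Recover loop can be sketched as a small driver. This is an illustrative sketch, not the project's training or inference harness: `repair_loop`, `run_python`, and the stand-in `fake_generate` are hypothetical names, and the code runs in a plain subprocess without sandboxing.

```python
import subprocess
import sys

def run_python(code, timeout=10):
    """Run code in a fresh interpreter; return (ok, combined output)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()

def repair_loop(generate, task, max_iters=3):
    """Retry generation, feeding back errors, until the code runs cleanly."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        code = generate(task, feedback)   # Plan, or Recover when feedback is set
        ok, output = run_python(code)     # Execute
        if ok:                            # Observe
            return attempt, code, output
        feedback = output                 # carry the traceback into the next try
    return None, code, feedback

# Hypothetical stand-in for the model: buggy first, fixed once it sees feedback.
def fake_generate(task, feedback):
    if feedback is None:
        return "print(undefined_name)"
    return "print(2 + 2)"

attempt, code, output = repair_loop(fake_generate, "add two and two")
```

In real use, `generate` would call the model with the task and the previous traceback, and the execution step would run inside an isolated environment.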

Stage 2: Group Relative Policy Optimization (GRPO)

To ground the behavioral structure in functional correctness, we applied GRPO. Unlike PPO-based RLHF, which typically trains a separate value model, GRPO estimates each output's advantage by normalizing its reward against the other outputs sampled for the same prompt.

Verifiable Rewards: The model received a reward of +3.0 for passing unit tests, a penalty of -1.0 for malformed code, and a penalty of -2.0 for hallucinated feedback.
Optimization: Training was performed using LoRA on a single GPU (NVIDIA L4 or L40S).
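
The reward scheme and the group normalization can be illustrated with a short sketch. The reward values are those quoted above; summing them when several conditions apply, the use of the population standard deviation, and the names `score` and `group_advantages` are assumptions made for illustration, not the project's actual reward code.

```python
from statistics import mean, pstdev

# Values quoted in the text above.
REWARD_PASS = 3.0
PENALTY_MALFORMED = -1.0
PENALTY_HALLUCINATED_FEEDBACK = -2.0

def score(passed_tests, malformed, hallucinated_feedback):
    """Sum the rewards and penalties that apply (additivity is an assumption)."""
    reward = 0.0
    if passed_tests:
        reward += REWARD_PASS
    if malformed:
        reward += PENALTY_MALFORMED
    if hallucinated_feedback:
        reward += PENALTY_HALLUCINATED_FEEDBACK
    return reward

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of outputs sampled for a prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of four sampled outputs for the same prompt.
rewards = [
    score(True, False, False),
    score(False, True, False),
    score(False, False, True),
    score(True, False, False),
]
advantages = group_advantages(rewards)
```

Outputs that pass the tests end up with positive advantages and the others with negative ones. When every output in a group earns the same reward, all advantages are zero and that group contributes no learning signal.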

Intended Use

Agentic Workflows: Best suited for environments where the model can interact with a Python interpreter.
Research: Ideal for studying self-correction, reinforcement learning, and the scaling laws of agentic behavior.

Limitations and Bias

Capacity Threshold: Models below 4B parameters may show the correct "behavior" (trying to fix code) but may lack the raw algorithmic knowledge to succeed in the final repair.
Python-Centric: Optimization was focused on Python; performance in other languages is not guaranteed.

Performance: Qwen3-4B

The 4B model marks the "Phase Transition" where agentic loops become a net positive over single-pass base models.

MBPP Iter-3: 72.40%
HumanEval Iter-3: 82.32% (+20.3 percentage points absolute gain over base Qwen3-4B)
Observation: At 4B parameters and above, models have sufficient representational capacity to fully exploit the LABD training.

Citation

@article{eldegwy2026labd,
  title={Loop-Driven Agentic Behavior Distillation for Self-Correcting Code Generation},
  author={Moaz Eldegwy},
  year={2026},
  journal={Graduation Project: Self-Correction Agent in Coding}
}


Safetensors

Model size: 4B params
Tensor type: BF16

Model tree for moazeldegwy/Qwen3-4B-LABD-GRPO

Base model: moazeldegwy/Qwen3-4B-LABD
Finetuned: this model
