Qwen3-4B-Instruct-RL
A Qwen3-4B model fine-tuned with GRPO reinforcement learning on tau2-bench tasks.
Overview
This model was trained using Group Relative Policy Optimization (GRPO) on the tau2-bench telecom domain. It builds on the SFT checkpoint and was trained for 95 steps with sparse binary rewards from task completion.
Training Details
- Base Model: Jarrodbarnes/Qwen3-4B-Instruct-SFT
- Training Framework: slime with GRPO
- RL Algorithm: GRPO with KL penalty anchored to base Qwen3-4B-Instruct
- Training Steps: 95 (100 rollouts, 4 samples per prompt)
- Learning Rate: 1e-6
- Reward Source: tau2-bench task completion (binary: 0 or 1)
Training Observations
- Sparse reward signal (~2% task success rate)
- Limited gradient flow due to uniform within-batch rewards
- Model maintains tool-calling format from SFT
Output Format
The model produces tool calls in inline JSON format:
<thinking>Analysis of the customer issue...</thinking>
{"name": "tool_name", "arguments": {"param": "value"}}
Intended Use
- Research on RL for tool-calling agents
- Comparison baseline for tau2-bench experiments
- Study of sparse reward RL training dynamics
Limitations
- Limited RL improvement due to sparse rewards
- Optimized for tau2-bench telecom domain only
- Experimental checkpoint for research purposes
Related Models
- Jarrodbarnes/Qwen3-4B-Instruct-SFT - SFT base model
Citation
@article{yao2024tau2bench,
title={tau2-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
author={Yao, Shunyu and others},
journal={arXiv preprint arXiv:2506.07982},
year={2024}
}
- Downloads last month
- 11
Model tree for Jarrodbarnes/Qwen3-4B-Instruct-RL
Base model
Jarrodbarnes/Qwen3-4B-Instruct-SFT