Qwen3-4B-Instruct-RL

A Qwen3-4B model fine-tuned with GRPO reinforcement learning on tau2-bench tasks.

Overview

This model was trained using Group Relative Policy Optimization (GRPO) on the tau2-bench telecom domain. It builds on the SFT checkpoint and was trained for 95 steps with sparse binary rewards from task completion.

Training Details

  • Base Model: Jarrodbarnes/Qwen3-4B-Instruct-SFT
  • Training Framework: slime with GRPO
  • RL Algorithm: GRPO with KL penalty anchored to base Qwen3-4B-Instruct
  • Training Steps: 95 (100 rollouts, 4 samples per prompt)
  • Learning Rate: 1e-6
  • Reward Source: tau2-bench task completion (binary: 0 or 1)

Training Observations

  • Sparse reward signal (~2% task success rate)
  • Limited gradient flow due to uniform within-batch rewards
  • Model maintains tool-calling format from SFT

Output Format

The model produces tool calls in inline JSON format:

<thinking>Analysis of the customer issue...</thinking>
{"name": "tool_name", "arguments": {"param": "value"}}

Intended Use

  • Research on RL for tool-calling agents
  • Comparison baseline for tau2-bench experiments
  • Study of sparse reward RL training dynamics

Limitations

  • Limited RL improvement due to sparse rewards
  • Optimized for tau2-bench telecom domain only
  • Experimental checkpoint for research purposes

Related Models

Citation

@article{yao2024tau2bench,
  title={tau2-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
  author={Yao, Shunyu and others},
  journal={arXiv preprint arXiv:2506.07982},
  year={2024}
}
Downloads last month
11
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Jarrodbarnes/Qwen3-4B-Instruct-RL

Finetuned
(1)
this model

Dataset used to train Jarrodbarnes/Qwen3-4B-Instruct-RL