Qwen3-4B-Instruct-RL

A Qwen3-4B model fine-tuned with GRPO reinforcement learning on tau2-bench tasks.

Overview

This model was trained using Group Relative Policy Optimization (GRPO) on the tau2-bench telecom domain. It builds on the SFT checkpoint and was trained for 95 steps with sparse binary rewards from task completion.

Training Details

Base Model: Jarrodbarnes/Qwen3-4B-Instruct-SFT
Training Framework: slime with GRPO
RL Algorithm: GRPO with KL penalty anchored to base Qwen3-4B-Instruct
Training Steps: 95 (100 rollouts, 4 samples per prompt)
Learning Rate: 1e-6
Reward Source: tau2-bench task completion (binary: 0 or 1)

Training Observations

Sparse reward signal (~2% task success rate)
Limited gradient flow due to uniform within-batch rewards
Model maintains tool-calling format from SFT

Output Format

The model produces tool calls in inline JSON format:

<thinking>Analysis of the customer issue...</thinking>
{"name": "tool_name", "arguments": {"param": "value"}}

Intended Use

Research on RL for tool-calling agents
Comparison baseline for tau2-bench experiments
Study of sparse reward RL training dynamics

Limitations

Limited RL improvement due to sparse rewards
Optimized for tau2-bench telecom domain only
Experimental checkpoint for research purposes

Related Models

Jarrodbarnes/Qwen3-4B-Instruct-SFT - SFT base model

Citation

@article{yao2024tau2bench,
  title={tau2-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
  author={Yao, Shunyu and others},
  journal={arXiv preprint arXiv:2506.07982},
  year={2024}
}

Downloads last month: 11

Safetensors

Model size

4B params

Tensor type

BF16

Model tree for Jarrodbarnes/Qwen3-4B-Instruct-RL

Base model

Jarrodbarnes/Qwen3-4B-Instruct-SFT

Finetuned

(1)

this model

Jarrodbarnes
/

Qwen3-4B-Instruct-RL