used GRPO to fine-tune Qwen/Qwen2.5-Math-1.5B on the GSM8K training set
- used the REINFORCE loss with a baseline
- used per-sample length normalization for loss aggregation (sketched below)
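A minimal sketch of how these two choices combine, assuming rewards are laid out group-contiguously as shape `(n_prompts * group_size,)`, per-token log-probs are `(batch, seq_len)`, and a response mask marks generated tokens; this is an illustration under those assumptions, not the exact training code:

```python
import torch

def compute_group_advantages(rewards, group_size, eps=1e-6, use_std_normalization=True):
    # Group-relative baseline: subtract each prompt group's mean reward,
    # optionally dividing by the group's std (matches use_std_normalization below).
    grouped = rewards.view(-1, group_size)             # (n_prompts, group_size)
    adv = grouped - grouped.mean(dim=1, keepdim=True)
    if use_std_normalization:
        adv = adv / (grouped.std(dim=1, keepdim=True) + eps)
    return adv.view(-1)                                # (n_prompts * group_size,)

def reinforce_loss(token_log_probs, advantages, response_mask):
    # Per-sample length normalization: average each sample's token losses
    # over its own response length, then average across the batch.
    per_token = -advantages.unsqueeze(1) * token_log_probs        # (batch, seq_len)
    per_sample = (per_token * response_mask).sum(1) / response_mask.sum(1)
    return per_sample.mean()
```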
hyperparameters:
```python
import torch
from torch.optim.lr_scheduler import LinearLR

lr = 3e-5
n_grpo_steps = 100
advantage_eps = 1e-6
rollout_batch_size = 256
group_size = 8
gradient_accumulation_steps = 128
epochs_per_rollout_batch = 1
train_batch_size = 256  # on-policy since same as rollout_batch_size
use_std_normalization = True

optim = torch.optim.AdamW(
    hf_model.parameters(), lr=lr, weight_decay=0.0, betas=(0.9, 0.95)
)
scheduler = LinearLR(optim, start_factor=1.0, end_factor=0.1, total_iters=n_grpo_steps)
```
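These settings imply micro-batches of `train_batch_size // gradient_accumulation_steps = 2` samples per backward pass. A sketch of one on-policy GRPO step under this configuration; `sample_prompts`, `sample_rollouts`, `compute_rewards`, and `iterate_microbatches` are hypothetical helpers, and `compute_group_advantages` / `reinforce_loss` are the functions sketched earlier:

```python
micro_batch_size = train_batch_size // gradient_accumulation_steps  # 256 // 128 = 2

for step in range(n_grpo_steps):
    # Roll out group_size completions per question: 32 questions * 8 = 256 samples.
    prompts = sample_prompts(rollout_batch_size // group_size)
    rollouts = sample_rollouts(prompts, group_size)
    rewards = compute_rewards(rollouts)
    advantages = compute_group_advantages(
        rewards, group_size, advantage_eps, use_std_normalization
    )

    # One optimizer step per rollout batch (epochs_per_rollout_batch = 1),
    # accumulated over 128 micro-batches of 2 samples each.
    optim.zero_grad()
    for mb in iterate_microbatches(rollouts, advantages, micro_batch_size):
        loss = reinforce_loss(mb.log_probs, mb.advantages, mb.response_mask)
        (loss / gradient_accumulation_steps).backward()
    optim.step()
    scheduler.step()
```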
prompt template:
```
A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.
User: {question}
Assistant: <think>
```
the reward includes a format check for proper use of the <think></think> and <answer></answer> tags (sketched below)
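The card doesn't include the reward code; one plausible version of the format check, given that the prompt already opens the `<think>` tag, is:

```python
import re

# The prompt ends with "Assistant: <think>", so a well-formatted response
# closes the think block and wraps its final answer in <answer> tags.
FORMAT_RE = re.compile(r"</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response follows the expected tag structure, else 0.0."""
    return 1.0 if FORMAT_RE.search(response) else 0.0
```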
performance on the GSM8K test set:
- correct format: 1172/1319 (88.9%)
- correct reward: 966/1319 (73.2%)