used GRPO to fine-tune Qwen/Qwen2.5-Math-1.5B on the GSM8K training set
- used the REINFORCE loss with a baseline
- used per-sample length normalization for loss aggregation (sketched below)
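A minimal sketch of how these two choices combine, assuming rewards are laid out group-contiguously as shape `(n_prompts * group_size,)`, per-token log-probs are `(batch, seq_len)`, and a response mask marks generated tokens; this is an illustration under those assumptions, not the exact training code:

```python
import torch

def compute_group_advantages(rewards, group_size, eps=1e-6, use_std_normalization=True):
    # Group-relative baseline: subtract each prompt group's mean reward,
    # optionally dividing by the group's std (matches use_std_normalization below).
    grouped = rewards.view(-1, group_size)             # (n_prompts, group_size)
    adv = grouped - grouped.mean(dim=1, keepdim=True)
    if use_std_normalization:
        adv = adv / (grouped.std(dim=1, keepdim=True) + eps)
    return adv.view(-1)                                # (n_prompts * group_size,)

def reinforce_loss(token_log_probs, advantages, response_mask):
    # Per-sample length normalization: average each sample's token losses
    # over its own response length, then average across the batch.
    per_token = -advantages.unsqueeze(1) * token_log_probs        # (batch, seq_len)
    per_sample = (per_token * response_mask).sum(1) / response_mask.sum(1)
    return per_sample.mean()
```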
hyperparameters:
```python
import torch
from torch.optim.lr_scheduler import LinearLR

lr = 3e-5
n_grpo_steps = 100
advantage_eps = 1e-6
rollout_batch_size = 256
group_size = 8
gradient_accumulation_steps = 128
epochs_per_rollout_batch = 1
train_batch_size = 256  # on-policy since same as rollout_batch_size
use_std_normalization = True

optim = torch.optim.AdamW(
    hf_model.parameters(), lr=lr, weight_decay=0.0, betas=(0.9, 0.95)
)
scheduler = LinearLR(optim, start_factor=1.0, end_factor=0.1, total_iters=n_grpo_steps)
```
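These settings imply micro-batches of `train_batch_size // gradient_accumulation_steps = 2` samples per backward pass. A sketch of one on-policy GRPO step under this configuration; `sample_prompts`, `sample_rollouts`, `compute_rewards`, and `iterate_microbatches` are hypothetical helpers, and `compute_group_advantages` / `reinforce_loss` are the functions sketched earlier:

```python
micro_batch_size = train_batch_size // gradient_accumulation_steps  # 256 // 128 = 2

for step in range(n_grpo_steps):
    # Roll out group_size completions per question: 32 questions * 8 = 256 samples.
    prompts = sample_prompts(rollout_batch_size // group_size)
    rollouts = sample_rollouts(prompts, group_size)
    rewards = compute_rewards(rollouts)
    advantages = compute_group_advantages(
        rewards, group_size, advantage_eps, use_std_normalization
    )

    # One optimizer step per rollout batch (epochs_per_rollout_batch = 1),
    # accumulated over 128 micro-batches of 2 samples each.
    optim.zero_grad()
    for mb in iterate_microbatches(rollouts, advantages, micro_batch_size):
        loss = reinforce_loss(mb.log_probs, mb.advantages, mb.response_mask)
        (loss / gradient_accumulation_steps).backward()
    optim.step()
    scheduler.step()
```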
prompt template:
```
A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.
User: {question}
Assistant: <think>
```
the reward includes a format check for proper use of the <think></think> and <answer></answer> tags (sketched below)
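The card doesn't include the reward code; one plausible version of the format check, given that the prompt already opens the `<think>` tag, is:

```python
import re

# The prompt ends with "Assistant: <think>", so a well-formatted response
# closes the think block and wraps its final answer in <answer> tags.
FORMAT_RE = re.compile(r"</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response follows the expected tag structure, else 0.0."""
    return 1.0 if FORMAT_RE.search(response) else 0.0
```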
performance on the GSM8K test set:
- correct format: 1172/1319 (88.9%)
- correct reward: 966/1319 (73.2%)