MaxRL + Curriculum, Qwen3-1.7B-Base, 200-step retrain

This model is the final-step (step 200) policy from a fresh-from-base re-training of the Condition D ablation in Cao 2026, "Disentangling MaxRL and Curriculum Learning for LLM Post-Training".

It corresponds to the entry "(D) MaxRL+Curriculum (200 steps)" in Table 1 of that paper.

Why a new repo?

An earlier release at Sean13/maxrl_curriculum_Qwen3-1.7B was an intermediate (step-150) checkpoint. When reloaded through our verl rollout pipeline, it evaluated substantially below the base model on both AIME-25 and MATH-500 (AIME mean@32 = 0.004 vs. base 0.037; MATH mean@32 = 0.124 vs. base 0.566), which suggests either upload-time corruption or a chat-template / format mismatch. We re-trained from the base model with the same hyperparameters and publish the result here. The original checkpoint is preserved at the old URL for archival reference; use this repo for any downstream evaluation.

Training setup


| Setting | Value |
|---|---|
| Base model | `Qwen/Qwen3-1.7B-Base` |
| RL framework | verl 0.4.0.dev (FSDP backend) |
| Algorithm | MaxRL (Tajwar et al. 2026, arXiv:2602.02710) |
| Curriculum | Gaussian-weighted sampler, $\mu=0.5$, $\sigma=0.15$, on Polaris-53K with `pass_rate` from the dataset's built-in `difficulty` column |
| Train data | `POLARIS-Project/Polaris-Dataset-53K` (53 291 problems) |
| Eval data | `math-ai/aime25`, `HuggingFaceH4/MATH-500` |
| Total steps | 200 (≈ 1 epoch over the curriculum, batch 64 × 16 rollouts) |
| Learning rate | 1e-6 |
| Hardware | 1 × node, 8 × NVIDIA H200, MIT ORCD `ou_sloan_gpu` |
| Wall clock | 6 h 31 min (incl. val-before-train + 4 mid-train val + final val) |
| Sampling at val | $T=0.6$, top-p $=0.95$, top-k off, $n=32$ |
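
The curriculum row above describes a Gaussian-weighted sampler over per-problem pass rates. A minimal sketch of that idea follows, assuming each problem's sampling probability is proportional to $\exp(-(p-\mu)^2 / (2\sigma^2))$ and that a batch is drawn without replacement. The function names, the toy pass rates, and the without-replacement choice are illustrative assumptions, not the training code used for this run.

```python
import numpy as np

def gaussian_curriculum_weights(pass_rate, mu=0.5, sigma=0.15):
    """Normalized Gaussian weights centered on pass rate `mu` (hypothetical helper)."""
    pass_rate = np.asarray(pass_rate, dtype=np.float64)
    weights = np.exp(-0.5 * ((pass_rate - mu) / sigma) ** 2)
    return weights / weights.sum()

def sample_batch(pass_rate, batch_size, rng, mu=0.5, sigma=0.15):
    """Draw distinct problem indices with probability proportional to the weights."""
    probs = gaussian_curriculum_weights(pass_rate, mu, sigma)
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)

# Toy pass rates (assumed values, not taken from Polaris-53K).
pass_rate = np.array([0.0, 0.125, 0.5, 0.875, 1.0])
probs = gaussian_curriculum_weights(pass_rate)
rng = np.random.default_rng(0)
batch = sample_batch(pass_rate, batch_size=2, rng=rng)
```

With these settings, problems near a 50% pass rate dominate the batch, while problems that are almost always solved or almost never solved are sampled rarely but not excluded.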

Evaluation

Validation metrics over the 200-step run (every 50 steps + final):

| Step | MATH-500 mean@32 | MATH-500 best@32 | AIME-25 mean@32 | AIME-25 best@32 |
|---|---|---|---|---|
| 0 (base) | 0.566 | 0.884 | 0.037 | 0.287 |
| 50 | 0.590 | 0.899 | 0.035 | 0.257 |
| 100 | 0.624 | 0.886 | 0.043 | 0.223 |
| 150 | 0.645 | 0.894 | 0.036 | 0.248 |
| 200 | 0.635 | 0.892 | 0.032 | 0.170 |

For comparison, the GRPO+Curriculum (Condition C) run at step 208 reaches MATH mean@32 = 0.665 / best@32 = 0.891 / AIME mean@32 = 0.054 / best@32 = 0.264.
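
For readers unfamiliar with the metric names, the sketch below shows one common reading of mean@k and best@k over a binary correctness matrix (problems × samples). This is an assumption about the definitions, not the verl implementation. The AIME-25 best@32 values reported here are not multiples of 1/30, which suggests the pipeline uses a resampled or smoothed estimator rather than the plain "any sample correct" rule shown below.

```python
import numpy as np

def mean_at_k(correct):
    """Average per-sample accuracy, averaged over problems."""
    correct = np.asarray(correct, dtype=np.float64)
    return float(correct.mean(axis=1).mean())

def best_at_k(correct):
    """Fraction of problems with at least one correct sample (assumed definition)."""
    correct = np.asarray(correct, dtype=bool)
    return float(correct.any(axis=1).mean())

# Toy matrix: 3 problems x 4 samples, 1 = correct (assumed values).
toy = np.array([
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
])
toy_mean = mean_at_k(toy)
toy_best = best_at_k(toy)
```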

Honest caveats

The paper's earlier intermediate (step-150) D number for AIME best@32 (0.323) does not reproduce here. We attribute that to the warm-start checkpoint format issue noted above and treat the 0.323 number as retracted.
This is a single-seed, single-hardware run. AIME-25 has only 30 problems, so variance is large even with 32 samples per problem. Differences smaller than 0.05 in AIME best@32 should not be interpreted as a real algorithmic effect.
The MaxRL paper trains for 1000 steps; our compute budget allowed 200. The diversity advantages of MaxRL reported in that paper may emerge later than the horizon studied here.
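
To give a rough sense of scale for the single-seed caveat above, the sketch below treats each problem as an independent Bernoulli outcome and computes the binomial standard error of a benchmark-level rate. This is a simplifying assumption: it ignores seed-to-seed and sample-level variance, and the input rates (0.25 and 0.89) are round illustrative values near the table entries, not reported results.

```python
import math

def binomial_standard_error(rate, n_problems):
    """Standard error of a proportion under an independent-Bernoulli model."""
    return math.sqrt(rate * (1.0 - rate) / n_problems)

# AIME-25: 30 problems, best@32 near 0.25 (illustrative value).
se_aime = binomial_standard_error(0.25, 30)
# MATH-500: 500 problems, best@32 near 0.89 (illustrative value).
se_math = binomial_standard_error(0.89, 500)

print(f"AIME-25 standard error:  {se_aime:.3f}")
print(f"MATH-500 standard error: {se_math:.3f}")
```

Under this model the AIME-25 standard error is about 0.08, which is larger than the 0.05 difference mentioned above, while the MATH-500 standard error is closer to 0.014.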

Inference

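from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

mid = "Sean13/maxrl_curriculum_Qwen3-1.7B-200step"
tok = AutoTokenizer.from_pretrained(mid, trust_remote_code=True)
# Load the weights in bfloat16 and move the model to the GPU.
m = AutoModelForCausalLM.from_pretrained(mid, torch_dtype=torch.bfloat16,
                                         trust_remote_code=True).cuda()

# Plain-text prompt that asks for the final answer inside \boxed{}.
prompt = ("Solve the equation: 2x + 3 = 11.\n"
          "Please reason step by step, and put your final answer within \\boxed{}.")
inputs = tok(prompt, return_tensors="pt").to("cuda")
# Sampling mirrors validation: T=0.6, top-p=0.95; top_k=0 disables top-k filtering.
out = m.generate(**inputs, max_new_tokens=512, do_sample=True,
                 temperature=0.6, top_p=0.95, top_k=0)
# The decoded string contains the prompt followed by the completion.
print(tok.decode(out[0], skip_special_tokens=True))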
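
Because the prompt asks for the final answer inside `\boxed{}`, downstream scripts usually need to pull that answer out of the generated text. The helper below is a minimal sketch of one way to do it, using brace matching so nested LaTeX such as `\frac{1}{2}` survives. The name `extract_last_boxed` is hypothetical and this is not the answer parser used in the verl evaluation.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1          # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None        # braces never closed, e.g. a truncated generation

simple = extract_last_boxed("So x = 4. The answer is \\boxed{4}.")
nested = extract_last_boxed("Thus \\boxed{\\frac{1}{2}}")
missing = extract_last_boxed("no final answer here")
truncated = extract_last_boxed("The answer is \\boxed{3")
```

Apply the helper to the completion only. The decoded output also contains the prompt, whose literal `\boxed{}` would otherwise be matched when the model produces no boxed answer.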

Citation

If you use this model, please cite both Tajwar et al. (the MaxRL objective) and our ablation paper.

Tajwar et al. 2026. Maximum Likelihood Reinforcement Learning. arXiv:2602.02710.