MaxRL + Curriculum, Qwen3-1.7B-Base, 200-step retrain

This model is the final-step (step 200) policy from a fresh-from-base re-training of the Condition D ablation in Cao 2026, "Disentangling MaxRL and Curriculum Learning for LLM Post-Training".

It corresponds to the entry "(D) MaxRL+Curriculum (200 steps)" in Table 1 of that paper.

Why a new repo?

An earlier release at Sean13/maxrl_curriculum_Qwen3-1.7B was an intermediate (step-150) checkpoint. When reloaded through our verl rollout pipeline, it evaluated substantially below the base model on both AIME-25 and MATH-500 (AIME-25 mean@32 = 0.004 vs. base 0.037; MATH-500 mean@32 = 0.124 vs. base 0.566), which suggests either corruption at upload time or a chat-template / format mismatch. We re-trained from scratch with the same hyperparameters and publish the result here. The original checkpoint is preserved at the old URL for archival reference; use this repo for any downstream evaluation.

Training setup

| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-1.7B-Base |
| RL framework | verl 0.4.0.dev (FSDP backend) |
| Algorithm | MaxRL (Tajwar et al. 2026, arXiv:2602.02710) |
| Curriculum | Gaussian-weighted sampler, $\mu=0.5$, $\sigma=0.15$, on Polaris-53K with pass_rate from the dataset's built-in difficulty column |
| Train data | POLARIS-Project/Polaris-Dataset-53K (53,291 problems) |
| Eval data | math-ai/aime25, HuggingFaceH4/MATH-500 |
| Total steps | 200 (≈ 1 epoch over the curriculum, batch 64 × 16 rollouts) |
| Learning rate | 1e-6 |
| Hardware | 1 × node, 8 × NVIDIA H200, MIT ORCD ou_sloan_gpu |
| Wall clock | 6 h 31 min (incl. val-before-train + 4 mid-train val + final val) |
| Sampling at val | $T=0.6$, top-p $=0.95$, top-k off, $n=32$ |
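
The curriculum row above names a Gaussian-weighted sampler over pass_rate but does not spell out its form. The sketch below shows one plausible reading, in which each problem's sampling weight is a Gaussian density in its pass rate centred at $\mu=0.5$ with width $\sigma=0.15$. The function names, the toy pass rates, and the choice to draw with replacement are assumptions for illustration, not the training code.

```python
import math
import random


def gaussian_weights(pass_rates, mu=0.5, sigma=0.15):
    """Normalised Gaussian weights over pass rates (assumed functional form)."""
    raw = [math.exp(-0.5 * ((p - mu) / sigma) ** 2) for p in pass_rates]
    total = sum(raw)
    return [w / total for w in raw]


def sample_prompt_indices(pass_rates, batch_size=64, seed=0):
    """Draw one batch of problem indices, with replacement (an assumption)."""
    rng = random.Random(seed)
    weights = gaussian_weights(pass_rates)
    return rng.choices(range(len(pass_rates)), weights=weights, k=batch_size)


# Toy stand-in for the dataset's difficulty column: pass rates 0.00 .. 1.00.
toy_pass_rates = [i / 100 for i in range(101)]
batch = sample_prompt_indices(toy_pass_rates)
```

Under this assumed form, a problem two standard deviations from the centre (pass rate 0.2 or 0.8) receives exp(-2) ≈ 0.135 times the weight of a problem at pass rate 0.5, so sampling concentrates on problems of intermediate difficulty.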

Evaluation

Validation metrics over the 200-step run (every 50 steps + final):

| Step | MATH-500 mean@32 | MATH-500 best@32 | AIME-25 mean@32 | AIME-25 best@32 |
|---|---|---|---|---|
| 0 (base) | 0.566 | 0.884 | 0.037 | 0.287 |
| 50 | 0.590 | 0.899 | 0.035 | 0.257 |
| 100 | 0.624 | 0.886 | 0.043 | 0.223 |
| 150 | 0.645 | 0.894 | 0.036 | 0.248 |
| 200 | 0.635 | 0.892 | 0.032 | 0.170 |

For comparison, the GRPO+Curriculum (Condition C) run at step 208 reaches MATH-500 mean@32 = 0.665 and best@32 = 0.891, and AIME-25 mean@32 = 0.054 and best@32 = 0.264.
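
The card does not define mean@32 and best@32. A minimal sketch of the plain reading: mean@k is the per-sample accuracy averaged over problems, and best@k is the fraction of problems with at least one correct sample. Several reported best@32 values are not integer multiples of 1/30 or 1/500, which suggests the evaluation pipeline uses a smoothed or bootstrapped estimator, so this plain version may not reproduce the table exactly. The function names and the 0/1 input layout are assumptions.

```python
def mean_at_k(correct):
    """Average per-sample accuracy.

    `correct` is a list with one entry per problem; each entry is a list of
    0/1 outcomes, one per sampled completion.
    """
    per_problem = [sum(c) / len(c) for c in correct]
    return sum(per_problem) / len(per_problem)


def best_at_k(correct):
    """Fraction of problems where at least one sampled completion is correct."""
    return sum(1 for c in correct if any(c)) / len(correct)


# Toy example: 3 problems, 4 samples each.
toy = [[1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
toy_mean = mean_at_k(toy)
toy_best = best_at_k(toy)
```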

Honest caveats

  1. The paper's earlier intermediate (step-150) D number for AIME best@32 (0.323) does not reproduce here. We attribute that to the warm-start checkpoint format issue noted above and treat the 0.323 number as retracted.
  2. Single seed, single hardware run. AIME-25 has only 30 problems, and variance across 32 samples is large, so differences smaller than 0.05 in AIME-25 best@32 should not be read as a real algorithmic effect.
  3. The MaxRL paper trains for 1000 steps; our compute budget allowed 200. The diversity advantages of MaxRL reported in that paper may emerge later than the horizon studied here.
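
To put caveat 2 in rough numbers, the sketch below computes a binomial standard error for each AIME-25 best@32 value in the validation table, treating each of the 30 problems as an independent Bernoulli trial. That is a simplifying assumption: it ignores sampling noise within a problem and any smoothing in the reported estimator, so it is only a gauge of scale.

```python
import math


def binomial_standard_error(p, n_problems):
    """Standard error of a proportion p estimated from n_problems trials."""
    return math.sqrt(p * (1.0 - p) / n_problems)


# AIME-25 best@32 by training step, copied from the validation table.
aime_best = {0: 0.287, 50: 0.257, 100: 0.223, 150: 0.248, 200: 0.170}

for step, p in aime_best.items():
    se = binomial_standard_error(p, n_problems=30)
    print(f"step {step:>3}: best@32 = {p:.3f}, approx. SE = {se:.3f}")
```

For a proportion near 0.25 on 30 problems this gives a standard error of about 0.079, larger than the 0.05 threshold named in the caveat.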

Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

mid = "Sean13/maxrl_curriculum_Qwen3-1.7B-200step"
tok = AutoTokenizer.from_pretrained(mid, trust_remote_code=True)
m = AutoModelForCausalLM.from_pretrained(mid, torch_dtype=torch.bfloat16,
                                         trust_remote_code=True).cuda()
m.eval()

# Plain-text prompt; no chat template is applied.
prompt = ("Solve the equation: 2x + 3 = 11.\n"
          "Please reason step by step, and put your final answer within \\boxed{}.")
inputs = tok(prompt, return_tensors="pt").to(m.device)

# Sampling mirrors the validation settings: T=0.6, top-p=0.95, top-k off
# (top_k=0 disables top-k filtering in generate).
with torch.no_grad():
    out = m.generate(**inputs, max_new_tokens=512, do_sample=True,
                     temperature=0.6, top_p=0.95, top_k=0)
print(tok.decode(out[0], skip_special_tokens=True))
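
Because the prompt asks for the final answer inside \boxed{}, a small helper can pull that answer out of the generated text. The function below is a hypothetical utility, not part of this repo or of transformers; it returns the contents of the last \boxed{...} group and handles nested braces.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never closed


answer = extract_boxed("2x = 8, so x = 4. The answer is \\boxed{4}.")
```

Note that the decoded output in the snippet above includes the prompt, which itself contains an empty \boxed{}. If the model produces no boxed answer, the helper would return an empty string from the prompt, so it is safer to decode only the newly generated tokens before extracting.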

Citation

If you use this model, please cite both Tajwar et al. (the MaxRL objective) and our ablation paper.
