MaxRL + Curriculum, Qwen3-1.7B-Base, 200-step retrain
This model is the final-step (step 200) policy from a fresh-from-base re-training of the Condition D ablation in Cao 2026, "Disentangling MaxRL and Curriculum Learning for LLM Post-Training".
It corresponds to the entry "(D) MaxRL+Curriculum (200 steps)" in Table 1 of that paper.
Why a new repo?
An earlier release at
Sean13/maxrl_curriculum_Qwen3-1.7B
was an intermediate (step-150) checkpoint. When reloaded through our verl rollout
pipeline, it evaluated substantially below the base model on both AIME-25 and MATH-500
(AIME mean@32 = 0.004 vs. base 0.037; MATH mean@32 = 0.124 vs. base 0.566), which suggests
either upload-time corruption or a chat-template / format mismatch. We re-trained
from scratch with the same hyperparameters and publish the result here. The original
checkpoint is preserved at the old URL for archival reference; use this repo for any
downstream evaluation.
Training setup
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-1.7B-Base |
| RL framework | verl 0.4.0.dev (FSDP backend) |
| Algorithm | MaxRL (Tajwar et al. 2026, arXiv:2602.02710) |
| Curriculum | Gaussian-weighted sampler, $\mu=0.5$, $\sigma=0.15$, on Polaris-53K with pass_rate from the dataset's built-in difficulty column |
| Train data | POLARIS-Project/Polaris-Dataset-53K (53 291 problems) |
| Eval data | math-ai/aime25, HuggingFaceH4/MATH-500 |
| Total steps | 200 (≈ 1 epoch over the curriculum, batch 64 × 16 rollouts) |
| Learning rate | 1e-6 |
| Hardware | 1 node, 8 × NVIDIA H200, MIT ORCD ou_sloan_gpu |
| Wall clock | 6 h 31 min (incl. val-before-train + 4 mid-train val + final val) |
| Sampling at val | $T=0.6$, top-p $=0.95$, top-k off, $n=32$ |
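
The curriculum row above describes a Gaussian-weighted sampler over per-problem pass rates. A minimal sketch of that idea follows, assuming the weight of a problem is the Gaussian density shape $\exp(-(p-\mu)^2 / 2\sigma^2)$ evaluated at its pass rate $p$ and that problems are drawn with replacement. The function names, the toy pass rates, and the with-replacement choice are illustrative assumptions, not the training code used for this run.

```python
import math
import random

def gaussian_weights(pass_rates, mu=0.5, sigma=0.15):
    """Return sampling probabilities that peak where pass_rate == mu."""
    raw = [math.exp(-0.5 * ((p - mu) / sigma) ** 2) for p in pass_rates]
    total = sum(raw)
    return [w / total for w in raw]

def sample_batch(pass_rates, batch_size, rng):
    """Draw problem indices with replacement (an assumption), weighted by difficulty."""
    probs = gaussian_weights(pass_rates)
    return rng.choices(range(len(pass_rates)), weights=probs, k=batch_size)

# Toy pass rates (hypothetical): always-failed, hard, medium, easy, always-solved.
toy_pass_rates = [0.0, 0.1, 0.5, 0.9, 1.0]
toy_probs = gaussian_weights(toy_pass_rates)
batch = sample_batch(toy_pass_rates, batch_size=64, rng=random.Random(0))
print([round(p, 4) for p in toy_probs])
```

With $\mu=0.5$ and $\sigma=0.15$, problems that are always solved or never solved sit more than three standard deviations from the peak, so they are drawn rarely and most of the batch comes from medium-difficulty problems.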
Evaluation
Validation metrics over the 200-step run (every 50 steps + final):
| Step | MATH-500 mean@32 | MATH-500 best@32 | AIME-25 mean@32 | AIME-25 best@32 |
|---|---|---|---|---|
| 0 (base) | 0.566 | 0.884 | 0.037 | 0.287 |
| 50 | 0.590 | 0.899 | 0.035 | 0.257 |
| 100 | 0.624 | 0.886 | 0.043 | 0.223 |
| 150 | 0.645 | 0.894 | 0.036 | 0.248 |
| 200 | 0.635 | 0.892 | 0.032 | 0.170 |
For comparison, the GRPO+Curriculum (Condition C) run at step 208 reaches MATH mean@32 = 0.665 / best@32 = 0.891 / AIME mean@32 = 0.054 / best@32 = 0.264.
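
For readers unfamiliar with the two metrics, the sketch below shows the plain definitions: mean@k averages per-sample correctness, and best@k counts a problem as solved if any of its k samples is correct. This is an illustration only. The reported AIME-25 best@32 values are not multiples of 1/30, so the evaluation pipeline presumably uses a different estimator (for example a bootstrap over samples); the function names and toy data here are hypothetical.

```python
def mean_at_k(correct):
    """correct[i][j] is True if sample j of problem i is right.

    Returns per-sample accuracy averaged over problems.
    """
    per_problem = [sum(row) / len(row) for row in correct]
    return sum(per_problem) / len(per_problem)

def best_at_k(correct):
    """Fraction of problems with at least one correct sample."""
    return sum(any(row) for row in correct) / len(correct)

# Toy grid: 3 problems, 4 samples each (hypothetical data).
toy = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, True, False],
]
print(mean_at_k(toy), best_at_k(toy))
```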
Honest caveats
- The paper's earlier intermediate (step-150) Condition D number for AIME best@32 (0.323) does not reproduce here (0.248 at step 150 in this run). We attribute that to the warm-start checkpoint format issue noted above and treat the 0.323 number as retracted.
- Single seed, single hardware run; AIME-25 has only 30 problems, so variance is large even with 32 samples per problem. Differences smaller than 0.05 in AIME best@32 should not be read as a real algorithmic effect.
- The MaxRL paper trains for 1000 steps; our compute budget allowed 200. The diversity advantages of MaxRL reported in that paper may emerge later than the horizon studied here.
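
To give a rough sense of scale for the variance caveat, the sketch below computes a binomial standard error for a solve rate measured on 30 problems. It assumes each problem is an independent Bernoulli trial and ignores any bootstrapping or correlation between samples, so it is a back-of-the-envelope check rather than the paper's own uncertainty estimate; the helper name is hypothetical.

```python
import math

def binomial_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)

n_problems = 30  # size of AIME-25
for p in (0.170, 0.248, 0.287):  # AIME-25 best@32 values from the table above
    se = binomial_standard_error(p, n_problems)
    print(f"best@32 = {p:.3f}: standard error ~ {se:.3f}")
```

Under these assumptions the standard error is roughly 0.07 to 0.08 for the best@32 values in the table, which is larger than the 0.05 threshold mentioned in the caveat.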
Inference
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

mid = "Sean13/maxrl_curriculum_Qwen3-1.7B-200step"
tok = AutoTokenizer.from_pretrained(mid, trust_remote_code=True)
m = AutoModelForCausalLM.from_pretrained(mid, torch_dtype=torch.bfloat16,
                                         trust_remote_code=True).cuda()

# Plain-text prompt (no chat template), ending with the boxed-answer instruction.
prompt = ("Solve the equation: 2x + 3 = 11.\n"
          "Please reason step by step, and put your final answer within \\boxed{}.")

# Sampling settings match the validation setup (T = 0.6, top-p = 0.95).
inputs = tok(prompt, return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=512, do_sample=True,
                 temperature=0.6, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=True))
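
Because the prompt asks for the final answer inside `\boxed{}`, downstream scripts usually need to pull that answer out of the generated text. The helper below is a minimal sketch of one way to do it with brace matching; it is not the answer extractor used by the evaluation pipeline, and the function name is hypothetical.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent or unclosed."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # closing brace never found (e.g. generation was truncated)

print(extract_last_boxed("So x = 4. The answer is \\boxed{4}."))
print(extract_last_boxed("\\boxed{\\frac{1}{2}}"))
```

Taking the last `\boxed{}` rather than the first is a design choice for step-by-step outputs, where intermediate results may also be boxed before the final answer.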
Citation
If you use this model, please cite both Tajwar et al. (the MaxRL objective) and our ablation paper.