Condition B — MaxRL alone (100 RL steps, no curriculum)

This is the final-step (step 100) policy from Condition B of the 2×2 ablation in Cao 2026, which disentangles the effects of MaxRL and curriculum learning.

| Setting | Value |
|---|---|
| Base | Qwen/Qwen3-1.7B-Base |
| Algorithm | MaxRL (`actor_rollout_ref.algorithm.adv_estimator=maxrl`) |
| Curriculum | OFF (uniform sampling over POLARIS-53K) |
| Steps | 100 |
| Hardware | 4×H200 on mit_preemptable |
| Wall clock | 3 h 40 min |

Validation @ step 100 (32 rollouts/problem, T=0.6, top-p 0.95)

| Metric | Value |
|---|---|
| MATH-500 mean@32 | 0.623 |
| MATH-500 best@32 | 0.898 |
| AIME-25 mean@32 | 0.044 |
| AIME-25 best@32 | 0.276 |
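
As a rough illustration of how the two metric families above might be computed, the sketch below assumes that mean@k is the per-rollout accuracy averaged over problems and that best@k is the fraction of problems with at least one correct rollout. Both definitions are assumptions, not taken from the evaluation code behind this card, and the function names `mean_at_k` and `best_at_k` are hypothetical. The toy data is invented for the example and is unrelated to the reported numbers.

```python
from typing import Sequence


def mean_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Per-problem fraction of correct rollouts, averaged over problems.

    Assumed definition of mean@k (hypothetical helper, not from the card).
    """
    per_problem = [sum(rolls) / len(rolls) for rolls in correct]
    return sum(per_problem) / len(per_problem)


def best_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems with at least one correct rollout.

    Assumed definition of best@k (hypothetical helper, not from the card).
    """
    return sum(any(rolls) for rolls in correct) / len(correct)


# Toy correctness matrix: 3 problems x 4 rollouts each (made-up data).
toy = [
    [True, True, False, True],     # mostly solved
    [False, False, False, True],   # solved once
    [False, False, False, False],  # never solved
]

print(f"mean@4 = {mean_at_k(toy):.3f}")
print(f"best@4 = {best_at_k(toy):.3f}")
```

Under these assumed definitions, mean@k tracks single-sample accuracy while best@k rewards coverage, which is why a policy can gain on one and lose on the other.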

Comparison vs other cells of the 2×2

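| Condition | MATH-500 Pass@1 | MATH-500 Pass@32 | AIME-25 Pass@1 | AIME-25 Pass@32 |
|---|---|---|---|---|
| Base | 0.566 | 0.884 | 0.037 | 0.287 |
| (A) GRPO no-curr | (pending) | (pending) | (pending) | (pending) |
| (B) MaxRL no-curr | 0.623 | 0.898 | 0.044 | 0.276 |
| (C) GRPO + curr | 0.643 | 0.892 | 0.045 | 0.244 |
| (D) MaxRL + curr | 0.624 | 0.886 | 0.043 | 0.223 |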

Note: at the 100-step horizon on the 1.7B model, B (MaxRL alone) beats both C and D on AIME Pass@32 (0.276 vs. 0.244 and 0.223). This suggests that MaxRL's $1/\hat{r}$ weighting is itself the active ingredient for diversity in this regime, and that the explicit Gaussian curriculum sampler adds nothing on top of it (and may slightly hurt).
