Condition B — MaxRL alone (100 RL steps, no curriculum)

This is the final-step (step 100) policy from Condition B of the 2×2 ablation in Cao 2026, which disentangles the effects of MaxRL and curriculum learning.


| Setting | Value |
|---|---|
| Base | `Qwen/Qwen3-1.7B-Base` |
| Algorithm | MaxRL (`actor_rollout_ref.algorithm.adv_estimator=maxrl`) |
| Curriculum | OFF (uniform sampling over POLARIS-53K) |
| Steps | 100 |
| Hardware | 4×H200 on mit_preemptable |
| Wall clock | 3 h 40 min |

Validation @ step 100 (32 rollouts/problem, T=0.6, top-p 0.95)

| Metric | Value |
|---|---|
| MATH-500 mean@32 | 0.623 |
| MATH-500 best@32 | 0.898 |
| AIME-25 mean@32 | 0.044 |
| AIME-25 best@32 | 0.276 |
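
The card does not define mean@32 and best@32. A minimal sketch under one common reading: mean@k is the per-problem fraction of correct rollouts averaged over problems, and best@k is the fraction of problems with at least one correct rollout. The function names and the toy data are hypothetical, and the actual evaluation code may compute these differently.

```python
from typing import Sequence


def mean_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Average over problems of the fraction of correct rollouts (assumed definition)."""
    return sum(sum(row) / len(row) for row in correct) / len(correct)


def best_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems with at least one correct rollout (assumed definition)."""
    return sum(any(row) for row in correct) / len(correct)


# Hypothetical toy data: 3 problems x 4 rollouts each (the card uses 32 rollouts).
toy = [
    [True, True, False, True],     # 3 of 4 correct
    [False, False, False, True],   # 1 of 4 correct
    [False, False, False, False],  # never solved
]

print(f"mean@4 = {mean_at_k(toy):.3f}")
print(f"best@4 = {best_at_k(toy):.3f}")
```

Under this reading, the gap between the two numbers indicates how often a problem is solvable by some rollout but not reliably by a single one.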

Comparison vs other cells of the 2×2

| Condition | MATH P@1 | MATH P@32 | AIME P@1 | AIME P@32 |
|---|---|---|---|---|
| Base | 0.566 | 0.884 | 0.037 | 0.287 |
| (A) GRPO no-curr | (pending) | (pending) | (pending) | (pending) |
| (B) MaxRL no-curr | 0.623 | 0.898 | 0.044 | 0.276 |
| (C) GRPO + curr | 0.643 | 0.892 | 0.045 | 0.244 |
| (D) MaxRL + curr | 0.624 | 0.886 | 0.043 | 0.223 |
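
To make the rows easier to compare, the following sketch computes each trained condition's change relative to Base. The values are copied from the comparison table, condition A is left out because its results are pending, and the helper name `deltas_vs_base` is hypothetical.

```python
# Values copied from the comparison table; (A) is omitted because it is pending.
COLUMNS = ["MATH P@1", "MATH P@32", "AIME P@1", "AIME P@32"]
RESULTS = {
    "Base": [0.566, 0.884, 0.037, 0.287],
    "(B) MaxRL no-curr": [0.623, 0.898, 0.044, 0.276],
    "(C) GRPO + curr": [0.643, 0.892, 0.045, 0.244],
    "(D) MaxRL + curr": [0.624, 0.886, 0.043, 0.223],
}


def deltas_vs_base(results, base_key="Base"):
    """Return per-column differences from the base row, rounded to 3 decimals."""
    base = results[base_key]
    return {
        name: [round(value - ref, 3) for value, ref in zip(values, base)]
        for name, values in results.items()
        if name != base_key
    }


for name, row in deltas_vs_base(RESULTS).items():
    print(name, dict(zip(COLUMNS, row)))
```

On these numbers every trained condition falls below Base on AIME P@32, with B losing the least (0.011, versus 0.043 for C and 0.064 for D).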

Note: at the 100-step horizon on the 1.7B model, B (MaxRL alone) beats both C and D on AIME Pass@32 (0.276 vs. 0.244 and 0.223), although it remains slightly below Base (0.287). This suggests that MaxRL's $1/\hat{r}$ weighting alone is the active ingredient for diversity in this regime, and that the explicit Gaussian curriculum sampler does not add to it (and may slightly hurt it).
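
To illustrate what a $1/\hat{r}$ weighting could do, here is a minimal sketch that contrasts it with a GRPO-style standard-deviation normalization on binary rewards. It assumes $\hat{r}$ is the empirical pass rate of a prompt's rollout group and that the weighting divides the mean-centered reward by $\hat{r}$. Both functions are illustrative readings, not the estimator used in the training run.

```python
from statistics import mean, pstdev


def grpo_style_advantages(rewards, eps=1e-6):
    """GRPO-style sketch: center by the group mean, divide by the group std."""
    m, s = mean(rewards), pstdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]


def inverse_pass_rate_advantages(rewards):
    """Assumed 1/r_hat weighting: center by the group mean, divide by r_hat."""
    r_hat = mean(rewards)
    if r_hat == 0:
        # No rollout succeeded, so this group carries no learning signal here.
        return [0.0 for _ in rewards]
    return [(r - r_hat) / r_hat for r in rewards]


# Hypothetical groups of 8 rollouts with binary rewards.
hard = [1.0] + [0.0] * 7   # pass rate 1/8
easy = [1.0] * 7 + [0.0]   # pass rate 7/8

for label, group in [("hard", hard), ("easy", easy)]:
    grpo = grpo_style_advantages(group)[0]
    inv = inverse_pass_rate_advantages(group)[0]
    print(f"{label}: GRPO-style={grpo:.3f}, 1/r_hat={inv:.3f}")
```

In this sketch a correct rollout on the hard prompt receives a much larger weight under the $1/\hat{r}$ scaling than under the standard-deviation scaling, which is one way rare successes could be emphasized without a separate curriculum.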


