Final-step (step 100) policy from Condition B (MaxRL, no curriculum) of the 2×2 ablation in Cao 2026, which disentangles the effects of MaxRL and curriculum learning.
| Setting | Value |
|---|---|
| Base | Qwen/Qwen3-1.7B-Base |
| Algorithm | MaxRL (actor_rollout_ref.algorithm.adv_estimator=maxrl) |
| Curriculum | OFF (uniform sampling over POLARIS-53K) |
| Steps | 100 |
| Hardware | 4×H200 on mit_preemptable |
| Wall clock | 3 h 40 min |
| Metric | Value |
|---|---|
| MATH-500 mean@32 | 0.623 |
| MATH-500 best@32 | 0.898 |
| AIME-25 mean@32 | 0.044 |
| AIME-25 best@32 | 0.276 |
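The card does not define the two metrics. A minimal sketch, assuming `mean@32` is the per-sample accuracy averaged over 32 samples per problem and `best@32` is the fraction of problems with at least one correct sample among the 32 (these readings match the P@1 and P@32 columns reported for condition B). The function names and the toy data are hypothetical, and `k = 4` is used only to keep the example small.

```python
from typing import Sequence


def mean_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Per-problem accuracy over the k samples, averaged over problems."""
    per_problem = [sum(row) / len(row) for row in correct]
    return sum(per_problem) / len(per_problem)


def best_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems solved by at least one of the k samples."""
    return sum(1 for row in correct if any(row)) / len(correct)


# Toy grading matrix: 3 problems, 4 samples each (True = correct answer).
toy = [
    [True, True, False, True],     # mostly solved
    [False, False, False, True],   # solved once
    [False, False, False, False],  # never solved
]

print(f"mean@4 = {mean_at_k(toy):.3f}")
print(f"best@4 = {best_at_k(toy):.3f}")
```

Under this reading, a gap between the two numbers (as on AIME-25 above) indicates problems that the policy solves only occasionally.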
| | MATH P@1 | MATH P@32 | AIME P@1 | AIME P@32 |
|---|---|---|---|---|
| Base | 0.566 | 0.884 | 0.037 | 0.287 |
| (A) GRPO no-curr | (pending) | (pending) | (pending) | (pending) |
| (B) MaxRL no-curr | 0.623 | 0.898 | 0.044 | 0.276 |
| (C) GRPO + curr | 0.643 | 0.892 | 0.045 | 0.244 |
| (D) MaxRL + curr | 0.624 | 0.886 | 0.043 | 0.223 |
Note: at the 100-step horizon on the 1.7B model, B (MaxRL alone) beats both C and D on AIME Pass@32 (0.276 vs. 0.244 and 0.223). This suggests that MaxRL's $1/\hat{r}$ weighting alone is the active ingredient for diversity in this regime, and that the explicit Gaussian curriculum sampler adds nothing to it and may slightly hurt it. No trained condition exceeds the base model's 0.287 on this metric.
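To illustrate why a $1/\hat{r}$ weighting could favor rarely solved prompts, here is a minimal sketch under stated assumptions: rewards are binary, $\hat{r}$ is the mean reward within a prompt's rollout group, and the MaxRL-style advantage takes the form $(r_i - \hat{r})/\hat{r}$. The card only mentions the $1/\hat{r}$ factor, so the exact form, the function names, and the GRPO-style baseline (population standard deviation) are assumptions for illustration, not the training code used here.

```python
import math
from typing import List, Sequence


def grpo_style_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantage: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def maxrl_style_advantages(rewards: Sequence[float]) -> List[float]:
    """Assumed MaxRL-style advantage: (r - r_hat) / r_hat.

    When no rollout in the group is correct (r_hat == 0) there is no
    learning signal, so all advantages are set to zero.
    """
    r_hat = sum(rewards) / len(rewards)
    if r_hat == 0.0:
        return [0.0] * len(rewards)
    return [(r - r_hat) / r_hat for r in rewards]


# Hypothetical rollout groups of size 8 with binary rewards.
hard_prompt = [1, 0, 0, 0, 0, 0, 0, 0]  # r_hat = 0.125
easy_prompt = [1, 1, 1, 1, 1, 1, 1, 0]  # r_hat = 0.875

for name, group in [("hard", hard_prompt), ("easy", easy_prompt)]:
    grpo = grpo_style_advantages(group)[0]
    maxrl = maxrl_style_advantages(group)[0]
    print(f"{name}: correct-sample advantage grpo={grpo:.3f} maxrl={maxrl:.3f}")
```

In this sketch the single correct rollout on the hard prompt receives an advantage of 7.0 under the assumed MaxRL form, versus about 0.14 for a correct rollout on the easy prompt, so rare successes are upweighted far more strongly than under standard-deviation normalization.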