MRT-online (R1-Distill-Qwen-1.5B)

DeepSeek-R1-Distill-Qwen-1.5B fine-tuned with the on-policy variant of Meta Reinforcement Fine-Tuning (MRT). Where the offline variant uses an off-policy prefix and a single end-of-trace progress bonus, MRT-online generates the reasoning trace on-policy, segments it into episodes online, forks short forced-termination branches at each episode boundary, and assigns a per-episode dense progress reward. This is the "branched rollouts from a meta-prover policy" direction left as an open problem in the paper.
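The per-episode reward described above can be sketched as follows. This is a minimal illustration, not the released training code: it assumes progress for an episode is the change in the estimated success rate of forced-termination branches from the previous boundary to the current one, and that a baseline set of branches is forked before any reasoning. The function names and the toy outcomes are hypothetical.

```python
from typing import List, Sequence


def success_rate(branch_outcomes: Sequence[int]) -> float:
    """Fraction of forced-termination branches that reached a correct answer."""
    return sum(branch_outcomes) / len(branch_outcomes)


def episode_progress_rewards(
    baseline_outcomes: Sequence[int],
    boundary_outcomes: List[Sequence[int]],
    alpha: float = 1.0,
) -> List[float]:
    """Dense reward per episode: alpha * (success after - success before).

    baseline_outcomes: 0/1 results of branches forked before any reasoning
        (an assumption of this sketch).
    boundary_outcomes[j]: 0/1 results of the G branches forked after episode j.
    """
    rewards = []
    previous = success_rate(baseline_outcomes)
    for outcomes in boundary_outcomes:
        current = success_rate(outcomes)
        rewards.append(alpha * (current - previous))
        previous = current
    return rewards


# Three episodes with G=4 branches per boundary (toy numbers, not real data).
rewards = episode_progress_rewards(
    baseline_outcomes=[0, 0, 1, 0],
    boundary_outcomes=[[0, 1, 1, 0], [1, 1, 1, 0], [1, 1, 0, 0]],
)
print(rewards)  # [0.25, 0.25, -0.25]
```

In this toy trace the third episode receives a negative reward because the branch success rate drops from 0.75 to 0.5, which is the kind of per-episode credit a single end-of-trace bonus cannot express.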

v0.1 / research variant. Produced by the open-source v0.1 training code on miles. This is an exploratory on-policy variant, not the configuration reported in the paper.

Evaluation

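pass@1 (mean of 64 samples per problem) at a 16K-token budget on AIME 2024, AIME 2025, AMC 2023, MinervaMATH, and MATH500. Avg is the unweighted mean of the five benchmark scores, and gain over base is the difference in Avg relative to the base model: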

| model | AIME24 | AIME25 | AMC23 | Minerva | MATH500 | Avg | gain over base |
|---|---|---|---|---|---|---|---|
| base (R1-Distill-Qwen-1.5B) | 27.34 | 22.86 | 67.89 | 24.94 | 81.71 | 44.95 | |
| GRPO (outcome-reward) | 28.12 | 22.97 | 67.77 | 26.45 | 81.85 | 45.43 | +0.48 |
| MRT-online (this model) | 28.59 | 22.24 | 68.79 | 25.87 | 82.37 | 45.57 | +0.62 |
| MRT-offline (for reference) | 28.75 | 23.59 | 70.86 | 24.96 | 82.61 | 46.16 | +1.20 |
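As a rough guide to how numbers of this kind are formed, the sketch below computes pass@1 as the mean correctness over all samples of each problem, then averages benchmark scores and takes the difference from the base model. The helper names are hypothetical and the evaluation harness itself is not shown; only the scores for the base and MRT-online rows are taken from the table.

```python
from statistics import mean
from typing import Dict, List, Tuple


def pass_at_1(samples_per_problem: List[List[bool]]) -> float:
    """Mean per-problem accuracy, in percent, over all sampled completions."""
    per_problem = [
        mean(1.0 if ok else 0.0 for ok in samples)
        for samples in samples_per_problem
    ]
    return 100.0 * mean(per_problem)


def average_and_gain(
    scores: Dict[str, float], base: Dict[str, float]
) -> Tuple[float, float]:
    """Unweighted mean across benchmarks and its difference from the base mean."""
    avg = mean(scores.values())
    return round(avg, 2), round(avg - mean(base.values()), 2)


# Toy correctness flags: 2 problems, 4 samples each (the card uses 64 samples).
toy_score = pass_at_1([[True, False, True, True], [False, False, False, True]])
print(toy_score)  # 50.0

base = {"AIME24": 27.34, "AIME25": 22.86, "AMC23": 67.89,
        "Minerva": 24.94, "MATH500": 81.71}
mrt_online = {"AIME24": 28.59, "AIME25": 22.24, "AMC23": 68.79,
              "Minerva": 25.87, "MATH500": 82.37}
online_avg, online_gain = average_and_gain(mrt_online, base)
print(online_avg, online_gain)  # 45.57 0.62
```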

Finding: averaged over the five benchmarks, MRT-online gains more over the base model than outcome-reward GRPO (+0.62 vs +0.48) but less than the offline single-scalar MRT (+1.20). The on-policy per-episode signal helps, but with only G=4 termination branches per boundary the per-episode credit estimate has higher variance, which is empirical support for the off-policy single-scalar form the paper adopts. A larger branch count or longer training may narrow the gap.
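To give a sense of how noisy a G=4 estimate can be, the sketch below computes the standard error of a progress estimate formed as the difference of two branch success rates. It assumes each branch is an independent Bernoulli draw and that the before and after estimates are independent. These are simplifying assumptions of this illustration, not an analysis from the paper, and the function names are hypothetical.

```python
import math


def branch_estimate_variance(p: float, num_branches: int) -> float:
    """Variance of a success-rate estimate from num_branches Bernoulli(p) draws."""
    return p * (1.0 - p) / num_branches


def progress_stderr(p_before: float, p_after: float, num_branches: int) -> float:
    """Standard error of (after - before) when both estimates are independent."""
    return math.sqrt(
        branch_estimate_variance(p_before, num_branches)
        + branch_estimate_variance(p_after, num_branches)
    )


# Worst case p = 0.5 on both sides of the boundary.
for g in (4, 16, 64):
    print(g, round(progress_stderr(0.5, 0.5, g), 3))
# 4 0.354
# 16 0.177
# 64 0.088
```

Under these assumptions the noise shrinks with the square root of the branch count, so going from G=4 to G=16 halves the standard error at four times the branching cost.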

Training

  • Base: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B; data: 4,000 NuminaMath problems.
  • On-policy GRPO + per-episode progress reward (α=1.0, ≤6 episodes/rollout, G=4 branches), 248 optimizer steps, 16K budget, temp 0.9.
  • Framework: miles (Megatron-LM + SGLang). Recipe: train/rl/REPRODUCTION.md in CMU-AIRe/MRT.
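The hyperparameters listed above can be collected in one place for reference. The dataclass below is a hypothetical container, not the configuration schema used by miles or the MRT recipe; the field names are invented for illustration, and reading "16K" as 16 × 1024 tokens is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MRTOnlineConfig:
    """Hypothetical summary of the settings listed in this card."""

    base_model: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
    num_train_problems: int = 4000       # NuminaMath problems
    alpha: float = 1.0                   # weight on the progress reward
    max_episodes_per_rollout: int = 6
    branches_per_boundary: int = 4       # G
    optimizer_steps: int = 248
    token_budget: int = 16 * 1024        # "16K"; exact value is an assumption
    temperature: float = 0.9

    def max_branches_per_rollout(self) -> int:
        """Upper bound on forced-termination branches for one rollout,
        assuming one boundary per episode."""
        return self.max_episodes_per_rollout * self.branches_per_boundary


config = MRTOnlineConfig()
print(config.max_branches_per_rollout())  # 24
```

The bound of 24 extra short generations per rollout shows where the added cost of the on-policy variant comes from relative to outcome-reward GRPO.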

Citation

@misc{qu2025optimizingtesttimecomputemeta,
      title={Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning},
      author={Yuxiao Qu and Matthew Y. R. Yang and Amrith Setlur and Lewis Tunstall and Edward Emanuel Beeching and Ruslan Salakhutdinov and Aviral Kumar},
      year={2025}, eprint={2503.07572}, archivePrefix={arXiv}, primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.07572},
}