arxiv:2606.30406

MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

Published on Jun 29

Submitted by Adina Yakefu on Jul 1

Xiaomi MiMo


Authors:

Wenhan Ma,

Fuli Luo

Abstract

Multi-teacher On-Policy Distillation (MOPD) integrates multiple domain capabilities into a single large language model by first training specialized reinforcement learning teachers per domain and then distilling them into the student on its own rollouts. On Qwen3-30B-A3B it outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Modern large language models (LLMs) rely on reinforcement learning during post-training to push specific capabilities, yet integrating multiple capabilities into one model remains hard. Existing methods, such as Off-Policy Finetune and Mix-RL, are either inefficient or lose performance. In this work, we propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for combining the capabilities of multiple domain RL teachers: we first run per-domain specialised RL to obtain a set of domain teachers, then distill these teachers into the student on its own rollouts. This eliminates exposure bias and provides a dense optimization signal. On Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines, inheriting nearly all of each teacher's capability. MOPD also enables parallel, independent development of domain teachers, removing the cross-domain coupling typical of multi-domain post-training. MOPD has been deployed in the post-training of MiMo-V2-Flash, an industrial-scale frontier model, demonstrating its practical value for capability integration in frontier-scale LLMs.
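
The distillation stage described above can be illustrated with a minimal Python sketch. It assumes a per-token reverse KL divergence between the student and the matching domain teacher as the dense signal, which is a common choice for on-policy distillation; the abstract does not state the paper's exact objective. The toy logit functions, the `teachers` dictionary, and the name `mopd_style_loss` are hypothetical and exist only for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def sample_rollout(student_logits_fn, length, rng):
    """Sample tokens from the student itself (the on-policy part)."""
    tokens = []
    for _ in range(length):
        probs = softmax(student_logits_fn(tokens))
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens


def mopd_style_loss(student_logits_fn, teachers, domain, length, rng):
    """Mean per-token reverse KL on a student rollout.

    Assumption: the prompt's domain selects which teacher scores the rollout.
    """
    teacher_logits_fn = teachers[domain]
    tokens = sample_rollout(student_logits_fn, length, rng)
    per_token = []
    for t in range(length):
        prefix = tokens[:t]
        p = softmax(student_logits_fn(prefix))
        q = softmax(teacher_logits_fn(prefix))
        per_token.append(reverse_kl(p, q))  # one signal per token
    return sum(per_token) / length


# Toy stand-ins for real models: each maps a token prefix to 3 logits.
def student(prefix):
    return [0.0, 0.0, 0.0]


teachers = {
    "math": lambda prefix: [2.0, 0.0, 0.0],  # hypothetical math teacher
    "code": lambda prefix: [0.0, 0.0, 2.0],  # hypothetical code teacher
}

math_loss = mopd_style_loss(student, teachers, "math", 5, random.Random(0))
code_loss = mopd_style_loss(student, teachers, "code", 5, random.Random(0))
```

In this sketch the student is scored on prefixes it generated itself, so training and inference see the same distribution, and every token position contributes a loss term instead of a single sequence-level reward. A real system would replace the toy functions with model forward passes and backpropagate the loss through the student only.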


Community

AdinaY

Paper submitter about 6 hours ago

MOPD is a post-training method that combines multiple RL-trained domain teachers into one model via on-policy distillation. It reduces exposure bias, improves efficiency, and outperforms existing multi-domain training methods on Qwen3-30B-A3B. It has also been applied in MiMo-V2-Flash at industrial scale.

