arxiv:2512.22238

Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

Published on Dec 23 · Submitted by Byung-Kwan Lee on Dec 29
Abstract

Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to the large size gap between them: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms, which are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
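
The masking-and-restoration schedule is the core curriculum idea in the abstract. As a rough illustration (not the paper's reference code), the sketch below assumes magnitude-based masking as the criterion for "non-dominant" weights and a linear schedule for restoring teacher capacity; the paper's actual criterion, schedule, and choice of masked layers may differ.

```python
import torch

def mask_non_dominant(weight: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Zero all but the top `keep_ratio` fraction of entries by magnitude.

    Magnitude is an assumed proxy for "dominance"; the paper may use
    another criterion.
    """
    if keep_ratio >= 1.0:
        return weight.clone()
    k = max(1, int(weight.numel() * keep_ratio))
    # The k-th largest |w| is the (numel - k + 1)-th smallest.
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return weight * (weight.abs() >= threshold).to(weight.dtype)

def progressive_keep_ratio(step: int, total_steps: int,
                           start: float = 0.3, end: float = 1.0) -> float:
    """Linearly restore teacher capacity from `start` to full over training.

    `start`, `end`, and linearity are placeholder choices, not the
    paper's reported schedule.
    """
    t = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * t

@torch.no_grad()
def masked_teacher_state(teacher: torch.nn.Module, step: int,
                         total_steps: int) -> dict:
    """Masked copy of the teacher's weight matrices at the current step."""
    ratio = progressive_keep_ratio(step, total_steps)
    return {name: mask_non_dominant(p, ratio) if p.dim() >= 2 else p.clone()
            for name, p in teacher.state_dict().items()}
```

At each distillation step one would load `masked_teacher_state(...)` into a frozen teacher copy, so the student first matches a simplified teacher and only later the full one.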
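
The abstract names two reward terms for the offline RL stage: an accuracy reward for response correctness and a distillation reward for how easily a response transfers to the student. One plausible reading, sketched below with hypothetical choices, models the distillation reward as the student's mean log-likelihood of a pre-generated teacher response and combines the two terms with placeholder weights `alpha` and `beta`; the paper's exact definitions are not given on this page.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def distillation_reward(student, sequence_ids: torch.Tensor,
                        prompt_len: int) -> float:
    """Proxy for "ease of transfer": mean student log-likelihood of the
    pre-generated (masked-)teacher response tokens.

    Assumes `student` is a Hugging Face-style causal LM whose forward
    pass returns `.logits`; this proxy is an assumption, not the
    paper's stated formula.
    """
    logits = student(sequence_ids.unsqueeze(0)).logits[0]  # (T, vocab)
    logp = F.log_softmax(logits[:-1], dim=-1)              # position t predicts token t+1
    targets = sequence_ids[1:]
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logp[prompt_len - 1:].mean().item()       # score response tokens only

def combined_reward(is_correct: bool, transfer_logp: float,
                    alpha: float = 1.0, beta: float = 0.5) -> float:
    """Accuracy reward plus distillation reward; `alpha` and `beta`
    are placeholder weights."""
    return alpha * float(is_correct) + beta * transfer_logp
```

Because the responses are pre-generated by the masked teachers, these rewards can be computed without online rollouts, which is where the claimed training-time savings come from.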

Community

Paper submitter

Key Results:

  • Performance Boost: the average over 13 benchmarks improves after applying Masters: Qwen3-VL-8B rises from 75.7% to 80.4% and InternVL3.5-8B from 75.4% to 80.0%.
  • Mid-size Teacher before Large One: progressive teacher-size scaling helps; for InternVL3.5-8B, adding mid-size teachers before the large one lifts the average from 77.5 to 80.0 (+2.5).
  • Training Efficiency: offline RL finishes in “about 2 days” on the described setup, whereas comparable online RL would take “30+ days on 256 A100s” for 1.5M samples.
  • Inference Efficiency: compared to think-answer models (e.g., GLM-4.1V-9B, Keye-VL-1.5-8B, Qwen3-VL-8B-Thinking, InternVL3.5-8B), Masters maintains high performance without the latency of extended multi-step reasoning.

