
Guide to Using MTP in SFT/RL Training and Inference

Author: https://github.com/meituan-search

Last updated: 01/30/2026

1. Scope of Support

Currently, RL training based on the MTP architecture can be performed on the mimo-7B-RL, Qwen-next, and DeepSeek series models. The support rules for the training and inference engines are as follows:

2. MTP Training Configuration (Core Parameters)

The MTP training process can be flexibly controlled through the following configurations. All options live under the actor_rollout_ref.model.mtp prefix:

| Configuration Scenario | Core Parameters | Description |
| --- | --- | --- |
| Load MTP Parameters Only | `enable=True` | VRAM usage increases, but the exported parameters include the MTP module and can be used directly for online deployment |
| Full-Parameter MTP Training | `enable=True`<br>`enable_train=True`<br>`mtp_loss_scaling_factor=0.1` | The MTP loss is applied to all model parameters |
| MTP Parameter-Only Training | `enable=True`<br>`enable_train=True`<br>`detach_encoder=True` | Freezes the encoder layers and updates only the MTP module parameters; the MTP loss applies only to the MTP parameters |
| MTP-Accelerated Rollout | 1. vLLM: `enable=True`<br>`enable_rollout=True`<br>`method="mtp"`<br>`num_speculative_tokens=1`<br>2. SGLang: `enable=True`<br>`enable_rollout=True`<br>`speculative_algorithm="EAGLE"`<br>`speculative_num_steps=2`<br>`speculative_eagle_topk=2`<br>`speculative_num_draft_tokens=4` | Uses MTP-based speculative decoding to accelerate inference during the rollout phase |
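For reference, the training scenarios from the table can be written out programmatically. The sketch below reuses the parameter names from the table; the helper and the hydra-style `key=value` override format are illustrative assumptions, not the project's confirmed CLI:

```python
# Hypothetical helper: flatten one of the MTP scenarios from the table
# above into hydra-style "key=value" override strings. The prefix and the
# parameter names come from this guide; the override format is an assumption.
PREFIX = "actor_rollout_ref.model.mtp"

SCENARIOS = {
    "load_only": {"enable": True},
    "full_param_training": {
        "enable": True, "enable_train": True, "mtp_loss_scaling_factor": 0.1,
    },
    "mtp_only_training": {
        "enable": True, "enable_train": True, "detach_encoder": True,
    },
}

def to_overrides(scenario: str) -> list[str]:
    """Flatten one scenario into sorted key=value override strings."""
    return [f"{PREFIX}.{k}={v}" for k, v in sorted(SCENARIOS[scenario].items())]

print(to_overrides("mtp_only_training"))
```

Keeping the scenarios as plain dictionaries makes it easy to diff configurations or switch between them in launch scripts.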

3. Experimental Results

The experiment used the following setup:

  • model = mimo-7B-math
  • max_response_length = 8k

Experiment chart: `fully_async_policy_revenue` (the full curves are available at the accompanying wandb link).

Scenarios with No Significant Effect

The following configurations will not have a noticeable impact on training results:

  1. The base model does not carry MTP parameters;

  2. The base model carries MTP parameters, but the MTP module is not trained;

  3. The base model carries MTP parameters and trains MTP, with mtp_loss_scaling_factor=0;

  4. The base model carries MTP parameters, trains MTP and detaches the encoder, with mtp_loss_scaling_factor=0.1.

Scenarios with Significant Effect

Only the following configuration will have a noticeable impact on training results:

  • The base model carries MTP parameters, MTP Loss applies to all model parameters, and mtp_loss_scaling_factor=0.1.

Recommended Training Method

It is recommended to adopt the `detach_encoder=True` approach for MTP training, since it updates the MTP module without disturbing the main model's training results.
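To illustrate what `detach_encoder=True` does, the toy example below uses a one-dimensional linear "encoder" and MTP head with hand-derived gradients. It is purely illustrative, not this project's implementation: detaching the encoder output zeroes the MTP loss's gradient with respect to the encoder weight while leaving the MTP head's gradient unchanged.

```python
# Toy 1-D model: encoder h = w_enc * x, MTP head y = w_mtp * h,
# squared-error MTP loss L = (y - t)^2. Gradients are written out by hand
# to show the effect of detaching h before the MTP head.
def mtp_grads(w_enc, w_mtp, x, t, detach_encoder):
    h = w_enc * x
    y = w_mtp * h
    err = 2.0 * (y - t)
    g_mtp = err * h                           # dL/dw_mtp: always flows
    # dL/dw_enc flows back through h only when h is NOT detached
    g_enc = 0.0 if detach_encoder else err * w_mtp * x
    return g_enc, g_mtp

# With detach_encoder=True the encoder gradient vanishes,
# but the MTP head still receives a learning signal.
print(mtp_grads(w_enc=0.5, w_mtp=2.0, x=1.0, t=0.0, detach_encoder=True))
```

This is exactly why "MTP Parameter-Only Training" leaves the main model's results untouched: the encoder sees no gradient from the MTP loss.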

4. Performance Notes for MTP in Rollout Inference

The effectiveness of MTP-accelerated Rollout is significantly affected by model size and inference hardware. Key reference information is as follows:

Hardware Tensor Core Performance

| Hardware Model | FP16 Performance (TFLOPS) |
| --- | --- |
| H20 | 148 |
| H800 | 1,671 |
| H200 | 1,979 |

Measured Performance and Recommendations

Taking the mimo-7B model deployed standalone on H20 hardware with SGLang as an example: after enabling MTP speculative decoding, rollout throughput decreases by approximately 50%.

  • Current priority recommendation: Do not enable MTP acceleration during the inference phase for now;

  • Future planning: Further optimization of the speculative logic in the Rollout phase will be conducted to improve throughput performance.
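A back-of-envelope model helps explain why speculative decoding can hurt throughput on compute-limited hardware like the H20. In the sketch below, all costs and the acceptance rate are deliberately pessimistic, purely hypothetical numbers (not measured values): when the draft and verification overheads outweigh the extra accepted tokens, the speedup ratio drops below 1, i.e. plain decoding is faster.

```python
def speculative_speedup(p_accept, c_draft, c_verify):
    """Expected speedup of 1-draft-token speculative decoding vs. plain decoding.

    p_accept: probability the draft token is accepted (0..1)
    c_draft:  cost of the draft forward pass, relative to one plain
              target-model decode step (= 1.0)
    c_verify: cost of the target model's verification step, relative to
              one plain decode step

    All inputs are illustrative assumptions for a rough estimate.
    """
    tokens_per_step = 1.0 + p_accept      # verified token + maybe the draft
    cost_per_step = c_draft + c_verify    # draft forward + verify forward
    return tokens_per_step / cost_per_step

# Pessimistic hypothetical costs give a ratio of 0.5 -- half the plain
# throughput, the same order as the ~50% drop reported above.
print(speculative_speedup(p_accept=0.5, c_draft=1.0, c_verify=2.0))
```

The same formula also shows when speculation pays off: with cheap drafting and high acceptance, the ratio rises above 1.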

5. SFT Training

SFT training with MTP is supported, using the same MTP training configuration as RL training.

An example launch script for SFT can be found at examples/sft/gsm8k/run_mimo_megatron_mtp.sh.
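As background on how MTP interacts with SFT labels, the sketch below is a generic illustration of multi-token prediction (an assumption about the technique in general, not this repository's label-building code): the head at depth `k` predicts the token `k + 1` positions ahead, so each head's target sequence is the input shifted one step further.

```python
def shifted_targets(token_ids, depth):
    """Targets for prediction depth `depth`: position t predicts token t + depth + 1.

    depth=0 is the ordinary next-token head; depth=1 is the first MTP head,
    which predicts two tokens ahead. Trailing positions without a valid
    target are dropped. Generic multi-token-prediction illustration only.
    """
    shift = depth + 1
    return token_ids[shift:]

tokens = [10, 11, 12, 13, 14]
print(shifted_targets(tokens, depth=0))  # next-token targets
print(shifted_targets(tokens, depth=1))  # first MTP head: two tokens ahead
```

Each additional depth shortens the usable sequence by one token, which is one reason the MTP loss is typically down-weighted (e.g. `mtp_loss_scaling_factor=0.1`) relative to the main loss.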

SFT Results

The experiment was conducted with the following setup:

  • model = mimo-7B-math
  • dataset = gsm8k

The results are available at the accompanying wandb link.

The presence of the MTP layer has a limited effect on the main loss. However, when the MTP layer is detached, mtp_loss converges to a higher value.