verl Megatron-Core Models

Earlier versions of verl use Megatron-LM 0.4 with workaround huggingface model classes. To take advantage of the latest features and speedups of modern Megatron, we are migrating to Megatron-Core (mcore) and adopting the recommended GPTModel class for all language models. With mcore GPTModel, we can use the latest features such as context parallel, expert parallel, and dist_checkpointing, and we can pick up new mcore features with little effort in the future.

The migration has been carried out with the help of the mcore team and the community. So far we have:

  1. updated the Megatron version to 0.11.0
  2. migrated LlamaForCausalLM and Qwen2ForCausalLM to mcore GPTModel
  3. supported sequence packing (thd format)
  4. supported tensor parallel, pipeline parallel, sequence parallel, virtual pipeline parallel, and context parallel
  5. supported the mcore dist_checkpointing feature and a basic offline weights conversion script from huggingface to mcore dist_checkpointing format
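The thd (total-tokens, head, dim) sequence-packing format in item 3 can be illustrated with a minimal sketch (illustrative only, not verl's actual implementation): variable-length sequences are concatenated into one flat token stream, and a `cu_seqlens` array records the cumulative boundaries so attention kernels can keep sequences separate.

```python
def pack_sequences(sequences):
    """Pack variable-length token-id lists into one flat stream (thd format).

    Returns the packed tokens plus cu_seqlens, the cumulative sequence
    lengths that mark where each sequence starts and ends.
    """
    packed = []
    cu_seqlens = [0]
    for seq in sequences:
        packed.extend(seq)
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return packed, cu_seqlens


print(pack_sequences([[1, 2, 3], [4, 5], [6]]))
# ([1, 2, 3, 4, 5, 6], [0, 3, 5, 6])
```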

We are working on the following features:

  • support Qwen2MoeForCausalLM
  • support MixtralForCausalLM
  • support DeepseekV3ForCausalLM
  • support expert parallel

Features we invite the community to contribute:

  • better scripts for offline weights conversion from huggingface to mcore dist_checkpointing format
    • conversion of large models with multiple GPUs
    • conversion of large models on a single GPU
  • refactoring megatron_checkpoint_manager.py to use the dist_checkpointing format
  • support llama4
  • support qwen2.5-vl

To track the progress of verl mcore integration, please refer to the mcore integration issue.

How things work now

To engage the community in contributing, here are the key steps in our mcore integration process and features under development.

Huggingface transformers is the de facto standard model zoo, while mcore excels at computational efficiency. The main challenge is converting between the two. The main steps are:

  1. modelling the huggingface model with mcore GPTModel

    • a. convert the huggingface config to mcore TransformerConfig
    • b. init the mcore GPTModel with the converted config
    • c. load the huggingface model weights to the GPTModel
  2. online weight conversion from mcore to huggingface (because the rollout engine, vLLM, uses the huggingface format)

    • a. bridge the gap between mcore and huggingface weights format and name mapping
    • b. online resharding of the mcore weights for the rollout engine
      • this part is very complicated because multiple parallel strategies must be composed between mcore and the rollout engine
  3. support the mcore features in verl

    • a. support tensor parallel, pipeline parallel, sequence parallel, virtual pipeline parallel, context parallel
    • b. support recompute and other mcore speed up features
  4. checkpointing

    • a. support resuming verl training.
    • b. support exporting the mcore checkpoint to huggingface format for downstream inference.
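As a toy illustration of the resharding problem in step 2b (function names here are hypothetical, not verl's API): a column-parallel weight is split along dim 0 across tensor-parallel ranks on the training side, and the rollout engine may need it gathered back whole, or split differently.

```python
def shard_column_parallel(weight_rows, tp_size):
    """Split a weight (as a list of rows) evenly across tp_size ranks."""
    assert len(weight_rows) % tp_size == 0
    per_rank = len(weight_rows) // tp_size
    return [weight_rows[i * per_rank:(i + 1) * per_rank] for i in range(tp_size)]


def gather_column_parallel(shards):
    """Reassemble the full weight by concatenating shards along dim 0."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full
```

Real resharding additionally has to handle row-parallel layers (split along dim 1), fused weights, and pipeline-parallel placement, which is what makes this part complicated.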

Modelling the huggingface model with mcore GPTModel

The first step is to convert huggingface config to mcore TransformerConfig and init the mcore GPTModel with the converted config. See code in verl/models/mcore/config_converter.py and verl/verl/models/mcore/models/model_initializer.py. The corresponding model forward code is in verl/verl/models/mcore/models/model_forward.py.
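A hedged sketch of the config conversion (step 1a): a few Llama/Qwen2-style huggingface config fields and their TransformerConfig counterparts. The real converter in verl/models/mcore/config_converter.py handles many more fields (dtype, normalization, RoPE, parallel sizes); this only shows the core idea.

```python
def hf_config_to_mcore_kwargs(hf_config: dict) -> dict:
    """Map a few huggingface config fields to mcore TransformerConfig kwargs."""
    return {
        "num_layers": hf_config["num_hidden_layers"],
        "hidden_size": hf_config["hidden_size"],
        "ffn_hidden_size": hf_config["intermediate_size"],
        "num_attention_heads": hf_config["num_attention_heads"],
        # grouped-query attention: HF num_key_value_heads -> mcore num_query_groups
        "num_query_groups": hf_config["num_key_value_heads"],
    }
```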

There are two ways of loading the huggingface model weights into the GPTModel:

  1. Runtime loading
    • every rank loads the entire huggingface model weights, then shards and converts them to mcore weights.
    • slow, with high memory consumption.
    • this way is deprecated and will not support new models.
  2. Offline loading
    • use an offline script to convert the huggingface model weights to mcore weights and save them in mcore dist_checkpointing format.
    • online loading and sharding are handled automatically by the mcore dist_checkpointing format; loading is fast and memory consumption is low.
    • the offline script is in verl/scripts/converter_hf_to_mcore.py.

online weight conversion from mcore to huggingface

See function convert_megatron_model_to_transformers_model in verl/utils/megatron_utils.py for the details.

It should be refactored for extensibility and better performance.
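To illustrate what this conversion involves, here is a hypothetical fragment of the name mapping between GPTModel and huggingface Llama-style parameter names (the real code must also split fused weights such as linear_qkv and linear_fc1, which simple renaming cannot express):

```python
import re

# Illustrative subset of an mcore -> huggingface name mapping; not verl's
# actual table.
MCORE_TO_HF = [
    (r"^embedding\.word_embeddings\.weight$", "model.embed_tokens.weight"),
    (r"^decoder\.layers\.(\d+)\.self_attention\.linear_proj\.weight$",
     r"model.layers.\1.self_attn.o_proj.weight"),
    (r"^decoder\.layers\.(\d+)\.mlp\.linear_fc2\.weight$",
     r"model.layers.\1.mlp.down_proj.weight"),
    (r"^decoder\.final_layernorm\.weight$", "model.norm.weight"),
    (r"^output_layer\.weight$", "lm_head.weight"),
]


def mcore_name_to_hf(name: str) -> str:
    """Translate an mcore parameter name to its huggingface counterpart."""
    for pattern, repl in MCORE_TO_HF:
        if re.match(pattern, name):
            return re.sub(pattern, repl, name)
    raise KeyError(f"no mapping for {name}")
```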

support the mcore features in verl

Most of the features of GPTModel are supported out of the box in verl by changing the TransformerConfig, except those concerning parallel strategies, such as expert parallel. Features involving parallel strategies require changes to the online weights conversion (especially the resharding part) and to verl's work dispatching.
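For example, activation recomputation can be enabled purely through TransformerConfig fields, with no model-code changes. The field names below exist in recent Megatron-Core releases, but their exact availability depends on the mcore version.

```python
# Hedged example: TransformerConfig knobs that turn on full activation
# recomputation. These would be merged into the kwargs produced by the
# config converter.
recompute_overrides = {
    "recompute_granularity": "full",  # recompute whole transformer layers
    "recompute_method": "uniform",    # spread recomputed layers evenly
    "recompute_num_layers": 1,        # layers per recompute chunk
}
```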

checkpointing

The existing checkpointing code is in verl/utils/checkpoint/megatron_checkpoint_manager.py. And the script to convert checkpoint to huggingface format is in verl/scripts/model_merger.py.

The existing checkpoint format simply saves every rank's weights and optimizer states. It should be refactored to use the dist_checkpointing format.
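For contrast, here is a sketch of the legacy layout (illustrative, not verl's exact paths): each (tensor-parallel, pipeline-parallel) rank writes its own shard, so the checkpoint can only be reloaded with the same parallel layout, whereas dist_checkpointing stores a rank-agnostic checkpoint that any layout can load.

```python
def legacy_shard_path(step: int, tp_rank: int, pp_rank: int) -> str:
    """Build a per-rank shard path in a Megatron-style legacy layout."""
    return f"iter_{step:07d}/mp_rank_{tp_rank:02d}_{pp_rank:03d}/model_optim_rng.pt"


print(legacy_shard_path(100, 0, 1))
# iter_0000100/mp_rank_00_001/model_optim_rng.pt
```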

How to support new models

  1. make sure the model is supported by vLLM
  2. modelling the huggingface model with mcore GPTModel (the Pai-Megatron-Patch project is a good reference)
    • a. convert the huggingface config to mcore TransformerConfig
    • b. init the mcore GPTModel with the converted config
    • c. load the huggingface model weights to the GPTModel
    • d. for VLMs the interface might be different; it is fine to add a new model class with GPTModel as a submodule.
  3. offline weights conversion from huggingface to mcore dist_checkpointing format
  4. support online weights conversion from mcore to huggingface
    • it is recommended to initialize a vLLM model with the converted mcore weights, then check that the generated sequences are correct.
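Since greedy decoding is deterministic, the correctness check in step 4 can compare token ids directly; a small helper like the one below (hypothetical, not part of verl) reports where the converted model first diverges from the huggingface reference.

```python
def first_divergence(ref_tokens, got_tokens) -> int:
    """Return the index of the first mismatching token, or -1 if identical."""
    for i, (a, b) in enumerate(zip(ref_tokens, got_tokens)):
        if a != b:
            return i
    if len(ref_tokens) != len(got_tokens):
        # one sequence is a strict prefix of the other
        return min(len(ref_tokens), len(got_tokens))
    return -1
```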

How to scale up to larger models like deepseek-v3 or other 100B+ models

The greatest challenge in scaling up to larger models is memory consumption.

The necessary features under development for scaling up are:

  1. Training engine part
    • expert parallel
  2. Rollout engine part
    • pipeline parallel
    • expert parallel
    • more efficient and general weight resharding and loading
  3. Offline weights conversion
    • support weights larger than single-GPU memory