# verl Megatron-Core Models
Earlier versions of verl use Megatron-LM 0.4 and workaround huggingface model classes. To take advantage of the latest features and speedups of modern Megatron, we are migrating to Megatron-Core (mcore) and adopting the recommended `GPTModel` class for all language models. With mcore `GPTModel`, we can use the latest features such as context parallel, expert parallel, and `dist_checkpointing`, and we can update mcore with little effort as new features land.
The migration has been successful with the help of the mcore team and the community. What we have done so far:

- update `Megatron` version to `0.11.0`
- migrate `LlamaForCausalLM` and `Qwen2ForCausalLM` to mcore `GPTModel`
- support sequence packing (thd format)
- support `tensor parallel`, `pipeline parallel`, `sequence parallel`, `virtual pipeline parallel`, and `context parallel`
- support the mcore `dist_checkpointing` feature and a basic offline weights-conversion script from huggingface to mcore `dist_checkpointing` format
We are working on the following features:
- support `Qwen2MoeForCausalLM`
- support `MixtralForCausalLM`
- support `DeepseekV3ForCausalLM`
- support expert parallel
Features where we invite community contributions:
- better scripts for offline weights conversion from huggingface to mcore `dist_checkpointing` format
  - conversion of large models with multiple GPUs
  - conversion of large models with a single GPU
- refactor `megatron_checkpoint_manager.py` with the `dist_checkpointing` format
- support llama4
- support qwen2.5-vl
To track the progress of verl mcore integration, please refer to the mcore integration issue.
## How things work now
To engage the community in contributing, here are the key steps in our mcore integration process and features under development.
The huggingface `transformers` library is the de facto standard model zoo, while mcore excels at computation efficiency. The main challenge is converting between the two.
Main steps:

1. modelling the huggingface model with mcore `GPTModel`
   - a. convert the huggingface config to mcore `TransformerConfig`
   - b. init the mcore `GPTModel` with the converted config
   - c. load the huggingface model weights into the `GPTModel`
2. online weight conversion from mcore to huggingface (because the rollout engine `vLLM` uses the huggingface format)
   - a. bridge the gap between mcore and huggingface weight formats and name mappings
   - b. online resharding of the mcore weights for the rollout engine
     - this part is very complicated, composing multiple parallel strategies between mcore and the rollout engine
3. support the mcore features in verl
   - a. support `tensor parallel`, `pipeline parallel`, `sequence parallel`, `virtual pipeline parallel`, and `context parallel`
   - b. support recompute and other mcore speed-up features
4. checkpointing
   - a. support recovering verl training
   - b. support exporting the mcore checkpoint to huggingface format for downstream inference
### Modelling the huggingface model with mcore `GPTModel`
The first step is to convert the huggingface config to a mcore `TransformerConfig` and init the mcore `GPTModel` with the converted config. See the code in `verl/models/mcore/config_converter.py` and `verl/models/mcore/model_initializer.py`. The corresponding model forward code is in `verl/models/mcore/model_forward.py`.
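As a rough sketch of what the config conversion involves: read fields off the huggingface config and fill in the corresponding `TransformerConfig` arguments. The helper below is illustrative, not verl's actual code; the keys on the right follow megatron-core's `TransformerConfig` field names, and the example config is a made-up Llama-style dict.

```python
# Hypothetical sketch of huggingface-config -> mcore TransformerConfig mapping.
# Returns plain kwargs so it runs without megatron-core installed; in real code
# these would be passed to megatron.core.transformer.TransformerConfig(...).

def hf_config_to_mcore_kwargs(hf_config: dict) -> dict:
    return {
        "num_layers": hf_config["num_hidden_layers"],
        "hidden_size": hf_config["hidden_size"],
        "ffn_hidden_size": hf_config["intermediate_size"],
        "num_attention_heads": hf_config["num_attention_heads"],
        # GQA: mcore calls the KV-head count "num_query_groups"; default to MHA
        "num_query_groups": hf_config.get(
            "num_key_value_heads", hf_config["num_attention_heads"]
        ),
        "layernorm_epsilon": hf_config["rms_norm_eps"],
        "add_bias_linear": False,  # Llama-family linears are bias-free
    }

# made-up Llama-7B-like config for illustration
llama_like = {
    "num_hidden_layers": 32,
    "hidden_size": 4096,
    "intermediate_size": 11008,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "rms_norm_eps": 1e-5,
}
kwargs = hf_config_to_mcore_kwargs(llama_like)
print(kwargs["num_query_groups"])  # → 8
```

The real converter handles many more fields (rotary base, vocab size, activation, dtype, parallel sizes); the point is only that the mapping is mostly a key-by-key translation.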
There are two ways of loading the huggingface model weights into the `GPTModel`:

- Runtime loading
  - every rank loads the entire huggingface model weights, then shards and converts them to mcore weights
  - speed is slow and memory consumption is high
  - this way is deprecated and will not support new models
- Offline loading
  - use the offline script to convert the huggingface model weights to mcore weights and save them in mcore `dist_checkpointing` format
  - online loading and sharding are then done automatically by mcore `dist_checkpointing`; the speed is fast and memory consumption is low
  - the offline script is `verl/scripts/converter_hf_to_mcore.py`
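An offline conversion would be invoked roughly like this (the flag names here are assumptions for illustration; check the script's `--help` for the actual interface in your verl version):

```shell
# hypothetical invocation of the offline converter; verify flag names
# against verl/scripts/converter_hf_to_mcore.py before use
python scripts/converter_hf_to_mcore.py \
    --hf_model_path Qwen/Qwen2-7B-Instruct \
    --output_path /ckpts/qwen2-7b-mcore
```

The resulting directory is in `dist_checkpointing` format, so training jobs with any parallel layout can load and shard it automatically.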
### Online weight conversion from mcore to huggingface
See the function `convert_megatron_model_to_transformers_model` in `verl/utils/megatron_utils.py` for the details. It should be refactored for extensibility and better performance.
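One flavor of the gap this function bridges: mcore fuses Q, K, and V into a single `linear_qkv` weight laid out per query group, while huggingface keeps separate `q_proj`/`k_proj`/`v_proj` tensors. The sketch below (numpy, not verl's code) splits a fused weight back apart assuming a `[q_0..q_{h-1}, k, v]` row layout per group; the exact layout can differ across mcore versions, so treat it as illustrative.

```python
import numpy as np

def split_qkv(fused: np.ndarray, num_heads: int, num_groups: int, head_dim: int):
    """Split an mcore-style fused QKV weight into q/k/v (assumed layout:
    each query group stores its q heads, then one k head, then one v head)."""
    heads_per_group = num_heads // num_groups
    group_rows = (heads_per_group + 2) * head_dim  # q rows + k rows + v rows
    q_parts, k_parts, v_parts = [], [], []
    for g in range(num_groups):
        block = fused[g * group_rows : (g + 1) * group_rows]
        q_parts.append(block[: heads_per_group * head_dim])
        k_parts.append(block[heads_per_group * head_dim : (heads_per_group + 1) * head_dim])
        v_parts.append(block[(heads_per_group + 1) * head_dim :])
    return np.concatenate(q_parts), np.concatenate(k_parts), np.concatenate(v_parts)

# tiny shape check: 4 heads, 2 query groups, head_dim 3, hidden size 12
fused = np.arange(24 * 12, dtype=np.float32).reshape(-1, 12)
q, k, v = split_qkv(fused, num_heads=4, num_groups=2, head_dim=3)
print(q.shape, k.shape, v.shape)  # (12, 12) (6, 12) (6, 12)
```

The real conversion additionally renames every parameter (e.g. mcore decoder-layer names to huggingface `model.layers.N.*` names) and, in the online path, reshards across the tensor/pipeline layout before handing weights to the rollout engine.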
### Support the mcore features in verl
Most features of `GPTModel` are supported out of the box in verl by changing the `TransformerConfig`, except those involving parallel strategies, such as expert parallel. Features involving parallel strategies require changes to the online weight conversion (especially the resharding part) and to verl's work dispatching.
### Checkpointing
The existing checkpointing code is in `verl/utils/checkpoint/megatron_checkpoint_manager.py`, and the script to convert a checkpoint to huggingface format is in `verl/scripts/model_merger.py`.
The existing checkpoint format simply saves every rank's weights and optimizer states. It should be refactored to use the `dist_checkpointing` format.
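A toy illustration of why the per-rank scheme is limiting (filenames are hypothetical, not verl's actual layout): each (tp, pp) rank writes its own shard, so resuming is tied to the exact parallel layout used at save time. `dist_checkpointing` replaces this with a single layout-independent checkpoint directory that any layout can load.

```python
import os

def rank_shard_path(root: str, tp_rank: int, pp_rank: int) -> str:
    """Per-rank shard path in a naive scheme: one file per (tp, pp) rank."""
    return os.path.join(root, f"model_tp{tp_rank}_pp{pp_rank}.pt")

# a tp=2 x pp=2 run produces 4 shards; a tp=4 resume cannot reuse them directly
shards = [rank_shard_path("ckpt/step_100", tp, pp)
          for tp in range(2) for pp in range(2)]
print(len(shards))  # → 4
```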
## How to support new models
- make sure the model is supported by vLLM
- model the huggingface model with mcore `GPTModel` (the Pai-Megatron-Patch is a good reference)
  - a. convert the huggingface config to mcore `TransformerConfig`
  - b. init the mcore `GPTModel` with the converted config
  - c. load the huggingface model weights into the `GPTModel`
  - d. for VLMs the interface might be different; it is OK to add a new model class with `GPTModel` as its module
- support offline weights conversion from huggingface to mcore `dist_checkpointing` format
- support online weight conversion from mcore to huggingface
- it is recommended to initialize a vLLM model with the converted mcore weights, then test whether the generated sequence is correct
## How to scale up to larger models like deepseek-v3 or other 100B+ models
The greatest challenge in scaling up to larger models is memory consumption.
The necessary features under development for scaling up are:

- Training engine part
  - expert parallel
- Rollout engine part
  - pipeline parallel
  - expert parallel
  - more efficient and general weight resharding and loading
- Offline weights conversion
  - support weights larger than a single GPU's memory