File size: 6,294 Bytes
7134ce7 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 | # Multimodal Models
ms-swift introduces Megatron's parallelization techniques to accelerate the training of large multimodal models. Currently, it supports CPT/SFT/GRPO/DPO/KTO/RM for models such as Qwen3-VL, Qwen3-Omni, Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, GLM4.5v, Kimi-VL. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md).
For environment setup, please refer to the Megatron-SWIFT [Quick Start guide](./Quick-start.md).
## Dense Model
This section demonstrates fine-tuning the Qwen2.5-VL-7B-Instruct model on the LaTeX-OCR task using two 80GiB A100 GPUs, with both full-parameter fine-tuning and LoRA. The best practices described below can be completed within 10 minutes.
### Full
The full-parameter training script is as follows:
```shell
# 2 * 72GiB; 4.1s/it
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--load_safetensors true \
--save_safetensors true \
--dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \
--load_from_cache_file true \
--tensor_model_parallel_size 2 \
--sequence_parallel true \
--packing true \
--freeze_llm false \
--freeze_vit true \
--freeze_aligner true \
--split_dataset_ratio 0.01 \
--micro_batch_size 1 \
--global_batch_size 4 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-6 \
--max_epochs 1 \
--save megatron_output/Qwen2.5-VL-7B-Instruct \
--save_interval 200 \
--vit_gradient_checkpointing true \
--max_length 2048 \
--num_workers 4 \
--no_save_optim true \
--no_save_rng true \
--dataset_num_proc 8
```
### LoRA
The LoRA training script is as follows:
```shell
# 2 * 23GiB; 2.3s/it
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--load_safetensors true \
--save_safetensors true \
--merge_lora false \
--dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \
--load_from_cache_file true \
--tuner_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--tensor_model_parallel_size 1 \
--sequence_parallel true \
--freeze_llm false \
--freeze_vit true \
--freeze_aligner true \
--packing true \
--split_dataset_ratio 0.01 \
--micro_batch_size 1 \
--global_batch_size 4 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-4 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--max_epochs 1 \
--save megatron_output/Qwen2.5-VL-7B-Instruct \
--save_interval 200 \
--vit_gradient_checkpointing true \
--max_length 2048 \
--num_workers 4 \
--no_save_optim true \
--no_save_rng true \
--dataset_num_proc 8
```
Finally, we use the generated Hugging Face format weights to perform inference on the validation set:
```shell
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--adapters megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx/checkpoint-xxx \
--attn_impl flash_attn \
--stream true \
--load_data_args true \
--temperature 0 \
--max_new_tokens 512
```
The inference results are as follows:
```
[QUERY] Using LaTeX to perform OCR on the image.
[LABELS] \forall x \in X , ( \alpha f ) ( x ) = \alpha f ( x )
[RESPONSE] \forall x \in X , ( \alpha f ) ( x ) = \alpha f ( x )
--------------------------------------------------
[QUERY] Using LaTeX to perform OCR on the image.
[LABELS] \pi \int _ { c } ^ { d } \{ g ( y ) \} ^ { 2 } d y
[RESPONSE] \pi \int _ { c } ^ { d } \{ g ( y ) \} ^ { 2 } d y
--------------------------------------------------
[QUERY] Using LaTeX to perform OCR on the image.
[LABELS] [ \frac 2 3 x ^ { \frac 3 2 } ] _ { 0 } ^ { 1 }
[RESPONSE] [ \frac 2 3 x ^ { \frac 3 2 } ] _ { 0 } ^ { 1 }
```
## MoE Model
Training script:
```bash
# 2 * 43GiB, 8s/it
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--model OpenGVLab/InternVL3_5-30B-A3B \
--load_safetensors true \
--save_safetensors true \
--merge_lora false \
--dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \
--load_from_cache_file true \
--tuner_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--sequence_parallel true \
--freeze_llm false \
--freeze_vit true \
--freeze_aligner true \
--packing true \
--split_dataset_ratio 0.01 \
--expert_model_parallel_size 2 \
--moe_permute_fusion true \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 1e-3 \
--micro_batch_size 1 \
--global_batch_size 4 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-4 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--max_epochs 1 \
--save megatron_output/InternVL3_5-30B-A3B \
--eval_interval 200 \
--save_interval 200 \
--vit_gradient_checkpointing true \
--max_length 2048 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--attention_backend flash
```
After training is completed, we use the generated Hugging Face format weights to perform inference on the validation set:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--adapters megatron_output/InternVL3_5-30B-A3B/vx-xxx/checkpoint-xxx \
--attn_impl flash_attn \
--stream true \
--load_data_args true \
--temperature 0 \
--max_new_tokens 512
```
|